Data Analysis with Pandas-(1)-Getting started with matrices

1. Reading data into NumPy

NumPy is a Python module with many functions for working with data. If you want to do serious work with data in Python, you'll be using NumPy a lot. We'll work through importing NumPy and loading a CSV file.
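As a minimal sketch of the loading step: the rows below are made-up sample data held in memory (a stand-in for a real CSV file), and the column layout is an assumption. Only genfromtxt itself comes from these notes.

```python
import io
import numpy as np

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """Year,WHO region,Country,Type,Liters
1986,Western Pacific,Viet Nam,Wine,0
1986,Americas,Uruguay,Other,0.5
1985,Africa,Algeria,Beer,0.25
"""

# With no dtype given, genfromtxt tries to parse every field as a float.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(raw.shape)
print(raw)  # fields that are not numbers show up as nan
```

With a real file, the file path would be passed in place of the StringIO object.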

2. Fixing the data types

If you looked at the data you read in on the last screen, you may have noticed that it looked very strange. This is because genfromtxt reads the data into a NumPy array, and every element in an array has to be the same data type: everything is a string, or everything is an integer, and so on. By default, NumPy tried to convert all of our data to floats, so every value that isn't a number came out as nan. We'll need to specify the data type when we read our data in so we can avoid that.
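One way to do this, sketched on the same kind of made-up sample data, is to pass a string dtype to genfromtxt. The "U75" width and the choice to skip the header row are assumptions for illustration.

```python
import io
import numpy as np

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """Year,WHO region,Country,Type,Liters
1986,Western Pacific,Viet Nam,Wine,0
1986,Americas,Uruguay,Other,0.5
1985,Africa,Algeria,Beer,0.25
"""

# dtype="U75" keeps every field as a string of up to 75 characters;
# skip_header=1 drops the row of column names.
world_alcohol = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1
)
print(world_alcohol)
```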

3. Indexing the data

Now that we know how to read in a file, let's start pulling values out. Remember how all elements in a matrix have an index? Indexes start at 0, so we can print the item at row 1, column 2 by typing print(world_alcohol[0,1]).
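A small sketch of this kind of indexing, using a hand-built array of made-up rows (the values and column layout are assumptions):

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
])

# Row index 0 is the first row, column index 1 is the second column.
item = world_alcohol[0, 1]
print(item)  # Western Pacific
```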

4. Vectors

When we grab a whole row or column from the matrix, we actually end up with a vector. Just like a matrix is a 2-dimensional array because it has rows and columns, a vector is a 1-dimensional array. Vectors are similar to Python lists in that they can be indexed with only one number. Think of a vector as just a single row, or a single column.
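To illustrate, here is a sketch that pulls one column and one row out of a made-up matrix; a colon in an index position means "everything along this dimension".

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
])

countries = world_alcohol[:, 2]   # the whole third column, as a vector
first_row = world_alcohol[0, :]   # the whole first row, as a vector

# A vector needs only one index.
print(countries[1])  # Uruguay
print(first_row[3])  # Wine
```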

5. Array shape

All arrays, whether they are 1-dimensional (vectors), 2-dimensional (matrices), or even larger, have a number of elements in each dimension. For example, a matrix may have 200 rows and 10 columns. We can use the shape attribute to find these dimensions.
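A quick sketch on a made-up matrix: shape is read without parentheses and gives a tuple with one entry per dimension.

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
])

matrix_shape = world_alcohol.shape        # (rows, columns)
vector_shape = world_alcohol[:, 2].shape  # a vector has a single dimension
print(matrix_shape)  # (3, 5)
print(vector_shape)  # (3,)
```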

6. Boolean elements

We can also use boolean statements on arrays to get truth values. The interesting part about this is that the booleans are computed elementwise.

A comparison such as world_alcohol[:,3] == "Beer" compares each element of the fourth column of world_alcohol with "Beer" and creates a new vector of True/False values.
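Sketched on made-up rows, an elementwise comparison looks like this (the name beer follows the notes; the data is an assumption):

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Beer", "0.5"],
    ["1985", "Africa", "Algeria", "Other", "0.25"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
])

# The comparison runs once per element of the fourth column.
beer = world_alcohol[:, 3] == "Beer"
print(beer)  # [False  True False  True]
```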

7. Subsets of vectors

We can subset vectors based on boolean vectors like the ones we generated in the last screen.

If beer is the boolean vector from the last screen, world_alcohol[:,3][beer] selects only the elements in the fourth column whose value is "Beer". It goes through each position in the fourth column vector (from 0 to the last index) and checks whether the beer vector is True at the same position. If it is True, the element at that position is added to the subset. If it is False, the element is skipped.
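A sketch of this selection on made-up rows. The same boolean vector can also filter a different column, which is shown as an extra illustration.

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Beer", "0.5"],
    ["1985", "Africa", "Algeria", "Other", "0.25"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
])

beer = world_alcohol[:, 3] == "Beer"

# Keep only the positions where beer is True.
beer_types = world_alcohol[:, 3][beer]
beer_countries = world_alcohol[:, 2][beer]
print(beer_types)      # ['Beer' 'Beer']
print(beer_countries)  # ['Uruguay' 'Algeria']
```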

8. Subsets of matrices

We can subset a matrix in the same way that we can subset a vector.

Indexing with world_alcohol[beer,:] selects all of the rows in world_alcohol where the "Type" column equals "Beer". Note that because matrices are indexed using two numbers, we are substituting the boolean vector beer for the first number. We can alter the second number to select different columns.

For example, world_alcohol[beer,1] would select the second column for the rows where the "Type" column equals "Beer".
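Both forms can be sketched on made-up rows as follows (the data and column layout are assumptions):

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Beer", "0.5"],
    ["1985", "Africa", "Algeria", "Other", "0.25"],
    ["1985", "Africa", "Algeria", "Beer", "0.5"],
])

beer = world_alcohol[:, 3] == "Beer"

beer_rows = world_alcohol[beer, :]     # every column of the matching rows
beer_regions = world_alcohol[beer, 1]  # only the second column of those rows
print(beer_rows)
print(beer_regions)  # ['Americas' 'Africa']
```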

9. Subsets with multiple conditions

So now we can find all of the rows that correspond to "Algeria", for example. But what if what we really want is to find all the rows for "Algeria" in "1985"?

We'll have to use multiple conditions to generate our vector.

A combined condition wraps each comparison in parentheses and joins them with the & symbol. Because of order of operations, the parentheses make sure the two component vectors are generated first. Then the two vectors are compared index by index. If both vectors are True at index 1, the resulting vector will be True at index 1. If either vector is False at index 1, the result will be False at index 1.

We can add more than 2 conditions if we want; we just have to put an & symbol between each one. The resulting vector will contain True in the positions corresponding to rows where all conditions are True, and False for rows where any condition is False.
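Here is a sketch of combining two conditions on made-up rows. It assumes the year is in the first column and the country in the third, which the notes do not state.

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
    ["1986", "Africa", "Algeria", "Beer", "0.5"],
    ["1985", "Americas", "Uruguay", "Wine", "1.0"],
])

# Each parenthesised comparison yields its own boolean vector.
is_algeria = world_alcohol[:, 2] == "Algeria"  # [True, True, False]
is_1985 = world_alcohol[:, 0] == "1985"        # [True, False, True]

# & combines the two vectors position by position.
algeria_1985 = (world_alcohol[:, 2] == "Algeria") & (world_alcohol[:, 0] == "1985")
print(algeria_1985)  # [ True False False]
print(world_alcohol[algeria_1985, :])
```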

10. Convert a column to floats

We now know almost everything we need to compute how much alcohol the people in a country drank in a given year! But there are a couple of things we need to work through first. First, we need to convert the "Liters of alcohol drunk" column (the fifth one) to floats. The values are strings now, and summing strings doesn't give a meaningful total. We can use the astype method on the array to do this.
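A sketch of the conversion on a made-up column in which every string is a valid number:

```python
import numpy as np

# Hypothetical liters column, stored as strings.
liters_text = np.array(["0", "0.5", "0.25"])

# astype returns a new array with the requested data type.
liters = liters_text.astype(float)
print(liters)        # [0.   0.5  0.25]
print(liters.dtype)  # float64
```

If any element is not a valid number, astype(float) raises a ValueError, which is the problem the next screens deal with.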

11. Replace values in an array

There are values in our alcohol consumption column that are preventing us from converting the column from strings to floats. In order to fix this, we first have to learn how to replace values. We can replace values in a NumPy array by just assigning to them with the equals sign.

An assignment such as world_alcohol[:,4][world_alcohol[:,4] == '0'] = '10' replaces any item in the alcohol consumption column that equals '0' (remember that the world_alcohol matrix is all string values) with '10'.
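Sketched on made-up rows, the replacement looks like this. Slicing a column gives a view of the original matrix, so assigning through it changes world_alcohol itself.

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0"],
])

# Replace every '0' in the fifth column with '10'.
world_alcohol[:, 4][world_alcohol[:, 4] == "0"] = "10"
print(world_alcohol[:, 4])  # ['10' '0.5' '10']
```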

12. Convert the alcohol consumption column to floats

Now that you know what the bad value is, we can replace it and then convert the column to floats.
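The notes do not say what the bad value is. The sketch below assumes it is an empty string and replaces it with '0' before converting; both choices are assumptions for illustration.

```python
import numpy as np

# Hypothetical rows; the empty string in the last column is the assumed bad value.
world_alcohol = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", ""],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
])

# Step 1: replace the value that cannot be parsed as a number.
is_bad = world_alcohol[:, 4] == ""
world_alcohol[:, 4][is_bad] = "0"

# Step 2: the column now converts cleanly.
alcohol_consumption = world_alcohol[:, 4].astype(float)
print(alcohol_consumption)  # [0.   0.5  0.25]
```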

13. Compute the total alcohol consumption

We can compute the total value of a column using the sum method.
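For example, on a made-up column that has already been converted to floats:

```python
import numpy as np

# Hypothetical liters column, already numeric.
alcohol_consumption = np.array([0.0, 0.5, 0.25, 1.0])

total_alcohol = alcohol_consumption.sum()
print(total_alcohol)  # 1.75
```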

14. Finding how much alcohol a person in a country drank in a year

We can subset a vector with another vector, as we learned earlier. This means that we can find the total alcohol consumed by any given country in any given year now.
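Putting the earlier pieces together, a sketch for one country and one year might look like this. The rows are made up, and the positions of the year and country columns are assumptions.

```python
import numpy as np

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
    ["1985", "Africa", "Algeria", "Wine", "0.5"],
    ["1986", "Africa", "Algeria", "Beer", "1.0"],
    ["1985", "Americas", "Uruguay", "Beer", "2.0"],
])

# Boolean vector marking the rows for Algeria in 1985.
algeria_1985 = (world_alcohol[:, 2] == "Algeria") & (world_alcohol[:, 0] == "1985")

# Convert the liters column, subset it with the boolean vector, then sum.
liters = world_alcohol[:, 4].astype(float)
total = liters[algeria_1985].sum()
print(total)  # 0.75
```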

15. A function to sum yearly alcohol consumption

Now that we know how to find the total alcohol consumption of the average person in a country in a given year, we can make a function out of it. A function will make it easier for us to calculate the alcohol consumption for all countries.
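One possible shape for such a function is sketched below. The name total_consumption, its parameters, and the column positions are all hypothetical; they are not taken from the notes.

```python
import numpy as np

def total_consumption(data, country, year):
    """Sum the liters column for one country and one year (both given as strings)."""
    rows = (data[:, 2] == country) & (data[:, 0] == year)
    return data[rows, 4].astype(float).sum()

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1985", "Africa", "Algeria", "Beer", "0.25"],
    ["1985", "Africa", "Algeria", "Wine", "0.5"],
    ["1986", "Africa", "Algeria", "Beer", "1.0"],
    ["1985", "Americas", "Uruguay", "Beer", "2.0"],
])

print(total_consumption(world_alcohol, "Algeria", "1985"))  # 0.75
print(total_consumption(world_alcohol, "Uruguay", "1985"))  # 2.0
```

When no row matches, the subset is empty and the sum is 0.0.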

16. Finding the country that drinks the least

We can now loop over the keys of our dictionary of per-country totals to find the country with the lowest amount of alcohol consumed per person in 1989.
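A sketch of that loop, assuming the dictionary maps each country name to its 1989 total. The helper total_consumption, the variable names, and the sample rows are hypothetical.

```python
import numpy as np

def total_consumption(data, country, year):
    """Sum the liters column for one country and one year (both given as strings)."""
    rows = (data[:, 2] == country) & (data[:, 0] == year)
    return data[rows, 4].astype(float).sum()

# Hypothetical rows: year, region, country, type, liters (all strings).
world_alcohol = np.array([
    ["1989", "Africa", "Algeria", "Beer", "0.25"],
    ["1989", "Africa", "Algeria", "Wine", "0.5"],
    ["1989", "Americas", "Uruguay", "Beer", "1.0"],
    ["1989", "Americas", "Uruguay", "Wine", "2.0"],
    ["1985", "Americas", "Uruguay", "Wine", "0.25"],
])

# Build the dictionary: one total per distinct country.
totals = {}
for country in np.unique(world_alcohol[:, 2]):
    totals[country] = total_consumption(world_alcohol, country, "1989")

# Loop over the keys, remembering the smallest total seen so far.
lowest_country = None
lowest_total = None
for country in totals:
    if lowest_total is None or totals[country] < lowest_total:
        lowest_country = country
        lowest_total = totals[country]

print(lowest_country, lowest_total)  # Algeria 0.75
```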

Time: 2024-10-18 04:26:14
