Pandas Cookbook -- 08 Data Cleaning

Data Cleaning

Based on the translation by SeanCheney on Jianshu; I have adjusted the formatting and restructured the table of contents to suit my own reading, so it will be easier to look things up later.

import pandas as pd
import numpy as np

Set the maximum number of columns and rows to display

pd.set_option('max_columns', 5, 'max_rows', 5)

1 Wide Format to Long Format

state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
state_fruit


Apple Orange Banana
Texas 12 10 40
Arizona 9 7 12
Florida 0 14 190

1.1 stack

DataFrame.stack(level=-1, dropna=True)

  • Stack the prescribed level(s) from columns to index.
  • Converts column labels into a row-index level.
  • Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:
    • if the columns have a single level, the output is a Series;
    • if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame (a short sketch of the level parameter follows this list).
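
The level parameter is not exercised in the recipes below, so here is a minimal sketch on a hypothetical frame with two column levels:

df = pd.DataFrame([[1, 2, 3, 4]],
                  columns=pd.MultiIndex.from_product([['A', 'B'], ['x', 'y']]))
df.stack(0)   # outer level ('A'/'B') moves into the row index
df.stack()    # default level=-1: the innermost level ('x'/'y') moves instead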

1.1.1 Using stack to move column labels into the row index

The stack method pivots all of the column names into the innermost level of a vertical row index

state_fruit.stack()
Texas    Apple      12
         Orange     10
                  ...
Florida  Orange     14
         Banana    190
Length: 9, dtype: int64

Use reset_index() to turn the result back into a DataFrame

state_fruit_tidy = state_fruit.stack().reset_index()
state_fruit_tidy


level_0 level_1 0
0 Texas Apple 12
1 Texas Orange 10
... ... ... ...
7 Florida Orange 14
8 Florida Banana 190

9 rows × 3 columns

Rename the columns

state_fruit_tidy.columns = ['state', 'fruit', 'weight']
state_fruit_tidy


state fruit weight
0 Texas Apple 12
1 Texas Orange 10
... ... ... ...
7 Florida Orange 14
8 Florida Banana 190

9 rows × 3 columns

You can also use rename_axis to name the levels of the row index

state_fruit.stack().rename_axis(['state', 'fruit'])
state    fruit
Texas    Apple      12
         Orange     10
                  ...
Florida  Orange     14
         Banana    190
Length: 9, dtype: int64

Then call reset_index again, naming the values column directly

state_fruit.stack().rename_axis(['state', 'fruit'])\
    .reset_index(name='weight')


state fruit weight
0 Texas Apple 12
1 Texas Orange 10
... ... ... ...
7 Florida Orange 14
8 Florida Banana 190

9 rows × 3 columns

1.1.2 Stacking several groups of variables at once

That is, after stacking, the column names are split by a rule into two or more columns

movie = pd.read_csv('data/movie.csv')
actor = movie[[
    'movie_title', 'actor_1_name', 'actor_2_name',
    'actor_3_name', 'actor_1_facebook_likes',
    'actor_2_facebook_likes', 'actor_3_facebook_likes']]

Create a helper function to rename the columns: wide_to_long requires the grouped variables to end with the same numeric suffix

def change_col_name(col_name):
    # 'actor_1_name' -> 'actor_1'
    col_name = col_name.replace('_name', '')
    if 'facebook' in col_name:
        # move the numeric suffix to the end:
        # 'actor_1_facebook_likes' -> 'actor_facebook_likes_1'
        fb_idx = col_name.find('facebook')
        col_name = col_name[:5] + col_name[fb_idx - 1:] + col_name[5:fb_idx - 1]
    return col_name
actor2 = actor.rename(columns=change_col_name)
actor2.iloc[:5, :5]


movie_title actor_1 actor_2 actor_3 actor_facebook_likes_1
0 Avatar CCH Pounder Joel David Moore Wes Studi 1000.0
1 Pirates of the Caribbean: At World‘s End Johnny Depp Orlando Bloom Jack Davenport 40000.0
2 Spectre Christoph Waltz Rory Kinnear Stephanie Sigman 11000.0
3 The Dark Knight Rises Tom Hardy Christian Bale Joseph Gordon-Levitt 27000.0
4 Star Wars: Episode VII - The Force Awakens Doug Walker Rob Walker NaN 131.0

Use the wide_to_long function to stack the actor and actor_facebook_likes groups at the same time

stubs = ['actor', 'actor_facebook_likes']
actor2_tidy = pd.wide_to_long(actor2,
    stubnames=stubs,
    i=['movie_title'],
    j='actor_num',
    sep='_')
actor2_tidy.head(10)


actor actor_facebook_likes
movie_title actor_num
Avatar 1 CCH Pounder 1000.0
Pirates of the Caribbean: At World‘s End 1 Johnny Depp 40000.0
... ... ... ...
Avengers: Age of Ultron 1 Chris Hemsworth 26000.0
Harry Potter and the Half-Blood Prince 1 Alan Rickman 25000.0

10 rows × 2 columns
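
wide_to_long defaults to numeric suffixes, which is why the columns had to be renamed above; a suffix regex lets it handle other endings. A minimal sketch on a hypothetical frame:

df = pd.DataFrame({'id': [1], 'a_one': [10], 'a_two': [20]})
pd.wide_to_long(df, stubnames='a', i='id', j='which', sep='_', suffix=r'\w+')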

1.2 melt

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

  • This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
  • This reshapes the DataFrame so that one or more columns serve as identifiers, one column holds the measured variable names, and a final column holds the corresponding values.
  • id_vars : tuple, list, or ndarray, optional
    • Column(s) to use as identifier variables.
  • value_vars : tuple, list, or ndarray, optional
    • Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
  • var_name : scalar
    • Name to use for the 'variable' column. If None it uses frame.columns.name or 'variable'.
  • value_name : scalar, default 'value'
    • Name to use for the 'value' column.
  • col_level : int or string, optional
    • If columns are a MultiIndex then use this level to melt (a short sketch follows this list).
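
col_level is not used in the recipes below; a minimal sketch on a hypothetical frame with two column levels:

cols = pd.MultiIndex.from_tuples([('A', 'x'), ('A', 'y')])
df = pd.DataFrame([[1, 2], [3, 4]], columns=cols)
df.melt(col_level=0)   # the 'variable' column holds only the outer-level labels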

Read the state_fruit2 dataset

state_fruit2 = pd.read_csv('data/state_fruit2.csv')
state_fruit2


State Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190

melt turns the original column names into a variable column and the original values into a value column.

The var_name and value_name parameters rename these two generated columns.

state_fruit2.melt(id_vars=['State'], value_vars=['Apple', 'Orange', 'Banana'])


State variable value
0 Texas Apple 12
1 Arizona Apple 9
... ... ... ...
7 Arizona Banana 12
8 Florida Banana 190

9 rows × 3 columns

Set an arbitrary row index

state_fruit2.index = list('abc')
state_fruit2.index.name = 'letter'
state_fruit2


State Apple Orange Banana
letter
a Texas 12 10 40
b Arizona 9 7 12
c Florida 0 14 190

var_name and value_name rename the newly generated variable and value columns.

var_name labels the column built from value_vars; it defaults to 'variable'.

value_name labels the column holding the original values; it defaults to 'value'.

state_fruit2.melt(id_vars=['State'],
    value_vars=['Apple', 'Orange', 'Banana'],
    var_name='Fruit',
    value_name='Weight')


State Fruit Weight
0 Texas Apple 12
1 Arizona Apple 9
... ... ... ...
7 Arizona Banana 12
8 Florida Banana 190

9 rows × 3 columns

To put every value in one column and the old column labels in another, call melt with no arguments

state_fruit2.melt()


variable value
0 State Texas
1 State Arizona
... ... ...
10 Banana 12
11 Banana 190

12 rows × 2 columns

To designate the identifier variables, just use the id_vars parameter

state_fruit2.melt(id_vars='State')


State variable value
0 Texas Apple 12
1 Arizona Apple 9
... ... ... ...
7 Arizona Banana 12
8 Florida Banana 190

9 rows × 3 columns

2 Long Format to Wide Format

2.1 unstack

  • DataFrame.unstack(level=-1, fill_value=None)
  • level : int, string, or list of these, default -1 (last level)
    • Level(s) of index to unstack; can pass level name.
    • Which row-index level(s) to pivot out into the columns (a short sketch follows this list).
  • fill_value : replace NaN with this value if the unstack produces missing values
    • How to fill any missing values the reshape creates.
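
A minimal sketch of the level parameter, reusing the stacked Series from section 1.1:

s = state_fruit.stack()   # MultiIndex of (state, fruit)
s.unstack()               # default level=-1: fruit labels become columns
s.unstack(0)              # level 0: state labels become columns instead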

Read the college dataset with the school name as the row index, selecting only the undergraduate-population columns

usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
college = pd.read_csv('data/college.csv', index_col='INSTNM', usecols=usecol_func)

Use stack to pivot all of the horizontal column names into a vertical row index

college_stacked = college.stack()
college_stacked.head(18)
INSTNM
Alabama A & M University             UGDS_WHITE    0.0333
                                     UGDS_BLACK    0.9353
                                                    ...
University of Alabama at Birmingham  UGDS_NRA      0.0179
                                     UGDS_UNKN     0.0100
Length: 18, dtype: float64

unstack reverses the operation

college_stacked.unstack().head()


UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.0333 0.9353 ... 0.0059 0.0138
University of Alabama at Birmingham 0.5922 0.2600 ... 0.0179 0.0100
Amridge University 0.2990 0.4192 ... 0.0000 0.2715
University of Alabama in Huntsville 0.6988 0.1255 ... 0.0332 0.0350
Alabama State University 0.0158 0.9208 ... 0.0243 0.0137

5 rows × 9 columns

2.2 pivot

DataFrame.pivot(index=None, columns=None, values=None)

  • Return reshaped DataFrame organized by given index / column values.
  • Returns a reshaped DataFrame built from the given index and columns.
  • index : string or object, optional
    • Column to use to make new frame's index. If None, uses existing index.
    • Which column of the original DataFrame supplies the new frame's index.
  • columns : string or object
    • Column to use to make new frame's columns.
    • Which column of the original DataFrame supplies the new frame's column labels.
  • values : string, object or a list of the previous, optional
    • Column(s) to use for populating new frame's values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns (a short sketch follows this list).
    • Which column of the original DataFrame supplies the values of the new table.
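
A minimal sketch of omitting values, on a hypothetical frame:

df = pd.DataFrame({'idx': ['a', 'a', 'b'], 'col': ['x', 'y', 'x'],
                   'v1': [1, 2, 3], 'v2': [4, 5, 6]})
df.pivot(index='idx', columns='col')   # columns become (v1, x), (v1, y), (v2, x), ...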

Another approach is to melt first, then pivot. Load the data again without naming the row index

college2 = pd.read_csv('data/college.csv', usecols=usecol_func)
college_melted = college2.melt(id_vars='INSTNM', var_name='Race', value_name='Percentage')
college_melted.head()


INSTNM Race Percentage
0 Alabama A & M University UGDS_WHITE 0.0333
1 University of Alabama at Birmingham UGDS_WHITE 0.5922
2 Amridge University UGDS_WHITE 0.2990
3 University of Alabama in Huntsville UGDS_WHITE 0.6988
4 Alabama State University UGDS_WHITE 0.0158

Use pivot to restore the original shape

melted_inv = college_melted.pivot(index='INSTNM', columns='Race', values='Percentage')
melted_inv.head()


Race UGDS_2MOR UGDS_AIAN ... UGDS_UNKN UGDS_WHITE
INSTNM
A & W Healthcare Educators 0.0000 0.0 ... 0.0000 0.0000
A T Still University of Health Sciences NaN NaN ... NaN NaN
ABC Beauty Academy 0.0000 0.0 ... 0.0000 0.0000
ABC Beauty College Inc 0.0000 0.0 ... 0.0000 0.2895
AI Miami International University of Art and Design 0.0018 0.0 ... 0.4644 0.0324

5 rows × 9 columns

3 Pivot Tables

3.1 Using pivot_table

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame

Similar to pivot

  • data : DataFrame
  • values : column to aggregate, optional
  • index : column, Grouper, array, or list of the previous
    • If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
  • columns : column, Grouper, array, or list of the previous
    • If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
    • When the list passed to columns has several elements, the result also gets a multi-level column index.
  • aggfunc : function, list of functions, dict, default numpy.mean
    • If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If dict is passed, the key is column to aggregate and value is function or list of functions (a dict sketch appears after the first example below).
    • Aggregates the selected values.
  • fill_value : scalar, default None
    • Value to replace missing values with.

Read the flights dataset

flights = pd.read_csv('data/flights.csv')
flights.head()


MONTH DAY ... DIVERTED CANCELLED
0 1 1 ... 0 0
1 1 1 ... 0 0
2 1 1 ... 0 0
3 1 1 ... 0 0
4 1 1 ... 0 0

5 rows × 14 columns

Use pivot_table to count the cancelled flights for each airline at each origin airport

fp = flights.pivot_table(index='AIRLINE',
    columns='ORG_AIR',
    values='CANCELLED',
    aggfunc='sum',
    fill_value=0).round(2)
fp.head()


ORG_AIR ATL DEN ... PHX SFO
AIRLINE
AA 3 4 ... 4 2
AS 0 0 ... 0 0
B6 0 0 ... 0 1
DL 28 1 ... 1 2
EV 18 6 ... 0 0

5 rows × 10 columns
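
aggfunc also accepts a dict mapping column names to functions; a brief sketch on the same flights data:

flights.pivot_table(index='AIRLINE',
    aggfunc={'CANCELLED': 'sum', 'DEP_DELAY': 'mean'}).head()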

3.2 Using groupby

A groupby aggregation cannot reproduce this table directly; you first have to group by all of the index and columns variables.

fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
fg.head()
AIRLINE  ORG_AIR
AA       ATL         3
         DEN         4
         DFW        86
         IAH         3
         LAS         3
Name: CANCELLED, dtype: int64

Then unstack the ORG_AIR index level into the column names

fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
fg_unstack.head()


ORG_AIR ATL DEN ... PHX SFO
AIRLINE
AA 3 4 ... 4 2
AS 0 0 ... 0 0
B6 0 0 ... 0 1
DL 28 1 ... 1 2
EV 18 6 ... 0 0

5 rows × 10 columns

3.3 Comparing the two approaches

fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
fp.equals(fg_unstack)
True
fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
    columns=['ORG_AIR', 'CANCELLED'],
    values=['DEP_DELAY', 'DIST'],
    aggfunc=[np.mean, np.sum],
    fill_value=0)
fp2


mean ... sum
DEP_DELAY ... DIST
ORG_AIR ATL ... SFO
CANCELLED 0 1 ... 0 1
AIRLINE MONTH
AA 1 -3.250000 0 ... 33483 0
2 -3.000000 0 ... 32110 2586
... ... ... ... ... ... ...
WN 11 5.932203 0 ... 23235 784
12 15.691589 0 ... 30508 0

149 rows × 80 columns

Reproduce the pivot_table above with groupby and unstack

flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])[['DEP_DELAY', 'DIST']]\
    .agg(['mean', 'sum'])\
    .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0)\
    .swaplevel(0, 1, axis='columns')\
    .head()


mean ... sum
DEP_DELAY ... DIST
ORG_AIR ATL ... SFO
CANCELLED 0 1 ... 0 1
AIRLINE MONTH
AA 1 -3.250000 NaN ... 33483.0 NaN
2 -3.000000 NaN ... 32110.0 2586.0
3 -0.166667 NaN ... 43580.0 NaN
4 0.071429 NaN ... 51054.0 NaN
5 5.777778 NaN ... 40233.0 NaN

5 rows × 80 columns

4 Data Cleaning Tips

Some worked examples and techniques for data analysis

4.1 Renaming index levels to make reshaping easier

Read the college dataset; after grouping, summarize the undergraduate population and SAT math scores

college = pd.read_csv('data/college.csv')
cg = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']]\
    .agg(['count', 'min', 'max']).head(6)
cg


UGDS ... SATMTMID
count min ... min max
STABBR RELAFFIL
AK 0 7 109.0 ... NaN NaN
1 3 27.0 ... 503.0 503.0
... ... ... ... ... ... ...
AR 0 68 18.0 ... 427.0 565.0
1 14 20.0 ... 495.0 600.0

6 rows × 6 columns

Both levels of the row index have names, while the column index does not. Use rename_axis to name the two column-index levels

cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
cg


AGG_COLS UGDS ... SATMTMID
AGG_FUNCS count min ... min max
STABBR RELAFFIL
AK 0 7 109.0 ... NaN NaN
1 3 27.0 ... 503.0 503.0
... ... ... ... ... ... ...
AR 0 68 18.0 ... 427.0 565.0
1 14 20.0 ... 495.0 600.0

6 rows × 6 columns

Move the AGG_FUNCS level into the row index

cg.stack('AGG_FUNCS').head()


AGG_COLS UGDS SATMTMID
STABBR RELAFFIL AGG_FUNCS
AK 0 count 7.0 0.0
min 109.0 NaN
max 12865.0 NaN
1 count 3.0 1.0
min 27.0 503.0

By default, stack places the column level in the innermost position of the row index; use swaplevel to change the ordering

cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()


AGG_COLS UGDS SATMTMID
AGG_FUNCS RELAFFIL STABBR
count 0 AK 7.0 0.0
min 0 AK 109.0 NaN
max 0 AK 12865.0 NaN
count 1 AK 3.0 1.0
min 1 AK 27.0 503.0

Building on that, sort the index as well

cg.stack('AGG_FUNCS')\
    .swaplevel('AGG_FUNCS', 'STABBR', axis='index')\
    .sort_index(level='RELAFFIL', axis='index')\
    .sort_index(level='AGG_COLS', axis='columns').head(6)


AGG_COLS SATMTMID UGDS
AGG_FUNCS RELAFFIL STABBR
count 0 AK 0.0 7.0
AL 13.0 71.0
... ... ... ... ...
min 0 AL 420.0 12.0
AR 427.0 18.0

6 rows × 2 columns

Stack some levels while unstacking others

cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])


AGG_COLS UGDS ... SATMTMID
RELAFFIL 0 1 ... 0 1
STABBR AK AK ... AR AR
AGG_FUNCS
count 7.0 3.0 ... 9.0 7.0
min 109.0 27.0 ... 427.0 495.0
max 12865.0 275.0 ... 565.0 600.0

3 rows × 12 columns

Stacking every column level returns a Series

cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
STABBR  RELAFFIL  AGG_FUNCS  AGG_COLS
AK      0         count      UGDS         7.0
                             SATMTMID     0.0
                                         ...
AL      0         count      UGDS        71.0
                             SATMTMID    13.0
Length: 12, dtype: float64

Remove the names from every level of the row and column indexes

cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')


UGDS ... SATMTMID
count min ... min max
AK 0 7 109.0 ... NaN NaN
1 3 27.0 ... 503.0 503.0
... ... ... ... ... ... ...
AR 0 68 18.0 ... 427.0 565.0
1 14 20.0 ... 495.0 600.0

6 rows × 6 columns

4.2 Cleaning when multiple variables are stored as column names

Read the weightlifting dataset

weightlifting = pd.read_csv('data/weightlifting_men.csv')
weightlifting


Weight Category M35 35-39 ... M75 75-79 M80 80+
0 56 137 ... 62 55
1 62 152 ... 67 57
... ... ... ... ... ...
6 105 210 ... 95 80
7 105+ 217 ... 100 85

8 rows × 11 columns

Use melt to put the sex_age variables into a single column

wl_melt = weightlifting.melt(id_vars='Weight Category',
    var_name='sex_age',
    value_name='Qual Total')
wl_melt.head()


Weight Category sex_age Qual Total
0 56 M35 35-39 137
1 62 M35 35-39 152
2 69 M35 35-39 167
3 77 M35 35-39 182
4 85 M35 35-39 192

Use str.split to break the sex_age column into two columns

sex_age = wl_melt['sex_age'].str.split(expand=True)  # splits on whitespace by default
sex_age.head()


0 1
0 M35 35-39
1 M35 35-39
2 M35 35-39
3 M35 35-39
4 M35 35-39

sex_age.columns = ['Sex', 'Age Group']
sex_age.head()


Sex Age Group
0 M35 35-39
1 M35 35-39
2 M35 35-39
3 M35 35-39
4 M35 35-39

Keep only the leading M of the string

sex_age['Sex'] = sex_age['Sex'].str[0]
sex_age.head()


Sex Age Group
0 M 35-39
1 M 35-39
2 M 35-39
3 M 35-39
4 M 35-39

Use concat to join sex_age with wl_cat_total

wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
wl_tidy.head()


Sex Age Group Weight Category Qual Total
0 M 35-39 56 137
1 M 35-39 62 152
2 M 35-39 69 167
3 M 35-39 77 182
4 M 35-39 85 192

The same result can also be achieved like this

cols = ['Weight Category', 'Qual Total']
sex_age[cols] = wl_melt[cols]

You can also use assign to add the new columns dynamically

# capture the age range, e.g. '35-39' or '80+'
age_group = wl_melt.sex_age.str.extract(r'(\d{2}[-+](?:\d{2})?)', expand=False)
sex = wl_melt.sex_age.str[0]
new_cols = {'Sex': sex, 'Age Group': age_group}
wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
wl_tidy2.head()


Weight Category Qual Total Sex Age Group
0 56 137 M 35-39
1 62 152 M 35-39
2 69 167 M 35-39
3 77 182 M 35-39
4 85 192 M 35-39

4.3 Cleaning when multiple variables are stored as column values

Read the restaurant_inspections dataset, parsing the Date column as datetime64

inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
inspections.head(10)


Name Date Info Value
0 E & E Grill House 2017-08-08 Borough MANHATTAN
1 E & E Grill House 2017-08-08 Cuisine American
... ... ... ... ...
8 PIZZA WAGON 2017-04-12 Grade A
9 PIZZA WAGON 2017-04-12 Score 10.0

10 rows × 4 columns

4.3.1 The set_index/unstack approach

inspections.set_index(['Name', 'Date', 'Info']).unstack('Info').head()


Value
Info Borough Cuisine Description Grade Score
Name Date
3 STAR JUICE CENTER 2017-05-10 BROOKLYN Juice, Smoothies, Fruit Salads Facility not vermin proof. Harborage or condit... A 12.0
A & L PIZZA RESTAURANT 2017-08-22 BROOKLYN Pizza Facility not vermin proof. Harborage or condit... A 9.0
AKSARAY TURKISH CAFE AND RESTAURANT 2017-07-25 BROOKLYN Turkish Plumbing not properly installed or maintained;... A 13.0
ANTOJITOS DELI FOOD 2017-06-01 BROOKLYN Latin (Cuban, Dominican, Puerto Rican, South &... Live roaches present in facility‘s food and/or... A 10.0
BANGIA 2017-06-16 MANHATTAN Korean Covered garbage receptacle not provided or ina... A 9.0

Use reset_index so the former row-index levels align with the column index

insp_tidy = inspections.set_index(['Name', 'Date', 'Info'])\
    .unstack('Info')\
    .reset_index(col_level=-1)
insp_tidy.head()


... Value
Info Name Date ... Grade Score
0 3 STAR JUICE CENTER 2017-05-10 ... A 12.0
1 A & L PIZZA RESTAURANT 2017-08-22 ... A 9.0
2 AKSARAY TURKISH CAFE AND RESTAURANT 2017-07-25 ... A 13.0
3 ANTOJITOS DELI FOOD 2017-06-01 ... A 10.0
4 BANGIA 2017-06-16 ... A 9.0

5 rows × 7 columns

Drop the outermost level of the column index and set the remaining level's name to None

insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
insp_tidy.head()


Name Date ... Grade Score
0 3 STAR JUICE CENTER 2017-05-10 ... A 12.0
1 A & L PIZZA RESTAURANT 2017-08-22 ... A 9.0
2 AKSARAY TURKISH CAFE AND RESTAURANT 2017-07-25 ... A 13.0
3 ANTOJITOS DELI FOOD 2017-06-01 ... A 10.0
4 BANGIA 2017-06-16 ... A 9.0

5 rows × 7 columns

4.3.2 Using pivot_table

pivot_table needs an aggregation function so that each cell reduces to a single value

inspections.pivot_table(index=['Name', 'Date'],
         columns='Info',
         values='Value',
         aggfunc='first')\
    .reset_index()\
    .rename_axis(None, axis='columns')


Name Date ... Grade Score
0 3 STAR JUICE CENTER 2017-05-10 ... A 12.0
1 A & L PIZZA RESTAURANT 2017-08-22 ... A 9.0
... ... ... ... ... ...
98 WANG MANDOO HOUSE 2017-08-29 ... A 12.0
99 XIAOYAN YABO INC 2017-08-29 ... Z 49.0

100 rows × 7 columns

# inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
# Running pivot here raises an error: with no aggregation function, the
# ['Name', 'Date'] index combined with columns='Info' can map to multiple values
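
One way to check in advance whether pivot could succeed is to test that each index/column combination is unique (a sketch):

inspections.duplicated(subset=['Name', 'Date', 'Info']).any()  # True would mean pivot is ambiguous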

4.4 Cleaning when two or more values are stored in one cell

Read the texas_cities dataset

cities = pd.read_csv('data/texas_cities.csv')
cities


City Geolocation
0 Houston 29.7604° N, 95.3698° W
1 Dallas 32.7767° N, 96.7970° W
2 Austin 30.2672° N, 97.7431° W

Split Geolocation into four separate columns

geolocations = cities.Geolocation.str.split(pat='. ', expand=True)  # '.' is a regex wildcard matching '°' and ','
geolocations.columns = ['latitude', 'latitude direction', 'longitude', 'longitude direction']
geolocations


latitude latitude direction longitude longitude direction
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W

Convert the data types

geolocations = geolocations.astype({'latitude': 'float', 'longitude': 'float'})
geolocations.dtypes
latitude               float64
latitude direction      object
longitude              float64
longitude direction     object
dtype: object

Join the new columns back to the original City column

cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
cities_tidy


City latitude latitude direction longitude longitude direction
0 Houston 29.7604 N 95.3698 W
1 Dallas 32.7767 N 96.7970 W
2 Austin 30.2672 N 97.7431 W

The to_numeric function can automatically convert each column to integer or float

temp = geolocations.apply(pd.to_numeric, errors='ignore')
temp


latitude latitude direction longitude longitude direction
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W
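
Note that errors='ignore' silently leaves unconvertible columns alone, and recent pandas releases deprecate it; errors='coerce' instead turns unparseable entries into NaN. A small sketch:

pd.to_numeric(pd.Series(['29.7604', 'N']), errors='coerce')  # 'N' becomes NaN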

A | in the pattern lets you split on several delimiters at once

cities.Geolocation.str.split(pat='° |, ', expand=True)


0 1 2 3
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W

A more precise approach is extraction with a regular expression

cities.Geolocation.str.extract(r'([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)


0 1 2 3
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W
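
With named capture groups, str.extract labels the resulting columns directly; a sketch:

pattern = r'(?P<latitude>[0-9.]+). (?P<lat_dir>N|S), (?P<longitude>[0-9.]+). (?P<lon_dir>E|W)'
cities.Geolocation.str.extract(pattern, expand=True)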

4.5 Cleaning when multiple variables are stored as column names and column values

Read the sensors dataset

sensors = pd.read_csv('data/sensors.csv')
sensors


Group Property ... 2015 2016
0 A Pressure ... 973 870
1 A Temperature ... 1036 1042
... ... ... ... ... ...
4 B Temperature ... 1002 1013
5 B Flow ... 824 873

6 rows × 7 columns

Tidy the data with melt

sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)


Group Property Year value
0 A Pressure 2012 928
1 A Temperature 2012 1026
... ... ... ... ...
4 B Temperature 2012 1008
5 B Flow 2012 887

6 rows × 4 columns

Use pivot_table to turn the Property values into new column names

sensors.melt(id_vars=['Group', 'Property'], var_name='Year')\
    .pivot_table(index=['Group', 'Year'], columns='Property', values='value')\
    .reset_index()\
    .rename_axis(None, axis='columns')


Group Year Flow Pressure Temperature
0 A 2012 819 928 1026
1 A 2013 806 873 1038
... ... ... ... ... ...
8 B 2015 824 806 1002
9 B 2016 873 942 1013

10 rows × 5 columns

The same result with stack and unstack

sensors.set_index(['Group', 'Property'])\
    .stack()\
    .unstack('Property')\
    .rename_axis(['Group', 'Year'], axis='index')\
    .rename_axis(None, axis='columns')\
    .reset_index()


Group Year Flow Pressure Temperature
0 A 2012 819 928 1026
1 A 2013 806 873 1038
... ... ... ... ... ...
8 B 2015 824 806 1002
9 B 2016 873 942 1013

10 rows × 5 columns

4.6 Cleaning when multiple observational units are stored in the same table

That is, splitting one table into several tables

Read the movie_altered dataset

movie = pd.read_csv('data/movie_altered.csv')
movie.head()


title rating ... actor_fb_likes_2 actor_fb_likes_3
0 Avatar PG-13 ... 936.0 855.0
1 Pirates of the Caribbean: At World‘s End PG-13 ... 5000.0 1000.0
2 Spectre PG-13 ... 393.0 161.0
3 The Dark Knight Rises PG-13 ... 23000.0 23000.0
4 Star Wars: Episode VII - The Force Awakens NaN ... 12.0 NaN

5 rows × 12 columns

Insert a new column to identify each movie

movie.insert(0, 'id', np.arange(len(movie)))

Use wide_to_long to put all the actors in one column and all the Facebook likes in another

stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
movie_long = pd.wide_to_long(movie, stubnames=stubnames, i='id', j='num', sep='_').reset_index()
movie_long['num'] = movie_long['num'].astype(int)
movie_long.head(9)


id num ... actor actor_fb_likes
0 0 1 ... CCH Pounder 1000.0
1 0 2 ... Joel David Moore 936.0
... ... ... ... ... ...
7 2 2 ... Rory Kinnear 393.0
8 2 3 ... Stephanie Sigman 161.0

9 rows × 10 columns

movie.columns
Index(['id', 'title', 'rating', 'year', 'duration', 'director_1',
       'director_fb_likes_1', 'actor_1', 'actor_2', 'actor_3',
       'actor_fb_likes_1', 'actor_fb_likes_2', 'actor_fb_likes_3'],
      dtype='object')
movie_long.columns
Index(['id', 'num', 'year', 'duration', 'rating', 'title', 'director',
       'director_fb_likes', 'actor', 'actor_fb_likes'],
      dtype='object')

Break the data up into several smaller tables

movie_table = movie_long[['id', 'title', 'year', 'duration', 'rating']]
director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]

Do some deduplication and remove missing values

movie_table = movie_table.drop_duplicates().reset_index(drop=True)
director_table = director_table.dropna().reset_index(drop=True)
actor_table = actor_table.dropna().reset_index(drop=True)
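
As a sanity check, the normalized tables can be glued back together with merge; a sketch that, up to row order and NaN handling, should reproduce movie_long:

people = director_table.merge(actor_table, on=['id', 'num'], how='outer')
check = movie_table.merge(people, on='id', how='left')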

Source: https://www.cnblogs.com/shiyushiyu/p/9800795.html
