Topics:
1. Merging datasets
2. Reshaping and pivoting
3. Data transformation (to be continued)
Solutions:
Merging datasets
(1) Database-style DataFrame merges
- pandas' merge function connects rows using one or more keys
- If no key column is specified, merge joins on the overlapping column names
In [3]: df1 = pd.DataFrame({'key': ['b','b','a','c','a','a','b'],
   ...:                     'data1': range(7)})

In [4]: df1
Out[4]:
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b

[7 rows x 2 columns]

In [5]: df2 = pd.DataFrame({'key': ['a','b','d'],
   ...:                     'data2': range(3)})

In [6]: df2
Out[6]:
   data2 key
0      0   a
1      1   b
2      2   d

[3 rows x 2 columns]

In [7]: pd.merge(df1, df2)
Out[7]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

[6 rows x 3 columns]

In [8]: pd.merge(df1, df2, on='key')
Out[8]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

[6 rows x 3 columns]
- If the key columns of the two objects have different names, they can be specified separately and then merged
In [10]: df3 = pd.DataFrame({'lkey': ['b','b','a','c','a','a','b'],
   ....:                     'data1': range(7)})

In [11]: df4 = pd.DataFrame({'rkey': ['a','b','d'], 'data2': range(3)})

In [12]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[12]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

[6 rows x 4 columns]
- merge defaults to an inner join (the intersection of the keys); the other options are 'left', 'right', and 'outer'
- An outer join takes the union of the keys
In [13]: pd.merge(df1, df2, how='outer')
Out[13]:
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
6      3   c    NaN
7    NaN   d      2

[8 rows x 3 columns]
- Many-to-many merges
In [15]: df1 = pd.DataFrame({'key': ['b','b','a','c','a','b'],
   ....:                     'data1': range(6)})

In [16]: df2 = pd.DataFrame({'key': ['a','b','a','b','d'],
   ....:                     'data2': range(5)})

In [17]: pd.merge(df1, df2, on='key', how='left')
Out[17]:
    data1 key  data2
0       0   b      1
1       0   b      3
2       1   b      1
3       1   b      3
4       5   b      1
5       5   b      3
6       2   a      0
7       2   a      2
8       4   a      0
9       4   a      2
10      3   c    NaN

[11 rows x 3 columns]

In [18]: pd.merge(df1, df2, on='key', how='right')
Out[18]:
    data1 key  data2
0       0   b      1
1       1   b      1
2       5   b      1
3       0   b      3
4       1   b      3
5       5   b      3
6       2   a      0
7       4   a      0
8       2   a      2
9       4   a      2
10    NaN   d      4

[11 rows x 3 columns]

In [19]: pd.merge(df1, df2, on='key', how='inner')
Out[19]:
   data1 key  data2
0      0   b      1
1      0   b      3
2      1   b      1
3      1   b      3
4      5   b      1
5      5   b      3
6      2   a      0
7      2   a      2
8      4   a      0
9      4   a      2

[10 rows x 3 columns]

In [21]: pd.merge(df1, df2, on='key', how='outer')
Out[21]:
    data1 key  data2
0       0   b      1
1       0   b      3
2       1   b      1
3       1   b      3
4       5   b      1
5       5   b      3
6       2   a      0
7       2   a      2
8       4   a      0
9       4   a      2
10      3   c    NaN
11    NaN   d      4

[12 rows x 3 columns]
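The row counts in a many-to-many merge are the products of the per-key counts. A minimal sketch with modern pandas, using `indicator=True` (a hypothetical addition, not in the original post) to mark where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

# An outer merge keeps every key; indicator=True adds a '_merge' column
# saying whether each row matched on the left, the right, or both sides.
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# 'b': 3 left rows x 2 right rows = 6; 'a': 2 x 2 = 4;
# 'c' and 'd' contribute one unmatched row each -> 12 rows total.
print(merged['_merge'].value_counts())
```

The `_merge` column makes it easy to audit which keys failed to match after a large join.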
- To merge on multiple keys, pass a list of column names
In [27]: left = pd.DataFrame({'key1': ['foo','foo','bar'],
   ....:                      'key2': ['one','two','one'],
   ....:                      'key3': [1,2,3]})

In [28]: right = pd.DataFrame({'key1': ['foo','foo','foo','bar'],
   ....:                       'key2': ['one','one','one','two'],
   ....:                       'rval': [4,5,6,7]})

In [29]: pd.merge(left, right, on=['key1','key2'], how='outer')
Out[29]:
  key1 key2  key3  rval
0  foo  one     1     4
1  foo  one     1     5
2  foo  one     1     6
3  foo  two     2   NaN
4  bar  one     3   NaN
5  bar  two   NaN     7

[6 rows x 4 columns]
- merge's suffixes option specifies strings to append to the overlapping column names of the left and right DataFrame objects
- It is used to tell apart identically named columns that are not part of the merge key
In [30]: pd.merge(left, right, on='key1')
Out[30]:
  key1 key2_x  key3 key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    one     1    one     6
3  foo    two     2    one     4
4  foo    two     2    one     5
5  foo    two     2    one     6
6  bar    one     3    two     7

[7 rows x 5 columns]

In [31]: pd.merge(left, right, on='key1', suffixes=('_left','_right'))
Out[31]:
  key1 key2_left  key3 key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       one     1        one     6
3  foo       two     2        one     4
4  foo       two     2        one     5
5  foo       two     2        one     6
6  bar       one     3        two     7

[7 rows x 5 columns]
- Parameters of the merge function
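The main merge parameters can be exercised together in one short sketch. The `validate` argument (available in newer pandas versions; an addition of mine, not from the original post) raises if the merge is not of the expected cardinality:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'val': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [4, 5, 6]})

# how: join type; on: key column(s); suffixes: disambiguate overlapping
# non-key columns; validate: assert the expected key relationship
# ('one_to_many' here: left keys must be unique).
out = pd.merge(left, right, on='key', how='left',
               suffixes=('_l', '_r'), validate='one_to_many')

# 'a' matches two right rows, 'b' one, 'c' none -> 4 rows, one NaN in val_r.
print(out)
```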
(2) Merging on index
- A DataFrame's index can also be used as the merge key
- Pass left_index=True and/or right_index=True
In [37]: left1 = pd.DataFrame({'key': ['a','b','a','a','b','c'],
   ....:                       'value': range(6)})

In [38]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a','b'])

In [39]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[39]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

[5 rows x 3 columns]

In [40]: right1
Out[40]:
   group_val
a        3.5
b        7.0

[2 rows x 1 columns]

In [41]: left1
Out[41]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

[6 rows x 2 columns]
- For a hierarchical index, the multiple columns used as the merge key must be given as a list
In [48]: lefth = pd.DataFrame({'key1': ['Ohio','Ohio','Ohio','Nevada','Nevada'],
   ....:                       'key2': [2000,2001,2002,2001,2002],
   ....:                       'data': np.arange(5.)})

In [49]: righth = pd.DataFrame(np.arange(12).reshape((6,2)),
   ....:                       index=[['Nevada','Nevada','Ohio','Ohio','Ohio','Ohio'],
   ....:                              [2001,2000,2000,2000,2001,2002]],
   ....:                       columns=['event1','event2'])

In [50]: lefth
Out[50]:
   data    key1  key2
0     0    Ohio  2000
1     1    Ohio  2001
2     2    Ohio  2002
3     3  Nevada  2001
4     4  Nevada  2002

[5 rows x 3 columns]

In [52]: righth
Out[52]:
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11

[6 rows x 2 columns]

In [53]: pd.merge(lefth, righth, left_on=['key1','key2'], right_index=True)
Out[53]:
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1

[5 rows x 5 columns]

In [54]: pd.merge(lefth, righth, left_on=['key1','key2'], right_index=True, how='outer')
Out[54]:
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1
4     4  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000       2       3

[7 rows x 5 columns]
- Using the indexes of both sides of the merge at the same time
In [55]: left2 = pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],
   ....:                      index=['a','c','e'], columns=['Ohio','Nevada'])

In [56]: right2 = pd.DataFrame([[7.,8.],[9.,10.],[11.,12],[13,14]],
   ....:                       index=['b','c','d','e'], columns=['Missouri','Alabama'])

In [57]: left2
Out[57]:
   Ohio  Nevada
a     1       2
c     3       4
e     5       6

[3 rows x 2 columns]

In [58]: right2
Out[58]:
   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14

[4 rows x 2 columns]

In [59]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[59]:
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

[5 rows x 4 columns]
- DataFrame has a join instance method, which merges by index more conveniently
- join performs a left join by default
- It can also join one of the caller's columns against the index of the DataFrame passed in
In [60]: left2.join(right2, how='outer')
Out[60]:
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

[5 rows x 4 columns]

In [61]: left2.join(right2)
Out[61]:
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
c     3       4         9       10
e     5       6        13       14

[3 rows x 4 columns]

In [62]: left1
Out[62]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

[6 rows x 2 columns]

In [63]: right1
Out[63]:
   group_val
a        3.5
b        7.0

[2 rows x 1 columns]

In [64]: left1.join(right1, on='key')
Out[64]:
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN

[6 rows x 3 columns]
- join can also take a list of DataFrames (the concat function, covered below, does this too)
In [65]: another = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[16.,17.]],
   ....:                        index=['a','c','e','f'], columns=['New York','Oregon'])

In [66]: left2.join([right2, another])
Out[66]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2       NaN      NaN         7       8
c     3       4         9       10         9      10
e     5       6        13       14        11      12

[3 rows x 6 columns]
(3) Concatenating along an axis
- Concatenation along an axis can be thought of as connecting, binding, or stacking data
# coding=utf-8
import numpy as np
import pandas as pd

"""
NumPy's concatenate function merges raw NumPy arrays.
"""
arr = np.arange(12).reshape((3, 4))
print arr
'''
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
'''
print np.concatenate([arr, arr], axis=1)
'''
[[ 0  1  2  3  0  1  2  3]
 [ 4  5  6  7  4  5  6  7]
 [ 8  9 10 11  8  9 10 11]]
'''

"""
For the pandas objects Series and DataFrame, the labeled axes let us
generalize array concatenation further.
"""

"""
(1) When the objects' other axes have different indexes, does concat take
the intersection or the union of those axes? Answer: the union, by default.
"""
# concat works along axis=0 by default
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
print pd.concat([s1, s2, s3])
'''
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
'''

# Asking for axis=1 turns the result into a DataFrame
print pd.concat([s1, s2, s3], axis=1)
'''
    0   1   2
a   0 NaN NaN
b   1 NaN NaN
c NaN   2 NaN
d NaN   3 NaN
e NaN   4 NaN
f NaN NaN   5
g NaN NaN   6
'''

# Passing join='inner' gives the intersection instead
s4 = pd.concat([s1 * 5, s3])
print pd.concat([s1, s4], axis=1)
'''
    0  1
a   0  0
b   1  5
f NaN  5
g NaN  6

[4 rows x 2 columns]
'''
print pd.concat([s1, s4], axis=1, join='inner')
'''
   0  1
a  0  0
b  1  5

[2 rows x 2 columns]
'''

# join_axes specifies the index to use for the other axis; labels missing
# from both Series come out as NaN
print pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
'''
    0   1
a   0   0
c NaN NaN
b   1   5
e NaN NaN

[4 rows x 2 columns]
'''

# The pieces are indistinguishable in the result; the keys argument creates
# a hierarchical index that tells them apart
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
print result
'''
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
'''

# A variant of the same experiment with overlapping labels:
"""
s1 = pd.Series([1, 2, 6], index=['a', 'b', 'f'])
s2 = pd.Series([3, 4], index=['c', 'd'])
s3 = pd.Series([5, 6], index=['e', 'f'])
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
print result
'''
one    a    1
       b    2
       f    6
two    c    3
       d    4
three  e    5
       f    6
dtype: int64
'''
"""

# print result.unstack()

# When combining Series along axis=1, the keys become the DataFrame's
# column headers
print pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
print pd.concat([s1, s2, s3], axis=1)
'''
   one two three
a    0 NaN   NaN
b    1 NaN   NaN
c  NaN   2   NaN
d  NaN   3   NaN
e  NaN   4   NaN
f  NaN NaN     5
g  NaN NaN     6

[7 rows x 3 columns]

    0   1   2
a   0 NaN NaN
b   1 NaN NaN
c NaN   2 NaN
d NaN   3 NaN
e NaN   4 NaN
f NaN NaN   5
g NaN NaN   6

[7 rows x 3 columns]
'''
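Note that the `join_axes` argument used above belongs to older pandas and was removed in pandas 1.0. A small sketch of the replacement, assuming a recent pandas: concatenate first, then reindex the result to the desired labels.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s4 = pd.concat([s1 * 5, pd.Series([5, 6], index=['f', 'g'])])

# Instead of join_axes=[['a','c','b','e']], reindex the concatenated frame;
# labels absent from both Series ('c', 'e') become all-NaN rows.
result = pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])
print(result)
```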
- Everything said above about Series applies to DataFrame as well
# coding=utf-8
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
print pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
'''
  level1      level2
     one  two  three four
a      0    1      5    6
b      2    3    NaN  NaN
c      4    5      7    8

[3 rows x 4 columns]
'''

# If a dict is passed instead of a list, the dict's keys are used as
# the value of the keys option
dic = {'level1': df1, 'level2': df2}
print pd.concat(dic, axis=1)
'''
  level1      level2
     one  two  three four
a      0    1      5    6
b      2    3    NaN  NaN
c      4    5      7    8

[3 rows x 4 columns]
'''
- Parameters of the concat function
# coding=utf-8
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
print pd.concat([df1, df2])

# Discard the indexes along the concatenation axis and create a fresh one
print pd.concat([df1, df2], ignore_index=True)
print pd.concat([df1, df2], ignore_index=False)
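What ignore_index actually changes can be checked on deterministic data (a sketch of mine, not from the original post): only the row labels differ, while column alignment still happens by name, so df2's missing column 'c' is filled with NaN either way.

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['b', 'd', 'a'])

# ignore_index=False keeps the original row labels (0,1,2 then 0,1);
# ignore_index=True discards them and builds a new RangeIndex 0..4.
kept = pd.concat([df1, df2])
fresh = pd.concat([df1, df2], ignore_index=True)
print(list(kept.index), list(fresh.index))
```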
(4) Combining overlapping data
# coding=utf-8
import pandas as pd
import numpy as np

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan
print np.where(pd.isnull(a), b, a)
print b[:-2].combine_first(a[2:])
'''
[ 0.   2.5  2.   3.5  4.5  nan]
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
'''

# For DataFrames, combine_first naturally does the same thing column by
# column: think of it as the argument object's data "patching" the missing
# data of the calling object.
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
print df1.combine_first(df2)
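The patching behavior of combine_first can be checked directly (same data as above, as a self-contained sketch): values present in the caller win, its NaNs are filled from the argument, and rows that exist only in the argument pass through.

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

# combine_first aligns on both index and columns: df1's non-NaN values are
# kept, its NaNs are patched from df2, and df2's extra row 4 is appended
# (its 'c' value is NaN because df2 has no 'c' column).
patched = df1.combine_first(df2)
print(patched)
```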
Reshaping and pivoting
(1) Reshaping with a hierarchical index
# coding=utf-8
import pandas as pd
import numpy as np

"""
Hierarchical indexing gives DataFrame rearrangement a consistent vocabulary:
(1) stack rotates the data's columns into rows
(2) unstack rotates the data's rows into columns
"""
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
print data

# stack turns the DataFrame's columns into rows, producing a Series
result = data.stack()
print result

# unstack turns a hierarchically indexed Series back into a DataFrame
print result.unstack()
'''
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

[2 rows x 3 columns]

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5

[2 rows x 3 columns]
'''

# unstack (like stack) operates on the innermost level by default;
# passing a level number or name unstacks a different level instead
print result.unstack(0) == result.unstack('state')
'''
state   Ohio  Colorado
number
one     True      True
two     True      True
three   True      True

[3 rows x 2 columns]
'''
- unstack may introduce missing data when not all of the level's values can be found in each subgroup
- stack filters out missing data by default, so the operation is invertible
# coding=utf-8
import pandas as pd
import numpy as np

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
print data2

# concat -> hierarchically indexed Series -> DataFrame
# (rows: one per Series; columns: the Series' index labels)
print data2.unstack()
'''
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

      a   b  c  d   e
one   0   1  2  3 NaN
two NaN NaN  4  5   6

[2 rows x 5 columns]
'''

print data2.unstack().stack()
print data2.unstack().stack(dropna=False)
'''
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: float64

one  a     0
     b     1
     c     2
     d     3
     e   NaN
two  a   NaN
     b   NaN
     c     4
     d     5
     e     6
dtype: float64
'''
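The round trip above can be verified directly. One caveat of mine: older pandas `stack()` drops the NaNs by default, while newer versions (with the "future stack" behavior) may keep them, so this sketch calls `dropna()` explicitly to make the round trip version independent; the dtype still becomes float after the NaN step.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

# unstack pads the missing (group, label) pairs such as ('two', 'a') with NaN
wide = data2.unstack()

# stacking back and dropping the padding NaNs restores the original 7 entries
round_trip = wide.stack().dropna()
print(round_trip)
```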
- The rotated axis ends up as the lowest level of the result
# When unstacking a DataFrame, the rotated axis becomes the lowest level
# in the result. This continues the previous example, reusing the `result`
# Series produced by data.stack() above.
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
print df.unstack('state')
print df.unstack('state').stack('side')
(2) Pivoting "long" format to "wide" format
- Converting raw data into the format we need
- This long format is how data is commonly stored in relational databases
# coding=utf-8
import pandas as pd
import numpy as np

data = pd.read_csv("/home/peerslee/py/pydata/pydata-book-master/ch07/macrodata.csv")
frame01 = pd.DataFrame(data, columns=['year', 'realgdp', 'infl', 'unemp'])
path01 = '/home/peerslee/py/pydata/pydata-book-master/ch07/macrodata01.csv'
frame01.to_csv(path01, index=False, header=False)  # write without index and header

names02 = ['year', 'realgdp', 'infl', 'unemp']
frame02 = pd.read_table(path01, sep=',', names=names02, index_col='year')  # use the 'year' column as the row index

path02 = '/home/peerslee/py/pydata/pydata-book-master/ch07/macrodata02.csv'
frame02.stack().to_csv(path02)  # pivot, then write to file; the year column is forward-filled

names03 = ['date', 'item', 'value']
frame03 = pd.read_table(path02, sep=',', names=names03)
print frame03
'''
    date     item     value
0   1959  realgdp  2710.349
1   1959     infl     0.000
2   1959    unemp     5.800
3   1959  realgdp  2778.801
4   1959     infl     2.340
5   1959    unemp     5.100
6   1959  realgdp  2775.488
7   1959     infl     2.740
8   1959    unemp     5.300
9   1959  realgdp  2785.204
10  1959     infl     0.270
'''

result_path = '/home/peerslee/py/pydata/pydata-book-master/ch07/result_data.csv'
frame03.to_csv(result_path)  # save the long-format data
- Relational databases (e.g. MySQL) store data in exactly this layout
- But when I tried to pivot this data, I got an error:
    raise ValueError('Index contains duplicate entries, '
    ValueError: Index contains duplicate entries, cannot reshape
- So the quarter/month needs to go into the key as well, but I have not figured out how yet; I am leaving it here for now (ideas and discussion are welcome)
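A possible answer to the question above (my suggestion, not from the original post): either include enough columns, such as the quarter, in the index so that each (index, column) pair is unique, or use pivot_table, which aggregates duplicates instead of raising. A hypothetical sketch with made-up quarterly rows:

```python
import pandas as pd

# Hypothetical long-format data where 'year' alone is not a unique key:
# each year has several quarterly observations per item.
long = pd.DataFrame({
    'year': [1959, 1959, 1959, 1959],
    'quarter': [1, 1, 2, 2],
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [0.00, 5.8, 2.34, 5.1],
})

# pivot would raise here (duplicate (1959, item) pairs);
# pivot_table aggregates the duplicates (mean by default) instead.
by_year = long.pivot_table(index='year', columns='item', values='value')

# Alternatively, make the key unique by adding 'quarter' to the index.
by_quarter = long.pivot_table(index=['year', 'quarter'],
                              columns='item', values='value')
print(by_year)
print(by_quarter)
```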
- Working with data in this long format is not always convenient; we often want to convert it into a DataFrame where each distinct item value gets its own column
# coding=utf-8
import pandas as pd
import numpy as np

"""
DataFrame's pivot method rotates "long" format into "wide" format.
"""
# Without the original data at hand, we construct a small example by hand
ldata = pd.DataFrame({'date': ['03-31', '03-31', '03-31', '06-30', '06-30', '06-30'],
                      'item': ['real', 'infl', 'unemp', 'real', 'infl', 'unemp'],
                      'value': ['2710.', '000.', '5.8', '2778.', '2.34', '5.1']})
print ldata

# Use date as the row index, item as the column index, and fill in value
pivoted = ldata.pivot('date', 'item', 'value')
print pivoted
'''
item    infl   real unemp
date
03-31   000.  2710.   5.8
06-30   2.34  2778.   5.1

[2 rows x 3 columns]
'''

# Add a second value column so that two columns get reshaped
ldata['value2'] = np.random.randn(len(ldata))
print ldata

# Omitting pivot's last argument yields hierarchical columns
pivoted = ldata.pivot('date', 'item')
print pivoted
'''
       value              value2
item    infl   real unemp    infl      real     unemp
date
03-31   000.  2710.   5.8  1.059406  0.437246  0.106987
06-30   2.34  2778.   5.1 -1.087665 -0.811100 -0.579266

[2 rows x 6 columns]
'''
print pivoted['value'][:5]
'''
item    infl   real unemp
date
03-31   000.  2710.   5.8
06-30   2.34  2778.   5.1

[2 rows x 3 columns]
'''

# This is what the full pivot operation does under the hood:
# set_index followed by unstack
unstacked = ldata.set_index(['date', 'item']).unstack('item')
print unstacked
'''
       value              value2
item    infl   real unemp    infl      real     unemp
date
03-31   000.  2710.   5.8 -1.018416 -1.476397  1.579151
06-30   2.34  2778.   5.1  0.863437  1.606538 -1.147549

[2 rows x 6 columns]
'''
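Note that in recent pandas (2.0 and later) DataFrame.pivot only accepts keyword arguments, and the equivalence with set_index + unstack still holds. A minimal sketch with numeric values:

```python
import pandas as pd

ldata = pd.DataFrame({'date': ['03-31', '03-31', '06-30', '06-30'],
                      'item': ['real', 'infl', 'real', 'infl'],
                      'value': [2710., 0.00, 2778., 2.34]})

# Modern pandas requires keyword arguments for pivot.
pivoted = ldata.pivot(index='date', columns='item', values='value')

# pivot is shorthand for set_index on the keys followed by unstack.
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')
print(pivoted.equals(unstacked))
```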
Data transformation:
1. Removing duplicates
# coding=utf-8
import pandas as pd
import numpy as np

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
print data
'''
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

[7 rows x 2 columns]
'''

"""
DataFrame's duplicated method returns a boolean Series indicating
whether each row is a duplicate.
"""
print data.duplicated()
'''
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
'''

"""
DataFrame's drop_duplicates method returns a DataFrame with the
duplicate rows removed.
"""
print data.drop_duplicates()
'''
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

[4 rows x 2 columns]
'''

"""
Duplicates can also be judged based on a specific subset of columns.
"""
data['v1'] = range(7)
print data.drop_duplicates(['k1'])
'''
    k1  k2  v1
0  one   1   0
3  two   3   3

[2 rows x 3 columns]
'''

"""
By default the first occurrence is kept; pass take_last=True to keep
the last occurrence instead.
"""
print data.drop_duplicates(['k1', 'k2'], take_last=True)
'''
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

[4 rows x 3 columns]
'''
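Note that take_last=True is the old (pre-0.17) spelling; modern pandas uses keep='last'. A sketch of the current API on the same data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# keep='first' (the default) keeps the first row of each duplicate group;
# keep='last' keeps the last one instead (the old take_last=True).
first = data.drop_duplicates(['k1', 'k2'])
last = data.drop_duplicates(['k1', 'k2'], keep='last')
print(list(first['v1']), list(last['v1']))
```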
2. Transforming data with a function or mapping
# coding=utf-8
import pandas as pd
import numpy as np

# A table of foods and their weights
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham',
                              'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# A mapping from food to animal
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'pig',
    'nova lox': 'salmon'
}

"""
Series' map method accepts a function or a dict-like object that holds
a mapping.
"""
# Lowercase every food in the table first, then map it to its animal
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
print data
'''
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     pig
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     NaN
8     nova lox     6.0  salmon

[9 rows x 3 columns]
'''

"""
We could also pass a single function that does all of this work at once.
"""
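The closing comment above ("a single function that does all the work") can be sketched with a lambda that lowercases and looks up in one step:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'Bacon', 'nova lox'],
                     'ounces': [4, 3, 8, 6]})
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'nova lox': 'salmon'}

# One function replaces the two chained map calls: lowercase, then look up.
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data['animal'].tolist())
```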
3. Replacing values
# coding=utf-8
import pandas as pd
import numpy as np

data = pd.Series([1., -999., 2., -999., -1000., 3.])

"""
Use pandas to replace sentinel values like -999 with NA.
"""
data.replace(-999, np.nan)

"""
To replace several values at once, pass a list of the values to replace
along with a single replacement value.
"""
data.replace([-999, -1000], np.nan)

"""
To use a different replacement for each value, pass a list of
replacements as well.
"""
data.replace([-999, -1000], [np.nan, 0])

"""
The argument can also be a dict.
"""
data.replace({-999: np.nan, -1000: 0})
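Each replace call above returns a new Series rather than modifying the original; a quick check of the dict form:

```python
import pandas as pd
import numpy as np

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Dict form: each key is replaced by its value; -999 -> NaN, -1000 -> 0.
cleaned = data.replace({-999: np.nan, -1000: 0})
print(cleaned.tolist())
```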
4. Renaming axis indexes
# coding=utf-8
import pandas as pd
import numpy as np

"""
Like values, axis labels can be transformed by a function or mapping to
produce a new object; the axes can also be modified in place.
"""
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

"""
Axis indexes also have a map method.
"""
print data.index.map(str.upper)

"""
Assigning the result back to index modifies the DataFrame in place.
"""
data.index = data.index.map(str.upper)
print data

"""
rename creates a transformed version of the dataset without modifying
the original.
"""
data.rename(index=str.title, columns=str.upper)

"""
rename can be combined with dict-like objects to update only a subset
of the axis labels.
"""
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

"""
rename can also modify the DataFrame in place with inplace=True.
"""
_ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
Posted: 2024-10-10 01:51:11