I: Reindexing
The reindex method rearranges a Series according to a new index. For a DataFrame it can change the row index, the column index, or both at once.
1) For a Series
# Assumes `import numpy as np` and `import pandas as pd` were run first.
In [2]: seri = pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])

In [3]: seri
Out[3]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [4]: seri1 = seri.reindex(['a','b','c','d','e'])

In [5]: seri1
Out[5]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN    # labels missing from the original become NaN
dtype: float64

In [6]: seri.reindex(['a','b','c','d','e'], fill_value=0)
Out[6]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0    # missing labels are filled with 0
dtype: float64

In [7]: seri
Out[7]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [8]: seri_2 = pd.Series(['blue','purple','yellow'], index=[0,2,4])

In [9]: seri_2
Out[9]:
0      blue
2    purple
4    yellow
dtype: object

# Fill methods available to reindex: ffill fills forward, bfill fills backward

In [10]: seri_2.reindex(range(6),method='ffill')
Out[10]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [11]: seri_2.reindex(range(6),method='bfill')
Out[11]:
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
Reindexing a Series
2) For a DataFrame
Parameters of its reindex include method='ffill' or 'bfill', fill_value=... (the value used where NaN would otherwise appear), and others.
In [4]: dframe_1 = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Ohio','Texas','Cal'])

In [5]: dframe_1
Out[5]:
   Ohio  Texas  Cal
a     0      1    2
b     3      4    5
c     6      7    8

In [6]: dframe_2 = dframe_1.reindex(['a','b','c','d'])

In [7]: dframe_2
Out[7]:
   Ohio  Texas  Cal
a     0      1    2
b     3      4    5
c     6      7    8
d   NaN    NaN  NaN

In [16]: dframe_1.reindex(index=['a','b','c','d'],method='ffill',columns=['Ohio','Beijin','Cal'])
Out[16]:
   Ohio  Beijin  Cal
a     0     NaN    2
b     3     NaN    5
c     6     NaN    8
d     6     NaN    8

In [17]: dframe_1.reindex(index=['a','b','c','d'],fill_value='Z',columns=['Ohio','Beijin','Cal'])
Out[17]:
  Ohio Beijin Cal
a    0      Z   2
b    3      Z   5
c    6      Z   8
d    Z      Z   Z

In [8]: dframe_1.reindex(columns=['Chengdu','Beijin','Shanghai','Guangdong'])
Out[8]:
   Chengdu  Beijin  Shanghai  Guangdong
a      NaN     NaN       NaN        NaN
b      NaN     NaN       NaN        NaN
c      NaN     NaN       NaN        NaN

In [9]: dframe_1
Out[9]:
   Ohio  Texas  Cal
a     0      1    2
b     3      4    5
c     6      7    8

# Use ix to change the row and column index at the same time
# (.ix was removed in pandas 1.0; this session was recorded on an older version)
In [10]: dframe_1.ix[['a','b','c','d'],['Ohio','Beijing','Guangdong']]
Out[10]:
   Ohio  Beijing  Guangdong
a     0      NaN        NaN
b     3      NaN        NaN
c     6      NaN        NaN
d   NaN      NaN        NaN
Reindexing a DataFrame
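The last step of the session above relies on `.ix`, which current pandas no longer provides. A minimal sketch of the same row-and-column reindexing, assuming a recent pandas version and reusing `dframe_1` from the example (the names `both` and `subset` are illustrative only):

```python
import numpy as np
import pandas as pd

dframe_1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                        index=['a', 'b', 'c'],
                        columns=['Ohio', 'Texas', 'Cal'])

# Reindex rows and columns in one call; labels that do not exist yet become NaN.
both = dframe_1.reindex(index=['a', 'b', 'c', 'd'],
                        columns=['Ohio', 'Beijing', 'Guangdong'])

# .loc also selects by label, but only for labels that already exist.
subset = dframe_1.loc[['a', 'c'], ['Ohio', 'Cal']]

print(both)
print(subset)
```

reindex is the safer choice whenever some of the requested labels may be missing, since `.loc` raises a KeyError for unknown labels in recent versions.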
II: Dropping entries from an axis
The drop method deletes entries by index label.
1) For a Series
In [21]: seri = pd.Series(np.arange(5),index=['a','b','c','d','e'])

In [22]: seri
Out[22]:
a    0
b    1
c    2
d    3
e    4
dtype: int32

In [23]: seri.drop('b')
Out[23]:
a    0
c    2
d    3
e    4
dtype: int32

In [24]: seri.drop(['d','e'])
Out[24]:
a    0
b    1
c    2
dtype: int32
Dropping entries from a Series
2) For a DataFrame
In [29]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Chen','Bei','Shang','Guang'],columns=['one','two','three','four'])

In [30]: dframe
Out[30]:
       one  two  three  four
Chen     0    1      2     3
Bei      4    5      6     7
Shang    8    9     10    11
Guang   12   13     14    15

# Drop rows
In [31]: dframe.drop(['Bei','Shang'])
Out[31]:
       one  two  three  four
Chen     0    1      2     3
Guang   12   13     14    15

# Drop columns
In [33]: dframe.drop(['two','three'],axis=1)
Out[33]:
       one  four
Chen     0     3
Bei      4     7
Shang    8    11
Guang   12    15

# If the first argument is a single label, the [] brackets can be omitted
Dropping entries from a DataFrame
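The closing comment in the session says a single label can be passed without brackets, but does not show it. A small sketch of that case, reusing the `dframe` from the example and assuming a pandas version that supports the `columns=` keyword of drop (the result names are illustrative):

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Chen', 'Bei', 'Shang', 'Guang'],
                      columns=['one', 'two', 'three', 'four'])

rows_left = dframe.drop('Bei')           # single row label, no list needed
cols_left = dframe.drop('two', axis=1)   # single column label
same_cols = dframe.drop(columns='two')   # keyword form of the line above

# drop returns a new object; the original frame is unchanged
print(rows_left)
print(cols_left)
print(dframe.shape)
```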
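III: Indexing, selection, and filtering
1) Series
A Series can still be accessed by integer position, like a list, but I do not think that is a good idea. It is better to access it by index label, and index labels can also be used for slicing (a label slice includes its end point).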
In [4]: seri = pd.Series(np.arange(4),index=['a','b','c','d'])

In [5]: seri
Out[5]:
a    0
b    1
c    2
d    3
dtype: int32

In [6]: seri['a']
Out[6]: 0

In [7]: seri[['b','a']]    # the display order follows the requested labels
Out[7]:
b    1
a    0
dtype: int32

In [18]: seri[seri<2]    # element-wise comparison used as a filter
Out[18]:
a    0
b    1
dtype: int32

In [11]: seri['a':'c']    # slicing by index label
Out[11]:
a    0
b    1
c    2
dtype: int32

In [12]: seri['a':'c']='z'

In [13]: seri
Out[13]:
a    z
b    z
c    z
d    3
dtype: object
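Selecting from a Series
2) DataFrame
This is really a matter of retrieving one or more columns. A DataFrame can be viewed as several Series that share the same index, and for the DataFrame itself the fields across the header row (the column labels) are what [] indexes on. So dframe[[n labels]] selects several columns, and those labels must be column labels, otherwise an error is raised. To select rows, use a slice: dframe[:2] selects rows 0 and 1. ix[arg1 (rows), arg2 (columns)] selects along both directions at once.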
In [19]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['one','two','three','four'],columns=['Bei','Shang','Guang','Sheng'])

In [21]: dframe
Out[21]:
       Bei  Shang  Guang  Sheng
one      0      1      2      3
two      4      5      6      7
three    8      9     10     11
four    12     13     14     15

In [22]: dframe[['one']]    # error: 'one' is a row label, not a column label, as explained above
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-c2522043b676> in <module>()
----> 1 dframe[['one']]

In [25]: dframe[['Bei']]
Out[25]:
       Bei
one      0
two      4
three    8
four    12

In [26]: dframe[['Bei','Sheng']]
Out[26]:
       Bei  Sheng
one      0      3
two      4      7
three    8     11
four    12     15

In [27]: dframe[:2]    # select rows
Out[27]:
     Bei  Shang  Guang  Sheng
one    0      1      2      3
two    4      5      6      7

In [32]: # For label-based indexing on a DataFrame, use ix: its first argument controls the rows, the second the columns
         # (.ix was removed in pandas 1.0; this session was recorded on an older version)

In [33]: dframe.ix[['one','two'],['Bei','Shang']]
Out[33]:
     Bei  Shang
one    0      1
two    4      5

# This shows that each column label is also an attribute of the dframe instance
In [35]: dframe.Bei
Out[35]:
one       0
two       4
three     8
four     12
Name: Bei, dtype: int32

In [36]: dframe[dframe.Bei<5]
Out[36]:
     Bei  Shang  Guang  Sheng
one    0      1      2      3
two    4      5      6      7

In [38]: dframe.ix[dframe.Bei<5,:2]
Out[38]:
     Bei  Shang
one    0      1
two    4      5

In [43]: dframe.ix[:'two',['Shang','Bei']]
Out[43]:
     Shang  Bei
one      1    0
two      5    4
Selecting from a DataFrame
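The `.ix` selections in the session can be expressed with `.loc` (labels) and `.iloc` (integer positions) in current pandas. The sketch below is one possible translation, assuming a recent pandas version. The variable names are illustrative, and the boolean-mask case is rewritten with `dframe.columns[:2]` because `.loc` does not mix labels with positions:

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['one', 'two', 'three', 'four'],
                      columns=['Bei', 'Shang', 'Guang', 'Sheng'])

# Label-based selection on both axes
by_label = dframe.loc[['one', 'two'], ['Bei', 'Shang']]

# Position-based selection of the same block
by_position = dframe.iloc[:2, :2]

# Boolean row mask combined with the first two column labels
masked = dframe.loc[dframe['Bei'] < 5, dframe.columns[:2]]

# Label slices include the end label 'two'
label_slice = dframe.loc[:'two', ['Shang', 'Bei']]

print(by_label)
print(label_slice)
```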
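IV: Arithmetic
1) Series
Operands are aligned on their index automatically before the operation, and wherever the index labels do not overlap the result is NaN. The arithmetic methods with a fill_value argument avoid this.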
In [4]: seri_1 = pd.Series([1,2,3,4],index = ['a','b','c','d'])

In [5]: seri_2 = pd.Series([5,6,7,8,9],index = ['a','c','e','g','f'])

In [6]: seri_1 + seri_2
Out[6]:
a     6
b   NaN
c     9
d   NaN
e   NaN
f   NaN
g   NaN
dtype: float64

In [8]: seri_1.add(seri_2)
Out[8]:
a     6
b   NaN
c     9
d   NaN
e   NaN
f   NaN
g   NaN
dtype: float64

In [7]: seri_1.add(seri_2,fill_value = 0)
Out[7]:
a    6
b    2
c    9
d    4
e    7
f    9
g    8
dtype: float64

# The non-overlapping labels above now show values instead of NaN
# Corresponding methods: add: +; mul: *; sub: -; div: /
Series arithmetic
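The session only demonstrates add. As a small sketch, the other methods named in its closing comment behave the same way, reusing `seri_1` and `seri_2` from above (the result names and the fill value of 1 for multiplication are choices made here for illustration):

```python
import pandas as pd

seri_1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
seri_2 = pd.Series([5, 6, 7, 8, 9], index=['a', 'c', 'e', 'g', 'f'])

# A label missing on one side is treated as 0 before subtracting
diff = seri_1.sub(seri_2, fill_value=0)

# For multiplication, 1 is the neutral fill value
prod = seri_1.mul(seri_2, fill_value=1)

# Without fill_value, non-overlapping labels stay NaN
ratio = seri_1.div(seri_2)

print(diff)
print(prod)
print(ratio)
```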
2) DataFrame
In [10]: df_1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns = list('abcd'))

In [11]: df_2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns = list('abcde'))

In [12]: df_1 + df_2
Out[12]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

In [13]: df_1.add(df_2)
Out[13]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

In [14]: df_1.add(df_2, fill_value = 0)
Out[14]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
DataFrame arithmetic
3) Operations between a DataFrame and a Series
These behave like broadcasting with np.array:
In [15]: arr_1 = np.arange(12).reshape((3,4))

In [16]: arr_1 - arr_1[0]
Out[16]:
array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

In [17]: arr_1
Out[17]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
The array case
In [18]: dframe_1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index = ['Chen','Bei','Shang','Sheng'])

In [19]: dframe_1
Out[19]:
       b   d   e
Chen   0   1   2
Bei    3   4   5
Shang  6   7   8
Sheng  9  10  11

In [20]: seri = dframe_1.ix[0]    # .ix was removed in pandas 1.0; recorded on an older version

In [21]: seri
Out[21]:
b    0
d    1
e    2
Name: Chen, dtype: int32

In [22]: dframe_1 - seri    # the Series is matched against the columns and applied to every row
Out[22]:
       b  d  e
Chen   0  0  0
Bei    3  3  3
Shang  6  6  6
Sheng  9  9  9

In [23]: seri_2 = pd.Series(range(3),index=['b','e','f'])

In [24]: dframe_1 - seri_2
Out[24]:
       b   d   e   f
Chen   0 NaN   1 NaN
Bei    3 NaN   4 NaN
Shang  6 NaN   7 NaN
Sheng  9 NaN  10 NaN

In [27]: seri_3 = dframe_1['d']

In [28]: seri_3    # Note: the index of seri_3 holds dframe_1's row labels, not its columns, so this case differs from the ones above
Out[28]:
Chen      1
Bei       4
Shang     7
Sheng    10
Name: d, dtype: int32

In [29]: dframe_1 - seri_3
Out[29]:
       Bei  Chen  Shang  Sheng   b   d   e
Chen   NaN   NaN    NaN    NaN NaN NaN NaN
Bei    NaN   NaN    NaN    NaN NaN NaN NaN
Shang  NaN   NaN    NaN    NaN NaN NaN NaN
Sheng  NaN   NaN    NaN    NaN NaN NaN NaN
# Note that the resulting columns are the union of the Series index and the DataFrame's own columns

# The axis argument of the arithmetic methods changes which axis is matched, which avoids the situation above
# axis=0 matches the Series index against the row index; axis=1 matches it against the columns
In [31]: dframe_1.sub(seri_3,axis=0)
Out[31]:
       b  d  e
Chen  -1  0  1
Bei   -1  0  1
Shang -1  0  1
Sheng -1  0  1

In [33]: dframe_1.sub(seri_3,axis=1)
Out[33]:
       Bei  Chen  Shang  Sheng   b   d   e
Chen   NaN   NaN    NaN    NaN NaN NaN NaN
Bei    NaN   NaN    NaN    NaN NaN NaN NaN
Shang  NaN   NaN    NaN    NaN NaN NaN NaN
Sheng  NaN   NaN    NaN    NaN NaN NaN NaN
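DataFrame and Series operations
Note: the axis argument can be read as which labels the Series index is matched against. 0: a Series indexed by the DataFrame's index (the vertical axis); 1: a Series indexed by the DataFrame's columns (the horizontal axis).
V: Applying functions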
In [6]: dframe=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Chen','Bei','Shang','Sheng'])

In [7]: dframe
Out[7]:
              b         d         e
Chen   1.838620  1.023421  0.641420
Bei    0.920563 -2.037778 -0.853871
Shang -0.587332  0.576442  0.596269
Sheng  0.366174 -0.689582 -1.064030

In [8]: np.abs(dframe)    # absolute value
Out[8]:
              b         d         e
Chen   1.838620  1.023421  0.641420
Bei    0.920563  2.037778  0.853871
Shang  0.587332  0.576442  0.596269
Sheng  0.366174  0.689582  1.064030

In [9]: func = lambda x: x.max() - x.min()

In [10]: dframe.apply(func)
Out[10]:
b    2.425952
d    3.061200
e    1.705449
dtype: float64

In [11]: dframe.apply(func,axis=1)
Out[11]:
Chen     1.197200
Bei      2.958341
Shang    1.183602
Sheng    1.430204
dtype: float64

In [12]: dframe.max()    # same as dframe.max(axis=0)
Out[12]:
b    1.838620
d    1.023421
e    0.641420
dtype: float64

In [15]: dframe.max(axis=1)
Out[15]:
Chen     1.838620
Bei      0.920563
Shang    0.596269
Sheng    0.366174
dtype: float64
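The function passed to apply does not have to return a scalar. As a hedged sketch, it can also return a Series, in which case the result is a DataFrame. The fixed data and the helper name `min_max` below are made up for illustration so that the output is reproducible:

```python
import pandas as pd

dframe = pd.DataFrame([[1.0, -2.0, 3.0],
                       [4.0, 0.5, -6.0],
                       [-7.0, 8.0, 9.0]],
                      columns=list('bde'),
                      index=['Chen', 'Bei', 'Shang'])

def min_max(x):
    # x is one column (a Series); return two labeled values per column
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = dframe.apply(min_max)

# axis=1 passes each row to the function instead of each column
spread_by_row = dframe.apply(lambda x: x.max() - x.min(), axis=1)

print(summary)
print(spread_by_row)
```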
VI: Sorting
1) Sorting by index: sort_index(axis=0/1, ascending=False/True). Note: axis defaults to 0 (sort by the row index) and ascending defaults to True (ascending order).
In [16]: seri = pd.Series(range(4),index=['d','a','d','c'])

In [17]: seri
Out[17]:
d    0
a    1
d    2
c    3
dtype: int64

In [18]: seri.sort_index()
Out[18]:
a    1
c    3
d    2
d    0
dtype: int64
Sorting a Series by index
In [22]: dframe
Out[22]:
              c         a         b
Chen   1.838620  1.023421  0.641420
Bei    0.920563 -2.037778 -0.853871
Shang -0.587332  0.576442  0.596269
Sheng  0.366174 -0.689582 -1.064030

In [23]: dframe.sort_index()
Out[23]:
              c         a         b
Bei    0.920563 -2.037778 -0.853871
Chen   1.838620  1.023421  0.641420
Shang -0.587332  0.576442  0.596269
Sheng  0.366174 -0.689582 -1.064030

In [24]: dframe.sort_index(axis=1)
Out[24]:
              a         b         c
Chen   1.023421  0.641420  1.838620
Bei   -2.037778 -0.853871  0.920563
Shang  0.576442  0.596269 -0.587332
Sheng -0.689582 -1.064030  0.366174
Sorting a DataFrame by index; axis specifies whether to sort by the index (0, the default) or by the columns (1)
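The ascending parameter mentioned for sort_index is not exercised in the sessions above. A minimal sketch with a small made-up frame (the data and names are illustrative):

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(8).reshape((2, 4)),
                      index=['three', 'one'],
                      columns=['d', 'a', 'b', 'c'])

# Sort the row labels in descending order
rows_desc = dframe.sort_index(ascending=False)

# Sort the column labels in descending order
cols_desc = dframe.sort_index(axis=1, ascending=False)

print(rows_desc)
print(cols_desc)
```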
2) Sorting by value: the sort_values method (note: the older order method is deprecated)
In [32]: seri =pd.Series([4,7,np.nan,-1,2,np.nan])

In [33]: seri
Out[33]:
0     4
1     7
2   NaN
3    -1
4     2
5   NaN
dtype: float64

In [34]: seri.sort_values()
Out[34]:
3    -1
4     2
0     4
1     7
2   NaN
5   NaN
dtype: float64

# NaN values are placed last by default
Sorting a Series by value
In [38]: dframe = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})

In [39]: dframe
Out[39]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

In [54]: dframe.sort_values('a')
Out[54]:
   a  b
0  0  4
2  0 -3
1  1  7
3  1  2

In [55]: dframe.sort_values('b')
Out[55]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

In [57]: dframe.sort_values(['a','b'])
Out[57]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

In [58]: dframe.sort_values(['b','a'])
Out[58]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
Sorting a DataFrame by value
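sort_values also accepts a sort direction per key and a position for NaN values. The sketch below reuses the data from the two sessions above. The `ascending` list and `na_position` are standard sort_values parameters, while the variable names are illustrative:

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# 'a' ascending, then 'b' descending within each value of 'a'
mixed = dframe.sort_values(['a', 'b'], ascending=[True, False])

seri = pd.Series([4, 7, np.nan, -1, 2, np.nan])

# Put NaN values first instead of last
nan_first = seri.sort_values(na_position='first')

# Descending order; NaN values still go last by default
desc = seri.sort_values(ascending=False)

print(mixed)
print(nan_first)
print(desc)
```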
VII: Ranking
The rank method
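The notes give no example for rank, so here is a minimal sketch with made-up data. By default tied values receive the mean of the ranks they span, and the `method` and `ascending` parameters change that behavior:

```python
import pandas as pd

seri = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: ties share the average of their ranks
average_rank = seri.rank()

# 'first': ties are ranked in the order they appear in the data
first_rank = seri.rank(method='first')

# Descending ranks, ties all get the largest rank of their group
max_desc_rank = seri.rank(ascending=False, method='max')

print(average_rank.tolist())
print(first_rank.tolist())
print(max_desc_rank.tolist())
```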
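VIII: Summary statistics
count: number of non-NaN values; describe: summary statistics for a Series or for each DataFrame column; min, max; argmin, argmax: integer position of the minimum/maximum; idxmin, idxmax: index label of the minimum/maximum
sum; mean: average; var: sample variance; std: sample standard deviation; kurt: kurtosis; cumsum: cumulative sum; cummin/cummax: cumulative minimum/maximum; pct_change: percent change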
In [63]: df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])

In [64]: df
Out[64]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [66]: df.sum()
Out[66]:
one    9.25
two   -5.80
dtype: float64

# Recorded on an older pandas; since pandas 0.22 an all-NaN row sums to 0.0 unless min_count=1 is passed
In [67]: df.sum(axis=1)
Out[67]:
a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

# Mean; skipna controls whether NaN values are skipped
In [68]: df.mean(axis=1,skipna=False)
Out[68]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [70]: df.idxmax()
Out[70]:
one    b
two    d
dtype: object

In [71]: df.cumsum()
Out[71]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

In [72]: df.describe()
Out[72]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
Some summary statistics
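Several methods from the list (var, kurt, cummax, pct_change, argmax) are not shown in the session. A small sketch on a made-up Series, assuming a recent pandas version in which Series.argmax returns an integer position:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 3.0], index=['a', 'b', 'c', 'd'])

label_of_max = s.idxmax()      # index label of the maximum
position_of_max = s.argmax()   # integer position of the maximum

sample_var = s.var()           # sample variance (divides by n - 1)
kurtosis = s.kurt()            # kurtosis of the values

running_max = s.cummax()       # cumulative maximum
pct = s.pct_change()           # change relative to the previous element

print(label_of_max, position_of_max, sample_var, kurtosis)
print(running_max.tolist())
print(pct.tolist())
```

The first element of pct_change is NaN because it has no predecessor.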
IX: Unique values, value counts, and membership
unique method; value_counts (also available as a top-level pandas function); isin method
In [74]: seri = pd.Series(['c','a','d','a','a','b','b','c','c'])

In [75]: seri
Out[75]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [76]: seri.unique()
Out[76]: array(['c', 'a', 'd', 'b'], dtype=object)

In [77]: seri.value_counts()
Out[77]:
c    3
a    3
b    2
d    1
dtype: int64

In [78]: pd.value_counts(seri.values,sort=False)
Out[78]:
a    3
c    3
b    2
d    1
dtype: int64

In [81]: seri.isin(['b','c'])
Out[81]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
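Unique values, value counts, and membership
X: Handling missing data
A) Dropping NaN: the dropna method
1) Series
Python's None is also treated as missing data, just like NumPy's NaN.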
In [3]: seri = pd.Series(['aaa','bbb',np.nan,'ccc'])

In [4]: seri[0]=None

In [5]: seri
Out[5]:
0    None
1     bbb
2     NaN
3     ccc
dtype: object

In [7]: seri.isnull()
Out[7]:
0     True
1    False
2     True
3    False
dtype: bool

In [8]: seri.dropna()    # returns the non-missing values
Out[8]:
1    bbb
3    ccc
dtype: object

In [9]: seri
Out[9]:
0    None
1     bbb
2     NaN
3     ccc
dtype: object

In [10]: seri[seri.notnull()]    # also returns the non-missing values
Out[10]:
1    bbb
3    ccc
dtype: object
Missing data in a Series
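To make the statement about None concrete, here is a short sketch (the variable names are illustrative). In a numeric Series, None is stored as NaN, and in both cases the missing-data checks report it as missing:

```python
import numpy as np
import pandas as pd

# None in numeric data is converted to NaN, so the dtype becomes float64
numbers = pd.Series([1, None, 3])

# In string data, None and NaN are both reported as missing
strings = pd.Series(['aaa', None, np.nan])

print(numbers)
print(numbers.isna().tolist())
print(strings.isna().tolist())
print(numbers.dropna().tolist())
```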
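2) DataFrame
For a DataFrame things are slightly more complicated: sometimes you want to drop only the rows or columns that are entirely NaN, and sometimes every row or column that contains any NaN.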
In [15]: df = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])

In [16]: df
Out[16]:
    0    1   2
0   1  6.5   3
1   1  NaN NaN
2 NaN  NaN NaN
3 NaN  6.5   3

In [17]: df.dropna()    # works on rows by default (axis=0) and drops any row containing a NaN
Out[17]:
   0    1  2
0  1  6.5  3

In [19]: df.dropna(how='all')    # drops only the rows that are entirely NaN
Out[19]:
    0    1   2
0   1  6.5   3
1   1  NaN NaN
3 NaN  6.5   3

In [21]: df.dropna(axis=1,how='all')    # applies the same rule to columns
Out[21]:
    0    1   2
0   1  6.5   3
1   1  NaN NaN
2 NaN  NaN NaN
3 NaN  6.5   3

In [22]: df.dropna(axis=1)
Out[22]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
Missing data in a DataFrame
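Between "any NaN" and "all NaN" there are two further options that the session does not cover: the `thresh` and `subset` parameters of dropna. A brief sketch reusing the same `df` (the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 6.5, 3],
                   [1, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3]])

# Keep only rows that have at least 2 non-NaN values
at_least_two = df.dropna(thresh=2)

# Only look at column 0 when deciding which rows to drop
col0_present = df.dropna(subset=[0])

print(at_least_two)
print(col0_present)
```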
B) Filling NaN: the fillna method
In [88]: df
Out[88]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [90]: df.fillna(0)
Out[90]:
    one  two
a  1.40  0.0
b  7.10 -4.5
c  0.00  0.0
d  0.75 -1.3
Filling NaN
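A constant is not the only possible fill. As a hedged sketch on the same `df`, fillna also accepts a dict with one value per column or a Series such as the column means, and ffill propagates the last valid value forward (the fill values and names here are chosen for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# A different fill value for each column
per_column = df.fillna({'one': 0, 'two': -1})

# Carry the previous valid value forward; a leading NaN stays NaN
forward = df.ffill()

# Fill each column with its own mean
with_mean = df.fillna(df.mean())

print(per_column)
print(forward)
print(with_mean)
```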
XI: Hierarchical indexing
In [30]: seri = pd.Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

In [31]: seri
Out[31]:
a  1    0.528387
   2   -0.152286
   3   -0.776540
b  1    0.025425
   2   -1.412776
   3    0.969498
c  1    0.478260
   2    0.116301
d  2    1.464144
   3    2.266069
dtype: float64

In [32]: seri['a']
Out[32]:
1    0.528387
2   -0.152286
3   -0.776540
dtype: float64

In [33]: seri.index
Out[33]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [35]: seri['a':'c']
Out[35]:
a  1    0.528387
   2   -0.152286
   3   -0.776540
b  1    0.025425
   2   -1.412776
   3    0.969498
c  1    0.478260
   2    0.116301
dtype: float64

In [45]: seri.unstack()
Out[45]:
          1         2         3
a  0.528387 -0.152286 -0.776540
b  0.025425 -1.412776  0.969498
c  0.478260  0.116301       NaN
d       NaN  1.464144  2.266069

In [46]: seri.unstack().stack()
Out[46]:
a  1    0.528387
   2   -0.152286
   3   -0.776540
b  1    0.025425
   2   -1.412776
   3    0.969498
c  1    0.478260
   2    0.116301
d  2    1.464144
   3    2.266069
dtype: float64
Hierarchical indexing on a Series; the unstack method converts it into a DataFrame
In [48]: df = pd.DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

In [49]: df
Out[49]:
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

In [50]: df.index
Out[50]:
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [51]: df.columns
Out[51]:
MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
           labels=[[1, 1, 0], [0, 1, 0]])

In [53]: df['Ohio']
Out[53]:
     Green  Red
a 1      0    1
  2      3    4
b 1      6    7
  2      9   10

# .ix was removed in pandas 1.0; this session was recorded on an older version
In [57]: df.ix['a','Ohio']
Out[57]:
   Green  Red
1      0    1
2      3    4

In [61]: df.ix['a','Ohio'].ix[1,'Red']
Out[61]: 1
Hierarchical indexing on a DataFrame
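The `.ix` lookups in the session can be written with `.loc` and `xs` in current pandas. The sketch below is one possible translation on the same `df`, assuming a recent pandas version. Full tuples are used as keys so that the row and column parts are unambiguous, and the result names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])

# A single cell: full row tuple and full column tuple
cell = df.loc[('a', 1), ('Ohio', 'Red')]

# Outer row label 'a', then outer column label 'Ohio'
block = df.loc['a']['Ohio']

# Cross-section: every row whose inner level equals 1
inner = df.xs(1, level=1)

print(cell)
print(block)
print(inner)
```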