问题导读:
1.pandas的数据结构介绍
2.利用pandas对数据的基本操作
3.汇总和计算描述统计方法
4.处理缺失数据
5.层次化索引
解决方案:
pandas的数据结构介绍:
(1)Series
- Series 是一种类似于一维数组的对象,它由一组数据(各种Numpy数据类型)以及与之相关的数据标签(索引)组成
In [6]: obj = pd.Series([1,2,3,4]) In [7]: obj Out[7]: 0 1 1 2 2 3 3 4 dtype: int64
- Series 字符串的表现形式为:索引在左边,值在右边
In [8]: obj.values Out[8]: array([1, 2, 3, 4]) In [9]: obj.index Out[9]: Int64Index([0, 1, 2, 3], dtype='int64')
- 创建的Series 带有一个可以对各个数据点进行标记的索引
In [10]: obj2 = pd.Series([2,4,1,6],index=['a','b','c','d']) In [11]: obj2 Out[11]: a 2 b 4 c 1 d 6 dtype: int64 In [12]: obj2.c Out[12]: 1
- 计算(根据bool型数据过滤、标量乘法、应用数学函数)之后index 和 values 依然保持连接
In [18]: obj2[obj2 > 0] Out[18]: a 2 b 4 c 1 d 6 dtype: int64 In [19]: obj2 * 2 Out[19]: a 4 b 8 c 2 d 12 dtype: int64 In [20]: np.exp(obj2) Out[20]: a 7.389056 b 54.598150 c 2.718282 d 403.428793 dtype: float64
- 通过python字典直接创建Series
In [21]: sdata = {'Ohio':2000, 'Texas':3000, 'Utah':3425, 'Oregon':3908} In [22]: obj3 = pd.Series(sdata) In [23]: obj3 Out[23]: Ohio 2000 Oregon 3908 Texas 3000 Utah 3425 dtype: int64 In [24]: obj3.values Out[24]: array([2000, 3908, 3000, 3425]) In [25]: obj3.index Out[25]: Index([u'Ohio', u'Oregon', u'Texas', u'Utah'], dtype='object')
- 如果只传入一个字典,则结果中的索引就是按照原字典的键(有序排列)
- 如果传入的index 列表中的元素和字典中的键相互匹配不到,则该索引位置的values 为NaN
n [26]: states = ['california','Ohio','Oregon','Texas'] In [27]: obj4 = pd.Series(sdata,index = states) In [28]: obj4 Out[28]: california NaN Ohio 2000 Oregon 3908 Texas 3000 dtype: float64
- pandas 的isnull 和notnull 函数可以用于检测缺失数据
In [34]: obj4.isnull() Out[34]: california True Ohio False Oregon False Texas False dtype: bool In [35]: obj4.notnull() Out[35]: california False Ohio True Oregon True Texas True dtype: bool
- Series 一个重要的功能是:它在运算中会自动对齐不同索引的数据
In [36]: obj3 Out[36]: Ohio 2000 Oregon 3908 Texas 3000 Utah 3425 dtype: int64 In [37]: obj4 Out[37]: california NaN Ohio 2000 Oregon 3908 Texas 3000 dtype: float64 In [38]: obj3 + obj4 Out[38]: Ohio 4000 Oregon 7816 Texas 6000 Utah NaN california NaN dtype: float64
- Series 对象本身 及其索引都有一个name 属性
- Series 的索引可以通过赋值的方式就地修改
In [40]: obj4.name = 'population' In [41]: obj4.index.name = 'state' In [42]: obj4 Out[42]: state california NaN Ohio 2000 Oregon 3908 Texas 3000 Name: population, dtype: float64 In [43]: obj Out[43]: 0 1 1 2 2 3 3 4 dtype: int64 In [45]: obj.index = ['a','b','c','d'] In [46]: obj Out[46]: a 1 b 2 c 3 d 4 dtype: int64
(2)DataFrame
- DataFrame 是一个表格型的数据结构,每列可以是不同的值类型(数值,字符串,布尔值)。
- DataFrame 可以被看做由Series组成的字典(共用同一个索引)
- 通过字典创建DataFrame:每一columns对应的是字典中每个key - value 对,有index 行
In [16]: data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year':[2000, 2001, 2002, 2003, 2004], 'pop':[1.5, 1.7, 3.6, 2.4, 2.9]} In [17]: data Out[17]: {'pop': [1.5, 1.7, 3.6, 2.4, 2.9], 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2003, 2004]} In [18]: frame01 = pd.DataFrame(data,columns = ['year','state','pop'],index = ['1','2','3','4','5']) In [19]: frame01 Out[19]: year state pop 1 2000 Ohio 1.5 2 2001 Ohio 1.7 3 2002 Ohio 3.6 4 2003 Nevada 2.4 5 2004 Nevada 2.9
- 如何指定了列序列,则DataFrame的列就会按照制定顺序进行排序
- 和Series 一样,如果传入的列在数据中找不到,就会产生NA值
- 获取列,获得的是一个series
In [31]: frame01['year'] Out[31]: 1 2000 2 2001 3 2002 4 2003 5 2004 Name: year, dtype: int64 In [32]: frame01['pop'] Out[32]: 1 1.5 2 1.7 3 3.6 4 2.4 5 2.9 Name: pop, dtype: float64 In [33]: frame01.state Out[33]: 1 Ohio 2 Ohio 3 Ohio 4 Nevada 5 Nevada Name: state, dtype: object
- 行可以通过索引字段ix获取
In [5]: frame01.ix[1] Out[5]: year 2001 state Ohio pop 1.7 Name: 2, dtype: object In [6]: frame01.ix['1'] Out[6]: year 2000 state Ohio pop 1.5 Name: 1, dtype: object
- 列的索引依然为DataFrame 的索引
In [35]: frame02 Out[35]: state year pop one Ohio 2000 1.5 two Ohio 2001 1.7 three Ohio 2002 3.6 four Nevada 2003 2.4 five Nevada 2004 2.9 In [36]: frame02.state Out[36]: one Ohio two Ohio three Ohio four Nevada five Nevada Name: state, dtype: object In [37]: frame02.year Out[37]: one 2000 two 2001 three 2002 four 2003 five 2004 Name: year, dtype: int64
- 修改元素
- 列可以通过赋值的方式进行修改
- 将列表或数组赋值给某个列时,其长度必须和DataFrame的长度相匹配
- 如果赋值的是一个Series,就会精确匹配DataFrame的索引,所有的空位都将被填上缺失值
In [22]: frame Out[22]: state year pop one Ohio 1 1 two Ohio 1 1 three Ohio 1 1 four Nevada 1 1 five Nevada 1 1 In [23]: frame.year = np.arange(2011,2016) In [24]: frame.pop = np.random.randn(5) In [25]: frame Out[25]: state year pop one Ohio 2011 0.968678 two Ohio 2012 0.280024 three Ohio 2013 -0.396301 four Nevada 2014 -0.132369 five Nevada 2015 -0.181226 In [40]: frame Out[40]: state year pop debt one Ohio 1 1 NaN two Ohio 1 1 NaN three Ohio 1 1 NaN four Nevada 1 1 NaN five Nevada 1 1 NaN In [41]: val = pd.Series([-1.2,-1.5,-1.7], index = ['two', 'four', 'five']) In [42]: frame['dedt'] = val In [43]: frame Out[43]: state year pop debt dedt one Ohio 1 1 NaN NaN two Ohio 1 1 NaN -1.2 three Ohio 1 1 NaN NaN four Nevada 1 1 NaN -1.5 five Nevada 1 1 NaN -1.7
- 删除列
In [44]: frame['newline'] = frame.state != 'Ohio' In [45]: frame Out[45]: state year pop debt dedt newline one Ohio 1 1 NaN NaN False two Ohio 1 1 NaN -1.2 False three Ohio 1 1 NaN NaN False four Nevada 1 1 NaN -1.5 True five Nevada 1 1 NaN -1.7 True In [46]: del frame['newline'] In [47]: frame Out[47]: state year pop debt dedt one Ohio 1 1 NaN NaN two Ohio 1 1 NaN -1.2 three Ohio 1 1 NaN NaN four Nevada 1 1 NaN -1.5 five Nevada 1 1 NaN -1.7
- 补充说明
(1)DataFrame的index 和 columns 的 name 属性
In [69]: frame.columns.name = 'state' In [70]: frame.index.name = 'myname' In [71]: frame.columns.name = 'state' In [72]: frame Out[72]: state state year pop debt dedt myname one Ohio 1 1 NaN NaN two Ohio 1 1 NaN -1.2 three Ohio 1 1 NaN NaN four Nevada 1 1 NaN -1.5 five Nevada 1 1 NaN -1.7
(2)values 属性也会以二维ndarray 的形式返回DataFrame 中的数据
In [73]: frame.values Out[73]: array([['Ohio', '1', '1', nan, nan], ['Ohio', '1', '1', nan, -1.2], ['Ohio', '1', '1', nan, nan], ['Nevada', '1', '1', nan, -1.5], ['Nevada', '1', '1', nan, -1.7]], dtype=object)
(3)可以输入给DataFrame构造器的数据
(3)索引对象
- pandas 的索引对象负责管理轴标签和其他元数据(比如轴名称)
n [4]: obj = pd.Series(range(3),index=['a','b','c']) In [5]: obj Out[5]: a 0 b 1 c 2 dtype: int64 In [6]: index = obj.index In [7]: index Out[7]: Index([u'a', u'b', u'c'], dtype='object') In [8]: index[1:] Out[8]: Index([u'b', u'c'], dtype='object') In [9]: index[
- Index 对象是不可以修改的
- 不可修改性很重要,因为这样才能使Index对象在多个数据结构之间安全共享
n [4]: obj = pd.Series(range(3),index=['a','b','c']) In [5]: obj Out[5]: a 0 b 1 c 2 dtype: int64 In [6]: index = obj.index In [7]: index Out[7]: Index([u'a', u'b', u'c'], dtype='object') In [8]: index[1:] Out[8]: Index([u'b', u'c'], dtype='object')
In [42]: index Out[42]: Index([u'a', u'b', u'c'], dtype='object') In [43]: obj2 = pd.Series(range(3),index = index) In [44]: obj2.index obj2.index In [44]: obj2.index is index Out[44]: True In [45]: index Out[45]: Index([u'a', u'b', u'c'], dtype='object') In [46]: obj2 = pd.Series(range(3),index = index) In [47]: obj2.index is index Out[47]: True
- Python中的对象包含三要素:id、type、value
其中id用来唯一标识一个对象,type标识对象的类型,value是对象的值
is判断的是a对象是否就是b对象,是通过id来判断的
==判断的是a对象的值是否和b对象的值相等,是通过value来判断的
In [48]: index2=index In [49]: index2 Out[49]: Index([u'a', u'b', u'c'], dtype='object') In [50]: obj2.index is index2 Out[50]: True In [51]: index3 = ['a','b','c'] In [52]: index3 is index Out[52]: False In [53]: obj2.index is index3 Out[53]: False
- pandas中主要的Index对象
- Index的功能也类似一个固定大小的集合
In [28]: frame Out[28]: Nevada Ohio 2000 2.2 1.5 2001 2.1 1.7 2002 2.1 3.6 In [29]: 'Ohio' in frame.columns Out[29]: True In [30]: 2002 in frame.index Out[30]: True In [31]: 1999 in frame.index Out[31]: False
- Index的方法和属性
pandas数据处理基本方法:
(1)重新索引
- pandas对象的一个重要的方法是reindex,其作用是创建一个适应新索引的新对象
- 调用Series 的 reindex 将会根据新索引进行重排,如果某个索引值当前不存在,就引入缺失值
In [38]: obj = pd.Series([4.5,7.2,-5.3,3.8], index = ['d','b','a','c']) In [39]: obj Out[39]: d 4.5 b 7.2 a -5.3 c 3.8 dtype: float64 In [40]: obj2 = obj.reindex(['a','b','c','d','e']) In [41]: obj2 Out[41]: a -5.3 b 7.2 c 3.8 d 4.5 e NaN dtype: float64
- 为了防止索引值不存在的情况可以用 fill_value 值来设置无索引的默认值
In [42]: obj2 = obj.reindex(['a','b','c','d','e'],fill_value = 0) In [43]: obj2 Out[43]: a -5.3 b 7.2 c 3.8 d 4.5 e 0.0 dtype: float64
- 对于时间序列这样的有序数据,重新索引时可以需要做一些插值操作处理,method选项即可达到此目的
- ffill可以实现向前值填充
In [44]: obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4]) In [45]: obj3.reindex(range(6),method='ffill') Out[45]: 0 blue 1 blue 2 purple 3 purple 4 yellow 5 yellow dtype: object
- reindex的(插值)method选项
- 对于DataFrame,reindex可以修改(行)索引,列,或两个都修改
- 如果仅仅传入一个序列,则会重新索引行
In [47]: frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'], ....: columns=['Ohio','Texas','California']) In [48]: frame Out[48]: Ohio Texas California a 0 1 2 c 3 4 5 d 6 7 8 In [49]: frame2 = frame.reindex(['a','b','c','d']) In [50]: frame2 Out[50]: Ohio Texas California a 0 1 2 b NaN NaN NaN c 3 4 5 d 6 7 8 In [51]: states = ['Texas','Utah','California'] In [52]: frame.reindex(columns=states) Out[52]: Texas Utah California a 1 NaN 2 c 4 NaN 5 d 7 NaN 8
- 对行和列同时进行索引,而插值则只能按行应用(即轴0)
In [54]: frame.reindex(index=['a','b','c','d'],method = 'ffill', ....: columns = states) Out[54]: Texas Utah California a 1 NaN 2 b 1 NaN 2 c 4 NaN 5 d 7 NaN 8
- 利用ix的标签索引功能索引
In [56]: frame.ix[['a','c'],'Texas'] Out[56]: a 1 c 4 Name: Texas, dtype: int64
- reindex函数的参数
(2)丢弃指定轴上的项
- drop方法返回的是一个在指定轴上删除了指定值的新对象
In [58]: obj = pd.Series(np.arange(5.),index=['a','b','c','d','e']) In [59]: new_obj = obj.drop('c') In [60]: obj Out[60]: a 0 b 1 c 2 d 3 e 4 dtype: float64 In [61]: new_obj Out[61]: a 0 b 1 d 3 e 4 dtype: float64
- DataFrame可以删除任意轴上的索引值
In [68]: data = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'], columns=['one','two','three','four']) In [69]: data Out[69]: one two three four Ohio 0 1 2 3 Colorado 4 5 6 7 Utah 8 9 10 11 New York 12 13 14 15 In [70]: data.drop(['Colorado','Ohio']) Out[70]: one two three four Utah 8 9 10 11 New York 12 13 14 15 In [71]: data.drop('two',axis=1) Out[71]: one three four Ohio 0 2 3 Colorado 4 6 7 Utah 8 10 11 New York 12 14 15 In [72]: data.drop(['one','four'],axis=1) Out[72]: two three Ohio 1 2 Colorado 5 6 Utah 9 10 New York 13 14
(3)索引、选取和过滤
- Series 索引工作方式类似与numpy数组的索引,只不过Series的索引值不只是整数
In [3]: obj = pd.Series(np.arange(4.),index=['a','b','c','d']) In [4]: obj Out[4]: a 0 b 1 c 2 d 3 dtype: float64 In [5]: obj['b'] Out[5]: 1.0 In [6]: obj[1] Out[6]: 1.0 In [7]: obj[2:4] Out[7]: c 2 d 3 dtype: float64 In [8]: obj[['b','d']] Out[8]: b 1 d 3 dtype: float64 In [9]: obj[['1','3']] Out[9]: 1 NaN 3 NaN dtype: float64 In [10]: obj[[1,3]] Out[10]: b 1 d 3 dtype: float64 In [11]: obj[obj<2] Out[11]: a 0 b 1 dtype: float64
- 利用标签的切片运算与普通的python切片运算不同,其末端是包含的,而普通的切片不包含
In [18]: obj[0:3] Out[18]: a 0 b 1 c 2 dtype: float64 In [19]: obj['a':'d'] Out[19]: a 0 b 1 c 2 d 3 dtype: float64
- 切片赋值
In [21]: obj['a':'c'] = 1 In [22]: obj Out[22]: a 1 b 1 c 1 d 3 dtype: float64
- 对DataFrame进行索引就是获取一个或多个列
- 索引方式有几个特殊的情况。首先通过切片或布尔型数组选取行
In [24]: data = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'], ....: columns=['one','two','three','four']) In [25]: data Out[25]: one two three four Ohio 0 1 2 3 Colorado 4 5 6 7 Utah 8 9 10 11 New York 12 13 14 15 [4 rows x 4 columns] In [26]: data[:2] Out[26]: one two three four Ohio 0 1 2 3 Colorado 4 5 6 7 [2 rows x 4 columns] In [27]: data[data['three'] > 5] Out[27]: one two three four Colorado 4 5 6 7 Utah 8 9 10 11 New York 12 13 14 15 [3 rows x 4 columns]
- 上面那种方式比较不符合逻辑,毕竟来源于实践
- 另一种方式可以通过布尔型DataFrame来进行索引
In [38]: data < 5 Out[38]: one two three four Ohio True True True True Colorado True False False False Utah False False False False New York False False False False [4 rows x 4 columns] In [39]: data[data < 5] = 0 In [40]: data Out[40]: one two three four Ohio 0 0 0 0 Colorado 0 5 6 7 Utah 8 9 10 11 New York 12 13 14 15 [4 rows x 4 columns]
- 使用ix专门的索引字段
In [44]: data.ix['Colorado',['tow','three']] Out[44]: tow NaN three 6 Name: Colorado, dtype: float64 In [45]: data.ix['Colorado',['two','three']] Out[45]: two 5 three 6 Name: Colorado, dtype: int64 In [46]: data.ix[['Colorado','Utah'],[3,1,0]] Out[46]: four two one Colorado 7 5 0 Utah 11 9 8 [2 rows x 3 columns] In [47]: data.ix[2] Out[47]: one 8 two 9 three 10 four 11 Name: Utah, dtype: int64 In [48]: data.ix[:'Utah','two'] Out[48]: Ohio 0 Colorado 5 Utah 9 Name: two, dtype: int64 In [49]: data.ix[data.three > 5,:3] Out[49]: one two three Colorado 0 5 6 Utah 8 9 10 New York 12 13 14 [3 rows x 3 columns]
- DataFrame的索引选项:
(4)算数运算和数据对齐
- pandas最重要的一个功能是可以对不同索引值的对象进行算数运算。
- 如果存在不同的索引对,则结果的索引就是该索引对的并集
In [51]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index = ['a','c','d','e']) In [52]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a','c','e','f','g']) In [53]: s1 Out[53]: a 7.3 c -2.5 d 3.4 e 1.5 dtype: float64 In [54]: s2 Out[54]: a -2.1 c 3.6 e -1.5 f 4.0 g 3.1 dtype: float64 In [55]: s1 + s2 Out[55]: a 5.2 c 1.1 d NaN e 0.0 f NaN g NaN dtype: float64
- 对于DataFrame,对齐操作会同时发生在行和列上
In [61]: df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'), index = ['Ohio','Texas','Colorado']) In [62]: df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'), index = ['Utah','Ohio','Texas','Oregon']) In [63]: df1 Out[63]: b c d Ohio 0 1 2 Texas 3 4 5 Colorado 6 7 8 [3 rows x 3 columns] In [64]: df2 Out[64]: b d e Utah 0 1 2 Ohio 3 4 5 Texas 6 7 8 Oregon 9 10 11 [4 rows x 3 columns] In [65]: df1 + df2 Out[65]: b c d e Colorado NaN NaN NaN NaN Ohio 3 NaN 6 NaN Oregon NaN NaN NaN NaN Texas 9 NaN 12 NaN Utah NaN NaN NaN NaN [5 rows x 4 columns]
4.1 在算数方法中填充值
- 在对不同索引对象进行算数运算时,你可能希望当一个对象中某个轴标签在另一个对象中找不到时填充一个特殊值
In [66]: df1 = pd.DataFrame(np.arange(12.).reshape((3,4)),columns=list('abcd')) In [67]: df2 = pd.DataFrame(np.arange(20.).reshape((4,5)),columns=list('abcde')) In [68]: df1 + df2 Out[68]: a b c d e 0 0 2 4 6 NaN 1 9 11 13 15 NaN 2 18 20 22 24 NaN 3 NaN NaN NaN NaN NaN [4 rows x 5 columns] In [69]: df1.add(df2,fill_value=0) Out[69]: a b c d e 0 0 2 4 6 4 1 9 11 13 15 9 2 18 20 22 24 14 3 15 16 17 18 19 [4 rows x 5 columns]
- 与此类似,Series 或 DataFrame 重新索引时,也可以指定一个填充值
In [72]: df1.reindex(columns=df2.columns,fill_value=0) Out[72]: a b c d e 0 0 1 2 3 0 1 4 5 6 7 0 2 8 9 10 11 0 [3 rows x 5 columns]
- 算术方法
4.2 DataFrame和Series之间的运算
- 先来计算一个二维数组与其某行之间的差
- 这样的运算叫做广播
In [73]: arr = np.arange(12.).reshape((3,4)) In [74]: arr Out[74]: array([[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.]]) In [75]: arr[0] Out[75]: array([ 0., 1., 2., 3.]) In [76]: array([0.,1.,2.,3.]) Out[76]: array([ 0., 1., 2., 3.]) In [77]: arr - arr[0] Out[77]: array([[ 0., 0., 0., 0.], [ 4., 4., 4., 4.], [ 8., 8., 8., 8.]])
- DataFrame与Series 之间的运算和上面差不多
- 默认情况下,DataFrame 和 Series 之间的算术运算会将Series的索引匹配到DataFrame的列,然后沿着行向下广播
In [78]: frame = pd.DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon']) In [79]: series = frame.ix[0] In [80]: frame Out[80]: b d e Utah 0 1 2 Ohio 3 4 5 Texas 6 7 8 Oregon 9 10 11 [4 rows x 3 columns] In [81]: series Out[81]: b 0 d 1 e 2 Name: Utah, dtype: float64 In [82]: frame - series Out[82]: b d e Utah 0 0 0 Ohio 3 3 3 Texas 6 6 6 Oregon 9 9 9 [4 rows x 3 columns]
- 如果某个索引值在DataFrame的列或Series的索引中找不到,则参与运算的两个对象就会被重新索引以形成并集
In [89]: series2 = pd.Series(range(3),index=['b','e','f']) In [90]: frame + series2 Out[90]: b d e f Utah 0 NaN 3 NaN Ohio 3 NaN 6 NaN Texas 6 NaN 9 NaN Oregon 9 NaN 12 NaN [4 rows x 4 columns]
- 如果你希望匹配行且在列上广播,则必须使用算术运算方法
In [98]: series3 = frame['d'] In [99]: frame Out[99]: b d e Utah 0 1 2 Ohio 3 4 5 Texas 6 7 8 Oregon 9 10 11 [4 rows x 3 columns] In [100]: series3 Out[100]: Utah 1 Ohio 4 Texas 7 Oregon 10 Name: d, dtype: float64 In [101]: frame.sub(series3,axis=0) Out[101]: b d e Utah -1 0 1 Ohio -1 0 1 Texas -1 0 1 Oregon -1 0 1 [4 rows x 3 columns]
(5)函数应用和映射
- NumPy 的ufuncs(元素级数组方法)也可用于操作pandas对象
- abs() 函数返回数字的绝对值
<span style="font-family:Microsoft Yahei, Helvetica Neue, Helvetica, Arial, sans-serif;color:#333333;"><span style="line-height: 24px;">In [102]: frame = pd.DataFrame(np.random.randn(4,3),columns=list('bde'), .....: index=['Utah','Ohio','Texas','Oregon']) In [103]: frame Out[103]: b d e Utah 0.752307 -1.701850 0.231538 Ohio -1.156549 -0.116461 -0.224967 Texas -1.919040 0.287937 0.580048 Oregon -1.817225 1.545293 -0.606694 [4 rows x 3 columns] In [104]: np.abs(frame) Out[104]: b d e Utah 0.752307 1.701850 0.231538 Ohio 1.156549 0.116461 0.224967 Texas 1.919040 0.287937 0.580048 Oregon 1.817225 1.545293 0.606694 [4 rows x 3 columns]</span></span>
- DataFrame的apply 方法 将函数应用到由各列或行所形成的一维数组上
-
lambda只是一个表达式,函数体比def简单很多。
lambda的主体是一个表达式,而不是一个代码块。仅仅能在lambda表达式中封装有限的逻辑进去。
lambda表达式是起到一个函数速写的作用。允许在代码内嵌入一个函数的定义。
In [106]: f = lambda x:x.max() - x.min() In [107]: frame.apply(f) Out[107]: b 2.671347 d 3.247143 e 1.186742 dtype: float64 In [108]: frame.apply(f,axis=1) Out[108]: Utah 2.454157 Ohio 1.040089 Texas 2.499088 Oregon 3.362518 dtype: float64
- 除了标量外,传递给apply的函数还可以返回由多个值组成的Series
In [120]: def f(x): return pd.Series([x.min(),x.max()],index=['min','max']) .....: In [121]: frame.apply(f) Out[121]: b d e min -1.919040 -1.701850 -0.606694 max 0.752307 1.545293 0.580048 [2 rows x 3 columns]
- 元素级的Python函数也是可以用的
- format 是 得到浮点值的格式化字符串
- 使用applymap函数
In [123]: format = lambda x:'%.2f' % x In [124]: frame.applymap(format) Out[124]: b d e Utah 0.75 -1.70 0.23 Ohio -1.16 -0.12 -0.22 Texas -1.92 0.29 0.58 Oregon -1.82 1.55 -0.61 [4 rows x 3 columns]
- Series有个应用与元素级别的函数的map方法
In [125]: frame['e'].map(format) Out[125]: Utah 0.23 Ohio -0.22 Texas 0.58 Oregon -0.61 Name: e, dtype: object
- 要对行或列索引进行排序(按字典顺序),可以使用sort_index方法,它将返回一个已排序的新对象
In [127]: obj = pd.Series(range(4),index = ['b','a','d','c']) In [128]: obj.sort_index() Out[128]: a 1 b 0 c 3 d 2 dtype: int64
- 对于DataFrame则可以根据任意一个轴上的索引进行排序
In [132]: frame Out[132]: d a b c three 0 1 2 3 one 4 5 6 7 [2 rows x 4 columns] In [133]: frame.sort_index() Out[133]: d a b c one 4 5 6 7 three 0 1 2 3 [2 rows x 4 columns] In [134]: frame.sort_index(axis=1) Out[134]: a b c d three 1 2 3 0 one 5 6 7 4 [2 rows x 4 columns]
- 数据默认是按照升序排序的,但也可以降序排序
In [135]: frame.sort_index(axis=1,ascending=False) Out[135]: d c b a three 0 3 2 1 one 4 7 6 5 [2 rows x 4 columns]
- 如果按值对Series进行排序,可使用其order方法
In [137]: obj = pd.Series([4,7,-3,2]) In [138]: obj.order() Out[138]: 2 -3 3 2 0 4 1 7 dtype: int64
- 在排序时,任何缺失值默认都会被放在Series的末尾
In [139]: obj = pd.Series([4,np.nan,7,np.nan,-3,2]) In [140]: obj.order() Out[140]: 4 -3 5 2 0 4 2 7 1 NaN 3 NaN dtype: float64
- 在DataFrame 上,你可能希望根据一个或多个列中的值进行排序
- 我们只要将一个或多个列中的名字传递给by选项即可
In [141]: frame = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]}) In [142]: frame Out[142]: a b 0 0 4 1 1 7 2 0 -3 3 1 2 [4 rows x 2 columns] In [143]: frame.sort_index(by='b') Out[143]: a b 2 0 -3 3 1 2 0 0 4 1 1 7 [4 rows x 2 columns] In [144]: frame.sort_index(by=['a','b']) Out[144]: a b 2 0 -3 0 0 4 3 1 2 1 1 7 [4 rows x 2 columns]
- 排名:与排序关系密切,它会增设一个排名值(从1开始,一直到数组中有效的数据);有时候它会根据某种规则破坏平级关系
- 默认情况下rank是通过‘为各组分配一个平均排名’的方式破坏平级关系
In [153]: obj = pd.Series([7,-5,7,4,2,0,4]) In [154]: obj.rank() Out[154]: 0 6.5 1 1.0 2 6.5 3 4.5 4 3.0 5 2.0 6 4.5 dtype: float64
- 要根据原数据中顺序给出排名
- 也可以按降序进行排名
- DataFrame 可以在行或列上计算排名(只需要声明axis即可),这里不再示范了
In [155]: obj.rank(method='first') Out[155]: 0 6 1 1 2 7 3 4 4 3 5 2 6 5 dtype: float64 In [156]: obj.rank(ascending=False,method='max') Out[156]: 0 2 1 7 2 2 3 4 4 5 5 6 6 4 dtype: float64
- 排名时用于破坏平级关系的method选项
(6)带有重复值的轴索引
- 许多pandas函数(如reindex)都要求标签唯一,但这并不是强制性
- 索引的is_unique属性可以告诉你它的值是否是唯一
In [157]: obj = pd.Series(range(5),index = ['a','a','b','b','c']) In [158]: obj Out[158]: a 0 a 1 b 2 b 3 c 4 dtype: int64 In [159]: obj.index.is_unique Out[159]: False
- 对于带有重复值的索引,数据选取的行为将会有些不同
- 某个索引对应多个值,将会返回一个Series
- 而对应单个值,则返回一个标量值
In [161]: obj['a'] Out[161]: a 0 a 1 dtype: int64 In [162]: obj['b'] Out[162]: b 2 b 3 dtype: int64
- 对DataFrame 进行索引的时候也是这样
- 对于重复索引将返回一个DataFrame
In [163]: df = pd.DataFrame(np.random.randn(4,3),index=['a','a','b','b']) In [164]: df Out[164]: 0 1 2 a 0.788165 -0.046659 0.781164 a -1.329601 0.245623 0.074344 b 0.126223 0.141285 -2.280768 b 1.605712 -0.450426 -0.083235 [4 rows x 3 columns] In [165]: df.ix['b'] Out[165]: 0 1 2 b 0.126223 0.141285 -2.280768 b 1.605712 -0.450426 -0.083235 [2 rows x 3 columns]
汇总和计算描述统计:
- pandas对象拥有一组常用的数学和统计方法,他们大部分都属于约简和汇总统计
- 他们都是基于没有缺失数据的假设而构建的
- na值会被自动排除,除非整个切片(整个行或者列)都是na ,通过skipna 选项可以禁用该功能
In [54]: df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5], [np.nan,np.nan],[0.75,-1.3]],index = ['a','b','c','d'], ....: columns=['one','two']) In [55]: df Out[55]: one two a 1.40 NaN b 7.10 -4.5 c NaN NaN d 0.75 -1.3 [4 rows x 2 columns] In [56]: df.sum() Out[56]: one 9.25 two -5.80 dtype: float64 In [57]: df.sum(1) Out[57]: a 1.40 b 2.60 c NaN d -0.55 dtype: float64 In [58]: df.mean(axis=1,skipna=False) Out[58]: a NaN b 1.300 c NaN d -0.275 dtype: float64
- 约简方法的选项
- 有些方法返回的是间接统计
- idxmin/idxmax
- 有些方法则是累计型的
In [64]: df.idxmax() Out[64]: one b two d dtype: object In [65]: df.cumsum() Out[65]: one two a 1.40 NaN b 8.50 -4.5 c NaN NaN d 9.25 -5.8 [4 rows x 2 columns]
- describe 即不是约简型也不是累计型它用于一次性产生多个汇总统计
In [67]: df.describe() Out[67]: one two count 3.000000 2.000000 mean 3.083333 -2.900000 std 3.493685 2.262742 min 0.750000 -4.500000 25% 1.075000 -3.700000 50% 1.400000 -2.900000 75% 4.250000 -2.100000 max 7.100000 -1.300000
- 对于非数组型数据,describe会产生另外一种汇总统计
In [69]: obj = pd.Series(['a','a','b','c']*4) In [70]: obj Out[70]: 0 a 1 a 2 b 3 c 4 a 5 a 6 b 7 c 8 a 9 a 10 b 11 c 12 a 13 a 14 b 15 c dtype: object In [71]: obj.describe() Out[71]: count 16 unique 3 top a freq 8 dtype: object
- 描述和汇总统计
(1)相关系数与协方差
- 读取来自Yahoo!Finance的股票价格和成交量
#!/usr/bin/env python # coding=utf-8 import pandas as pd import pandas.io.data as web all_data = {} for ticker in ['AAPL','IBM','MSFT','GOOG']: all_data[ticker] = web.get_data_yahoo(ticker,'1/1/2000','1/1/2010') price = pd.DataFrame({tic:data['Adj Close'] for tic,data in all_data.iteritems()}) volume = pd.DataFrame({tic:data['Volume'] for tic, data in all_data.iteritems()})
- Series的corr方法用于计算两个Series中重叠的,非NA的,按索引对齐的值的相关系数
- cov用于计算协方差
In [33]: returns = price.pct_change() In [34]: returns.tail() Out[34]: AAPL GOOG IBM MSFT Date 2009-12-24 0.034339 0.011117 0.004385 0.002587 2009-12-28 0.012294 0.007098 0.013326 0.005484 2009-12-29 -0.011861 -0.005571 -0.003477 0.007058 2009-12-30 0.012147 0.005376 0.005461 -0.013699 2009-12-31 -0.004300 -0.004416 -0.012597 -0.015504 [5 rows x 4 columns] In [35]: returns.MSFT.corr(returns.IBM) Out[35]: 0.49597969625135241 In [36]: returns.MSFT.cov(returns.IBM) Out[36]: 0.00021595764700046635
- DataFrame的corr 和 cov 方法将以DataFrame的形式返回完整的相关系数或协方差矩阵
In [38]: returns.corr() Out[38]: AAPL GOOG IBM MSFT AAPL 1.000000 0.470676 0.410011 0.424305 GOOG 0.470676 1.000000 0.390689 0.443587 IBM 0.410011 0.390689 1.000000 0.495980 MSFT 0.424305 0.443587 0.495980 1.000000 [4 rows x 4 columns] In [39]: returns.cov() Out[39]: AAPL GOOG IBM MSFT AAPL 0.001027 0.000303 0.000252 0.000309 GOOG 0.000303 0.000580 0.000142 0.000205 IBM 0.000252 0.000142 0.000367 0.000216 MSFT 0.000309 0.000205 0.000216 0.000516 [4 rows x 4 columns]
- 利用DataFrame的corrwidth方法,你可以计算其列或行跟另一个Series或DataFrame之间的相关系数
- 传入一个Series将会返回一个相关系数值Series(针对各列进行计算)
In [40]: returns.corrwith(returns.IBM) Out[40]: AAPL 0.410011 GOOG 0.390689 IBM 1.000000 MSFT 0.495980 dtype: float64
- 传入一个DataFrame则会计算按列名配对的相关系数,这里,我们计算百分比变换与成交量的相关系数
In [43]: returns.corrwith(volume) Out[43]: AAPL -0.057549 GOOG 0.062647 IBM -0.007892 MSFT -0.014245 dtype: float64
- 传入axis=1 即按行进行计算,无论何时,在计算相关系数之前,所有数据项都会按照标签进行对齐
(2)唯一值、值计数以及成员资格
- 这类方法,我们可以从一维Series的值中提取信息
- 第一个函数可以得到Series中的唯一值数组
In [44]: obj = pd.Series(['c','a','d','a','a','b','b','c','c']) In [47]: uniques = obj.unique() In [48]: uniques Out[48]: array(['c', 'a', 'd', 'b'], dtype=object)
- 返回的唯一值是未经排序的,如果需要的话,可以对结果再次进行排序
- value_counts用于计算一个Series中各值出现的频率
In [50]: uniques.sort() In [51]: uniques Out[51]: array(['a', 'b', 'c', 'd'], dtype=object) In [52]: obj.value_counts() Out[52]: c 3 a 3 b 2 d 1 dtype: int64
- value_counts是Series是按值频率降序排列的
- value_counts还是一个顶级pandas方法,可用于任何数组或序列
In [53]: pd.value_counts(obj.values,sort=False) Out[53]: a 3 c 3 b 2 d 1 dtype: int64
- isin函数用于判断矢量化集合的成员资格
- 就是看obj中的每个元素是否在我们传入函数中的那个集合中
In [55]: mask = obj.isin(['a','c']) In [56]: mask Out[56]: 0 True 1 True 2 False 3 True 4 True 5 False 6 False 7 True 8 True dtype: bool In [57]: obj[mask] Out[57]: 0 c 1 a 3 a 4 a 7 c 8 c dtype: object
- 唯一值、值计数、成员资格方法
- 将pandas.value_counts传给该DataFrame的apply函数
- 行索引就是该DataFrame中的所有唯一数据,该Dataframe的值就是出现的频率
In [58]: data = pd.DataFrame({'Qu1':[1,3,4,3,4], ....: 'Qu2':[2,3,1,2,3], ....: 'Qu3':[1,5,2,4,4]}) In [59]: data Out[59]: Qu1 Qu2 Qu3 0 1 2 1 1 3 3 5 2 4 1 2 3 3 2 4 4 4 3 4 [5 rows x 3 columns] In [60]: result = data.apply(pd.value_counts).fillna(0) In [61]: result Out[61]: Qu1 Qu2 Qu3 1 1 1 1 2 0 2 1 3 2 2 0 4 2 0 2 5 0 0 1 [5 rows x 3 columns]
处理缺失数据:
- pandas使用浮点值NaN表示浮点和非浮点数组中的缺失数据,他只是一个便于被检测出来的标记而已
- python 内置的None 值也会被当做NA 处理
In [66]: string_data = pd.Series(['aardvark','artichoke',np.nan,'avocado']) In [67]: string_data Out[67]: 0 aardvark 1 artichoke 2 NaN 3 avocado dtype: object In [69]: string_data.isnull() Out[69]: 0 False 1 False 2 True 3 False dtype: bool In [70]: string_data[0] = None In [71]: string_data.isnull() Out[71]: 0 True 1 False 2 True 3 False dtype: bool
- NA处理方法
(1)滤除缺失数据
- dropna返回一个仅含非空数据和索引值的Series
- 通过布尔型索引也可以达到这个目的
In [72]: from numpy import nan as NA In [73]: data = pd.Series([1,NA,3.5,NA,7]) In [74]: data.dropna() Out[74]: 0 1.0 2 3.5 4 7.0 dtype: float64 In [75]: data[data.notnull()] Out[75]: 0 1.0 2 3.5 4 7.0 dtype: float64
- 对于DataFrame 对象,事情就有点复杂了
- dropna默认丢弃任何含有缺失值的行
- 传入how=‘all‘ 将只丢弃全为NA的那些行
- 传入axis = 1 将删除全为NA的那些列
In [76]: data = pd.DataFrame([[1.,6.5,3.],[1.,NA,NA], ....: [NA,NA,NA],[NA,6.5,3.]]) In [77]: cleaned = data.dropna() In [78]: data Out[78]: 0 1 2 0 1 6.5 3 1 1 NaN NaN 2 NaN NaN NaN 3 NaN 6.5 3 [4 rows x 3 columns] In [79]: cleaned Out[79]: 0 1 2 0 1 6.5 3 [1 rows x 3 columns] In [80]: data.dropna(how='all') Out[80]: 0 1 2 0 1 6.5 3 1 1 NaN NaN 3 NaN 6.5 3 [3 rows x 3 columns] In [81]: data.dropna(axis=1,how='all') Out[81]: 0 1 2 0 1 6.5 3 1 1 NaN NaN 2 NaN NaN NaN 3 NaN 6.5 3 [4 rows x 3 columns]
- 当你只想留下某些数据时,可以利用thresh 参数来设定
In [90]: df.ix[:4,1] = NA;df.ix[:3,2] = NA In [91]: df Out[91]: 0 1 2 0 0.739899 NaN NaN 1 0.124309 NaN NaN 2 0.960898 NaN NaN 3 -0.059859 NaN NaN 4 -1.917175 NaN -1.122671 5 -0.073825 1.384257 -1.266332 6 0.957087 0.185031 -1.241783 [7 rows x 3 columns] In [92]: df.dropna(thresh=4) Out[92]: Empty DataFrame Columns: [0, 1, 2] Index: [] [0 rows x 3 columns] In [93]: df.dropna(thresh=3) Out[93]: 0 1 2 5 -0.073825 1.384257 -1.266332 6 0.957087 0.185031 -1.241783 [2 rows x 3 columns]
(2)填充缺失数据
- fillna方法
In [97]: df.fillna(0) Out[97]: 0 1 2 0 0.739899 0.000000 0.000000 1 0.124309 0.000000 0.000000 2 0.960898 0.000000 0.000000 3 -0.059859 0.000000 0.000000 4 -1.917175 0.000000 -1.122671 5 -0.073825 1.384257 -1.266332 6 0.957087 0.185031 -1.241783 [7 rows x 3 columns] In [99]: df.fillna({1:0.5,2:-1}) Out[99]: 0 1 2 0 0.739899 0.500000 -1.000000 1 0.124309 0.500000 -1.000000 2 0.960898 0.500000 -1.000000 3 -0.059859 0.500000 -1.000000 4 -1.917175 0.500000 -1.122671 5 -0.073825 1.384257 -1.266332 6 0.957087 0.185031 -1.241783 [7 rows x 3 columns]
- fillna 默认会返回新对象,但也可以对现有对象进行就地修改
In [100]: _ = df.fillna(0,inplace=True) In [101]: df Out[101]: 0 1 2 0 0.739899 0.000000 0.000000 1 0.124309 0.000000 0.000000 2 0.960898 0.000000 0.000000 3 -0.059859 0.000000 0.000000 4 -1.917175 0.000000 -1.122671 5 -0.073825 1.384257 -1.266332 6 0.957087 0.185031 -1.241783 [7 rows x 3 columns]
- fillna函数的参数
层次化索引:
- 使用低维度形式处理高维度数据
- 创建一个Series,并用一个由列表或数组组成的列表作为索引
- 这就是带有MultiIndex索引的Series的格式化输出形式
- 索引之间的间隔表示直接使用上面的标签
- 对于一个层次化索引的对象,选取数据子集的操作很简单
In [105]: data = pd.Series(np.random.randn(10), .....: index = [['a','a','a','b','b','b','c','c','d','d'], .....: [1,2,3,1,2,3,1,2,2,3]]) In [106]: data Out[106]: a 1 -0.237206 2 -0.992311 3 1.685961 b 1 0.987261 2 -1.166006 3 -0.962065 c 1 -1.071484 2 0.393728 d 2 0.793652 3 -0.223266 dtype: float64 In [107]: data.index Out[107]: MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]], labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]]) In [108]: data['b'] Out[108]: 1 0.987261 2 -1.166006 3 -0.962065 dtype: float64
- 可以在“内层”中进行选取
In [109]: data[:,2] Out[109]: a -0.992311 b -1.166006 c 0.393728 d 0.793652 dtype: float64
- 层次化索引在数据重塑和基于分组的操作中扮演着重要的角色
- 可以利用unstack 方法被重新安排到一个DataFrame中
In [110]: data.unstack() Out[110]: 1 2 3 a -0.237206 -0.992311 1.685961 b 0.987261 -1.166006 -0.962065 c -1.071484 0.393728 NaN d NaN 0.793652 -0.223266 [4 rows x 3 columns] In [111]: data.unstack().stack() Out[111]: a 1 -0.237206 2 -0.992311 3 1.685961 b 1 0.987261 2 -1.166006 3 -0.962065 c 1 -1.071484 2 0.393728 d 2 0.793652 3 -0.223266 dtype: float64
- 对于一个DataFrame,每条轴都可以有分层索引
- 每层都可以有名字;如果制定了名称,他们就会显示在控制台输出中
- 不要将索引名跟轴标签混为一谈
In [113]: frame = pd.DataFrame(np.arange(12).reshape((4,3)), index = [['a','a','b','b'],[1,2,1,2]], columns = [['Ohio',"Ohio",'Colorado'], ['Green','Red','Green']]) In [114]: frame Out[114]: Ohio Colorado Green Red Green a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 [4 rows x 3 columns] In [115]: frame.index.names = ['key1','key2'] In [116]: frame.columns.names = ['state','color'] In [117]: frame Out[117]: state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 [4 rows x 3 columns]
- 可以单独创建MultiIndex 然后复用,上面那个DataFrame中的(分级)列可以这样创建
In [125]: col = pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color']) In [126]: frame2 = pd.DataFrame(np.arange(12).reshape((4,3)), index = [['a','a','b','b'],[1,2,1,2]], columns = col ) In [127]: frame2 Out[127]: state Ohio Colorado color Green Red Green a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 [4 rows x 3 columns]
(1)重排分级顺序
- swaplevel 接受两个级别编号或名称,并返回一个互换了级别的新对象
In [129]: frame.swaplevel('key1','key2') Out[129]: state Ohio Colorado color Green Red Green key2 key1 1 a 0 1 2 2 a 3 4 5 1 b 6 7 8 2 b 9 10 11 [4 rows x 3 columns]
- sortlevel 则根据单个级别中的值对数据进行排序
- 交换级别时,常常也会用到sortlevel
In [130]: frame.sortlevel(1) Out[130]: state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 b 1 6 7 8 a 2 3 4 5 b 2 9 10 11 [4 rows x 3 columns] In [131]: frame Out[131]: state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 [4 rows x 3 columns] In [132]: frame.sortlevel(0) Out[132]: state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 [4 rows x 3 columns]
In [134]: frame.swaplevel(0,1) Out[134]: state Ohio Colorado color Green Red Green key2 key1 1 a 0 1 2 2 a 3 4 5 1 b 6 7 8 2 b 9 10 11 [4 rows x 3 columns] In [135]: frame Out[135]: state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 [4 rows x 3 columns] In [136]: frame.swaplevel(0,1).sortlevel(0) Out[136]: state Ohio Colorado color Green Red Green key2 key1 1 a 0 1 2 b 6 7 8 2 a 3 4 5 b 9 10 11 [4 rows x 3 columns]
(2)根据级别汇总统计
- 我们可以根据行或列上的级别来进行求和
In [137]: frame.sum(level='key2') Out[137]: state Ohio Colorado color Green Red Green key2 1 6 8 10 2 12 14 16 [2 rows x 3 columns] In [138]: frame.sum(level='color',axis=1) Out[138]: color Green Red key1 key2 a 1 2 1 2 8 4 b 1 14 7 2 20 10 [4 rows x 2 columns]
(3)使用DataFrame的列
- DataFrame 的 set_index 函数会将其一个或多个列转换为行索引,并创建一个新的DataFrame
In [139]: frame = pd.DataFrame({'a':range(7),'b':range(7,0,-1), .....: 'c':['one','one','one','two','two','two','two'], .....: 'd':[0,1,2,0,1,2,3]}) In [140]: frame Out[140]: a b c d 0 0 7 one 0 1 1 6 one 1 2 2 5 one 2 3 3 4 two 0 4 4 3 two 1 5 5 2 two 2 6 6 1 two 3 [7 rows x 4 columns] In [141]: frame2 = frame.set_index(['c','d']) In [142]: frame2 Out[142]: a b c d one 0 0 7 1 1 6 2 2 5 two 0 3 4 1 4 3 2 5 2 3 6 1 [7 rows x 2 columns]
- 默认情况下,那些列会从DataFrame中移除,但也可以将其保留下来
- reset_index 的功能跟set_index刚好相反,层次化索引的级别会被转移到列里面
In [143]: frame.set_index(['c','d'],drop=False) Out[143]: a b c d c d one 0 0 7 one 0 1 1 6 one 1 2 2 5 one 2 two 0 3 4 two 0 1 4 3 two 1 2 5 2 two 2 3 6 1 two 3 [7 rows x 4 columns] In [144]: frame2.reset_index() Out[144]: c d a b 0 one 0 0 7 1 one 1 1 6 2 one 2 2 5 3 two 0 3 4 4 two 1 4 3 5 two 2 5 2 6 two 3 6 1 [7 rows x 4 columns]
时间: 2024-10-07 02:24:17