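Besides Series and DataFrame, the two most commonly used data structures, the pandas library also provides a Panel data structure. A Panel object can usually be created from a dict of DataFrame objects or from a three-dimensional array.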
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 26 18:01:05 2016

@author: Jeremy
"""
import numpy as np
from pandas import Series, DataFrame, Panel
import pandas as pd
# Create a Panel object from a dict of DataFrames
df = np.random.binomial(100, 0.95, (9, 2))
dm = np.random.binomial(100, 0.95, (12, 2))
dff = DataFrame(df, columns=['Physics', 'Math'])
dfm = DataFrame(dm, columns=['Physics', 'Math'])
score_panel = Panel({'Girls': dff, 'Boys': dfm})
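Use print() to inspect the newly created Panel object score_panel. A Panel has three axes (Items axis, Major_axis axis, and Minor_axis axis), and the output reports the size of each one: 2 x 12 x 2. The major axis has 12 entries because it is the union of the row indexes of the two frames (9 rows for Girls, 12 for Boys).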
>>> print(score_panel)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 12 (major_axis) x 2 (minor_axis)
Items axis: Boys to Girls
Major_axis axis: 0 to 11
Minor_axis axis: Physics to Math
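Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the code above only runs on older releases. As a rough sketch for newer versions, and assuming a DataFrame with two-level columns is an acceptable stand-in for the Panel, the same data can be laid out as follows. The names `score_wide`, `item`, and `minor`, and the fixed random seed, are illustrative choices and not part of the original example.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is repeatable (an assumption, not in the original)
rng = np.random.RandomState(0)
girls = pd.DataFrame(rng.binomial(100, 0.95, (9, 2)), columns=['Physics', 'Math'])
boys = pd.DataFrame(rng.binomial(100, 0.95, (12, 2)), columns=['Physics', 'Math'])

# Concatenate side by side: the dict keys become the outer column level,
# which plays the role of the Panel's items axis.
score_wide = pd.concat({'Girls': girls, 'Boys': boys}, axis=1)
score_wide.columns.names = ['item', 'minor']

print(score_wide.shape)  # (12, 4): 12 rows (union of both indexes), 2 items x 2 subjects
print(score_wide.columns.tolist())
```

The row index is the union of the two frames' indexes, which mirrors the 12-entry major axis of the Panel.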
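Indexing a Panel works much like indexing a two-dimensional array or a DataFrame. By default the first axis is the items axis, so it can be indexed directly: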
# Extract the Physics and Math scores of the Girls group
score_panel['Girls']
    Physics  Math
0        91    96
1        97    98
2        98    97
3        93    97
4        95    97
5        97    95
6        95    95
7        90    96
8        92    96
9       NaN   NaN
10      NaN   NaN
11      NaN   NaN
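The NaN rows in the output above come from index alignment: the Girls frame is shorter than the Boys frame, so it is padded to the shared major axis. A minimal sketch of the same effect on newer pandas (where Panel no longer exists), assuming a DataFrame with two-level columns as a stand-in and using small made-up scores:

```python
import pandas as pd

# Made-up scores; Girls has 2 rows, Boys has 3
girls = pd.DataFrame({'Physics': [91, 97], 'Math': [96, 98]})
boys = pd.DataFrame({'Physics': [95, 90, 94], 'Math': [99, 91, 95]})
score_wide = pd.concat({'Girls': girls, 'Boys': boys}, axis=1)

# Selecting the outer column label is the analogue of score_panel['Girls']
girls_part = score_wide['Girls']
print(girls_part)           # the last row is NaN, padded to match Boys
print(girls_part.dropna())  # drop the padding to recover the original rows
```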
Label-based indexing with ix extends to three dimensions, so we can select the data we want along all three axes. For example:
# Find the Physics and Math scores of the girls whose Math score
# is at least 93; the result is returned as a DataFrame
score_panel.ix['Girls', score_panel.Girls.Math >= 93, :]
   Physics  Math
0       91    96
1       97    98
2       98    97
3       93    97
4       95    97
5       97    95
6       95    95
7       90    96
8       92    96
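The ix indexer was itself removed in pandas 1.0. On a newer pandas the same kind of boolean selection might be written with loc, as in the sketch below. It assumes a DataFrame with two-level columns standing in for the Panel, and the scores are made up for illustration.

```python
import pandas as pd

# Made-up scores; Girls has fewer rows than Boys, so it is NaN-padded
girls = pd.DataFrame({'Physics': [91, 97, 98, 93], 'Math': [96, 92, 97, 90]})
boys = pd.DataFrame({'Physics': [95, 90, 94, 96, 92], 'Math': [99, 91, 95, 94, 97]})
score_wide = pd.concat({'Girls': girls, 'Boys': boys}, axis=1)

girls_part = score_wide['Girls']

# Rows where Math >= 93; NaN compares as False, so padded rows drop out
selected = girls_part.loc[girls_part['Math'] >= 93, :]
print(selected)
```

Because a comparison with NaN is False, the padded rows are excluded automatically, which matches the 9-row result shown above for the Panel.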
Next, we build a Panel object holding price data for several stocks over a period of time and use it to introduce some of the Panel methods.
import pandas.io.data as web
pdata = Panel(dict((symbol, web.DataReader(symbol, data_source='yahoo',
                                           start='1/1/2009', end='6/1/2012'))
                   for symbol in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
print(pdata)
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close
# Extract the market data of the four stocks on '2012-06-01'
pdata.ix[:, '2012-06-01', :]
                   AAPL            DELL            GOOG             MSFT
Open       5.691600e+02        12.15000      571.790972        28.760000
High       5.726500e+02        12.30000      572.650996        28.959999
Low        5.605200e+02        12.04500      568.350996        28.440001
Close      5.609900e+02        12.07000      570.981000        28.450001
Volume     1.302469e+08  19397600.00000  6138700.000000  56634300.000000
Adj Close  7.421812e+01        11.67592      285.205295        25.598227
# Extract the closing prices (Close) of the four stocks from '2012-05-30' to '2012-06-01'
pdata.ix[:, '2012-05-30':'2012-06-01', 'Close']
                  AAPL   DELL        GOOG       MSFT
Date
2012-05-30  579.169998  12.56  588.230992  29.340000
2012-05-31  577.730019  12.33  580.860990  29.190001
2012-06-01  560.989983  12.07  570.981000  28.450001
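Neither Panel nor ix is available in current pandas, so here is a hedged sketch of how the two selections above (one date across all items, and one field over a date range) could look with a DataFrame whose columns have two levels. The tickers AAA and BBB and every price are made-up placeholders, not real quotes, and the level names `item` and `minor` are illustrative.

```python
import pandas as pd

dates = pd.date_range('2012-05-28', '2012-06-01', freq='D', name='Date')

# Hypothetical tickers with synthetic prices (not real market data)
frames = {
    'AAA': pd.DataFrame({'Open': [10.0, 11.0, 12.0, 13.0, 14.0],
                         'Close': [11.0, 12.0, 13.0, 14.0, 15.0]}, index=dates),
    'BBB': pd.DataFrame({'Open': [20.0, 21.0, 22.0, 23.0, 24.0],
                         'Close': [21.0, 22.0, 23.0, 24.0, 25.0]}, index=dates),
}
wide = pd.concat(frames, axis=1)
wide.columns.names = ['item', 'minor']

# Analogue of pdata.ix[:, '2012-06-01', :]:
# one date, with fields as rows and items as columns
one_day = wide.loc[pd.Timestamp('2012-06-01')].unstack(level='item')

# Analogue of pdata.ix[:, '2012-05-30':'2012-06-01', 'Close']:
# one field over a date range, with dates as rows and items as columns
closes = wide.xs('Close', axis=1, level='minor').loc['2012-05-30':'2012-06-01']

print(one_day)
print(closes)
```

As with the Panel, label slices on the date index include both endpoints.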
# Extract the full market data of the four stocks from '2012-05-30' to '2012-06-01'
pdata.ix[:, '2012-05-30':'2012-06-01', :]
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close
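This result differs from the previous one because it is still a Panel object. The previous selection kept only the closing price, so the third dimension (axis) disappeared and a two-dimensional DataFrame was returned. To see the complete data the way a DataFrame would display it, use the to_frame() method, which presents the panel data as a "stacked" DataFrame: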
# Present the panel data as a DataFrame
pdata.ix[:, '2012-05-30':'2012-06-01', :].to_frame()
                              AAPL            DELL            GOOG
Date       minor
2012-05-30 Open       5.692000e+02        12.59000      588.161028
           High       5.799900e+02        12.70000      591.901014
           Low        5.665600e+02        12.46000      583.530999
           Close      5.791700e+02        12.56000      588.230992
           Volume     1.323574e+08  19787800.00000  3827600.000000
           Adj Close  7.662330e+01        12.14992      293.821674
2012-05-31 Open       5.807400e+02        12.53000      588.720982
           High       5.815000e+02        12.54000      590.001032
           Low        5.714600e+02        12.33000      579.001013
           Close      5.777300e+02        12.33000      580.860990
           Volume     1.229186e+08  19955600.00000  5958800.000000
           Adj Close  7.643280e+01        11.92743      290.140354
2012-06-01 Open       5.691600e+02        12.15000      571.790972
           High       5.726500e+02        12.30000      572.650996
           Low        5.605200e+02        12.04500      568.350996
           Close      5.609900e+02        12.07000      570.981000
           Volume     1.302469e+08  19397600.00000  6138700.000000
           Adj Close  7.421812e+01        11.67592      285.205295

                                 MSFT
Date       minor
2012-05-30 Open             29.350000
           High             29.480000
           Low              29.120001
           Close            29.340000
           Volume     41585500.000000
           Adj Close        26.399015
2012-05-31 Open             29.299999
           High             29.420000
           Low              28.940001
           Close            29.190001
           Volume     39134000.000000
           Adj Close        26.264051
2012-06-01 Open             28.760000
           High             28.959999
           Low              28.440001
           Close            28.450001
           Volume     56634300.000000
           Adj Close        25.598227
DataFrame has a corresponding to_panel() method, which is the inverse of to_frame():
stacked = pdata.ix[:, '2012-05-30':'2012-06-01', :].to_frame()
stacked.to_panel()
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 3 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close
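The to_frame() / to_panel() round trip has a close relative in stack() and unstack(), which still exist in current pandas (to_panel() was removed together with Panel). The sketch below assumes a DataFrame with two-level columns as the stand-in for the Panel. The tickers, prices, and level names are made up for illustration.

```python
import pandas as pd

dates = pd.date_range('2012-05-30', '2012-06-01', freq='D', name='Date')

# Hypothetical tickers with synthetic prices (not real market data)
frames = {
    'AAA': pd.DataFrame({'Open': [10.0, 11.0, 12.0],
                         'Close': [10.5, 11.5, 12.5]}, index=dates),
    'BBB': pd.DataFrame({'Open': [20.0, 21.0, 22.0],
                         'Close': [20.5, 21.5, 22.5]}, index=dates),
}
wide = pd.concat(frames, axis=1)
wide.columns.names = ['item', 'minor']

# Similar to to_frame(): move the field level into the row index,
# giving rows keyed by (Date, minor) and one column per item
stacked = wide.stack(level='minor')

# Similar to to_panel(): move the field level back into the columns
restored = stacked.unstack(level='minor')

print(stacked)
print(restored)
```

Column order may differ after the round trip, so it is safer to compare values by label than by position.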
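Summary: one benefit of the Panel object is that it lets us store multi-level, multi-dimensional data in a single object. Whenever we need data along any dimension for modeling and analysis, we can extract a Series or a DataFrame from it at any time.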
References:
Wes McKinney, Python for Data Analysis (Chinese edition: 《利用python进行数据分析》)
Computational Statistics in Python: http://people.duke.edu/~ccc14/sta-663/index.html