A Beginner's Guide to pandas

# 10 Minutes to pandas

An introductory pandas tutorial aimed at newcomers. For more advanced recipes, see the [pandas cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook).

By convention, pandas and its companion libraries are imported as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# When plotting in an IPython notebook, add the following magic command
%matplotlib inline

## Creating pandas objects

Build a pandas Series from a Python list:

# A Series gets a default integer index automatically
s = pd.Series([1,2,3,np.nan, 4,5])
s

0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
dtype: float64
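
As a small sketch of the same constructor with an explicit index (the labels `a` to `d` are made up for illustration), values can be looked up by label and the missing entries counted:

```python
import numpy as np
import pandas as pd

# Hypothetical labels; any hashable values can serve as the index.
s = pd.Series([1, 2, np.nan, 4], index=["a", "b", "c", "d"])

print(s["b"])          # look up a value by its label
print(s.isna().sum())  # number of missing entries
```

Because the list contains `np.nan`, the whole Series is stored as `float64`, just as in the example above.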

Create a DataFrame from a NumPy array, using a sequence of dates as the row index and 'A', 'B', 'C', 'D' as the column labels:

dates = pd.date_range('20160101', periods=6)
dates

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df
A B C D
2016-01-01 -0.808397 -1.548973 1.013311 1.981536
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-03 -1.308454 0.625522 -2.465547 1.757797
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892
2016-01-06 1.027911 -0.311089 0.646725 0.773003

Alternatively, a DataFrame can be created by passing a dict:

df2 = pd.DataFrame({
        'A': pd.Timestamp('20160701'),
        'B': pd.Series(1, index=list(range(4)), dtype='float32'),
        'C': np.array([3] * 4, dtype='int32'),
        'D': pd.Categorical(['Test', 'Train', 'Test', 'Train']),
        'E': 1,
        'F': 'foo'
    })
df2
A B C D E F
0 2016-07-01 1.0 3 Test 1 foo
1 2016-07-01 1.0 3 Train 1 foo
2 2016-07-01 1.0 3 Test 1 foo
3 2016-07-01 1.0 3 Train 1 foo

Each column of df2 has its own type, which can be inspected through the dtypes attribute:

df2.dtypes

A datetime64[ns]
B float32
C int32
D category
E int64
F object
dtype: object

## Viewing data

View the first and last few rows:

# head(n) returns the first n rows
df.head(3)
A B C D
2016-01-01 -0.808397 -1.548973 1.013311 1.981536
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-03 -1.308454 0.625522 -2.465547 1.757797
# tail(n) returns the last n rows
df.tail(2)
A B C D
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892
2016-01-06 1.027911 -0.311089 0.646725 0.773003

Inspect the DataFrame's row index, column labels, and underlying data:

df.index

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.values

array([[-0.8083965 , -1.54897301, 1.01331067, 1.98153559],
[ 1.96654297, 0.46829396, 0.16844495, -1.47401779],
[-1.30845444, 0.62552152, -2.46554656, 1.75779664],
[-1.43058558, -0.73216048, -0.03483597, 0.21629514],
[-0.51974796, 0.3868237 , -2.77528915, -0.08889186],
[ 1.02791114, -0.31108897, 0.64672466, 0.77300274]])

Quick summary statistics:

df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.178788 -0.185264 -0.574532 0.527620
std 1.372179 0.846927 1.629433 1.278357
min -1.430586 -1.548973 -2.775289 -1.474018
25% -1.183440 -0.626893 -1.857869 -0.012595
50% -0.664072 0.037867 0.066804 0.494649
75% 0.640996 0.447926 0.527155 1.511598
max 1.966543 0.625522 1.013311 1.981536

Transpose the data:

df.T
2016-01-01 00:00:00 2016-01-02 00:00:00 2016-01-03 00:00:00 2016-01-04 00:00:00 2016-01-05 00:00:00 2016-01-06 00:00:00
A -0.808397 1.966543 -1.308454 -1.430586 -0.519748 1.027911
B -1.548973 0.468294 0.625522 -0.732160 0.386824 -0.311089
C 1.013311 0.168445 -2.465547 -0.034836 -2.775289 0.646725
D 1.981536 -1.474018 1.757797 0.216295 -0.088892 0.773003

Sort by an axis (here, the column labels in descending order):

df.sort_index(axis=1, ascending=False)
D C B A
2016-01-01 1.981536 1.013311 -1.548973 -0.808397
2016-01-02 -1.474018 0.168445 0.468294 1.966543
2016-01-03 1.757797 -2.465547 0.625522 -1.308454
2016-01-04 0.216295 -0.034836 -0.732160 -1.430586
2016-01-05 -0.088892 -2.775289 0.386824 -0.519748
2016-01-06 0.773003 0.646725 -0.311089 1.027911

Sort by the values in a column:

df.sort_values(by='C')
A B C D
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892
2016-01-03 -1.308454 0.625522 -2.465547 1.757797
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-06 1.027911 -0.311089 0.646725 0.773003
2016-01-01 -0.808397 -1.548973 1.013311 1.981536

## Selecting data

### Getting data

# Selecting a single column returns a Series.
# df['A'] is equivalent to df.A
df['A']

2016-01-01 -0.808397
2016-01-02 1.966543
2016-01-03 -1.308454
2016-01-04 -1.430586
2016-01-05 -0.519748
2016-01-06 1.027911
Freq: D, Name: A, dtype: float64

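Slicing with [] selects rows. **Note: positional slices such as df[0:3] exclude the end point, while label slices such as df['20160102':'20160104'] include it.**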

df[0:3]
A B C D
2016-01-01 -0.808397 -1.548973 1.013311 1.981536
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-03 -1.308454 0.625522 -2.465547 1.757797
df['20160102':'20160104']
A B C D
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-03 -1.308454 0.625522 -2.465547 1.757797
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295
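
The difference between the two slice styles is easy to check on a small frame. This is a minimal sketch with made-up integer data rather than the random values used above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df[0:2]                  # positions 0 and 1; the end is excluded
by_label = df["20160101":"20160103"]   # Jan 1 through Jan 3; the end is included

print(len(by_position), len(by_label))
```

The positional slice returns two rows, while the label slice returns three because the end label itself is part of the result.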

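Selecting by label. Note: label-based and position-based selection use the .at, .iat, .loc, and .iloc accessors. The older .ix accessor was deprecated in pandas 0.20 and removed in pandas 1.0.

df.loc['20160101']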

A -0.808397
B -1.548973
C 1.013311
D 1.981536
Name: 2016-01-01 00:00:00, dtype: float64

Select several columns at once by passing a list of labels:

df.loc[:, ['A','B']]
A B
2016-01-01 -0.808397 -1.548973
2016-01-02 1.966543 0.468294
2016-01-03 -1.308454 0.625522
2016-01-04 -1.430586 -0.732160
2016-01-05 -0.519748 0.386824
2016-01-06 1.027911 -0.311089

Slice by label (both end points are included):

df.loc['20160102':'20160103', ['B','C']]
B C
2016-01-02 0.468294 0.168445
2016-01-03 0.625522 -2.465547
df.loc['20160103',['A', 'B']]

A -1.308454
B 0.625522
Name: 2016-01-03 00:00:00, dtype: float64

Get a single scalar value:

print(df.loc['20160101', 'A'])
# or use the .at accessor
print(df.at[dates[0], 'A'])

-0.808396502432
-0.808396502432

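Selecting by position: integer coordinates retrieve a slice or a single value. Here slicing follows Python and NumPy semantics, so the end point is excluded.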

df.iloc[3]

A -1.430586
B -0.732160
C -0.034836
D 0.216295
Name: 2016-01-04 00:00:00, dtype: float64

df.iloc[3:5, 0:2]
A B
2016-01-04 -1.430586 -0.732160
2016-01-05 -0.519748 0.386824
df.iloc[[1,2,4],[0,2]]
A C
2016-01-02 1.966543 0.168445
2016-01-03 -1.308454 -2.465547
2016-01-05 -0.519748 -2.775289
df.iloc[1:3]
A B C D
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-03 -1.308454 0.625522 -2.465547 1.757797
df.iloc[:,1:3]
B C
2016-01-01 -1.548973 1.013311
2016-01-02 0.468294 0.168445
2016-01-03 0.625522 -2.465547
2016-01-04 -0.732160 -0.034836
2016-01-05 0.386824 -2.775289
2016-01-06 -0.311089 0.646725
df.iloc[1,1] # equivalent to df.iat[1,1]

0.46829396335234058
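
The distinction between labels and positions matters most when the index itself consists of integers. A short sketch, using a hypothetical frame whose integer labels deliberately differ from the row positions:

```python
import pandas as pd

# Integer labels that do not match the row order.
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])

by_label = df.loc[0, "x"]     # the row labelled 0, which is the second row
by_position = df.iloc[0, 0]   # the row at position 0, which is the first row

print(by_label, by_position)
```

Here `.loc` and `.iloc` return different values for the same integer, which is why it helps to pick the accessor explicitly.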

### Boolean indexing

df[df.A > 0]
A B C D
2016-01-02 1.966543 0.468294 0.168445 -1.474018
2016-01-06 1.027911 -0.311089 0.646725 0.773003
df[df >0]
A B C D
2016-01-01 NaN NaN 1.013311 1.981536
2016-01-02 1.966543 0.468294 0.168445 NaN
2016-01-03 NaN 0.625522 NaN 1.757797
2016-01-04 NaN NaN NaN 0.216295
2016-01-05 NaN 0.386824 NaN NaN
2016-01-06 1.027911 NaN 0.646725 0.773003
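
Several conditions can be combined with `&` (and), `|` (or), and `~` (not). The sketch below uses a small made-up frame; note that each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4], "B": [-1, 2, 3, -4]})

both_positive = df[(df["A"] > 0) & (df["B"] > 0)]    # A and B positive
either_positive = df[(df["A"] > 0) | (df["B"] > 0)]  # A or B positive
a_not_positive = df[~(df["A"] > 0)]                  # A not positive

print(both_positive)
```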

Use the isin method to filter rows:

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
A B C D E
2016-01-01 -0.808397 -1.548973 1.013311 1.981536 one
2016-01-02 1.966543 0.468294 0.168445 -1.474018 one
2016-01-03 -1.308454 0.625522 -2.465547 1.757797 two
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295 three
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892 four
2016-01-06 1.027911 -0.311089 0.646725 0.773003 three
df2[df2['E'].isin(['one','three'])]
A B C D E
2016-01-01 -0.808397 -1.548973 1.013311 1.981536 one
2016-01-02 1.966543 0.468294 0.168445 -1.474018 one
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295 three
2016-01-06 1.027911 -0.311089 0.646725 0.773003 three

### Setting data

Insert a new column; the values are aligned to the DataFrame by index:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20160102', periods=6))
df[‘F‘] = s1
df
A B C D F
2016-01-01 -0.808397 -1.548973 1.013311 1.981536 NaN
2016-01-02 1.966543 0.468294 0.168445 -1.474018 1.0
2016-01-03 -1.308454 0.625522 -2.465547 1.757797 2.0
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295 3.0
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892 4.0
2016-01-06 1.027911 -0.311089 0.646725 0.773003 5.0
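
Index alignment is what produces the NaN in the first row: the new values are matched by label, not by position. A minimal sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
s = pd.Series([10, 20, 30], index=["y", "z", "w"])

df["F"] = s  # matched on the index labels

print(df)
```

Row `x` has no counterpart in `s` and becomes NaN, while the entry labelled `w` is dropped because the frame has no such row.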

Values can also be set by label:

df.at[dates[0], 'A'] = 0
df
A B C D F
2016-01-01 0.000000 -1.548973 1.013311 1.981536 NaN
2016-01-02 1.966543 0.468294 0.168445 -1.474018 1.0
2016-01-03 -1.308454 0.625522 -2.465547 1.757797 2.0
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295 3.0
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892 4.0
2016-01-06 1.027911 -0.311089 0.646725 0.773003 5.0

Set a value by position:

df.iat[0, 1] = 0
df
A B C D F
2016-01-01 0.000000 0.000000 1.013311 1.981536 NaN
2016-01-02 1.966543 0.468294 0.168445 -1.474018 1.0
2016-01-03 -1.308454 0.625522 -2.465547 1.757797 2.0
2016-01-04 -1.430586 -0.732160 -0.034836 0.216295 3.0
2016-01-05 -0.519748 0.386824 -2.775289 -0.088892 4.0
2016-01-06 1.027911 -0.311089 0.646725 0.773003 5.0

Assign a NumPy array to a column:

df.loc[:, 'D'] = np.array([5] * len(df))
df
A B C D F
2016-01-01 0.000000 0.000000 1.013311 5 NaN
2016-01-02 1.966543 0.468294 0.168445 5 1.0
2016-01-03 -1.308454 0.625522 -2.465547 5 2.0
2016-01-04 -1.430586 -0.732160 -0.034836 5 3.0
2016-01-05 -0.519748 0.386824 -2.775289 5 4.0
2016-01-06 1.027911 -0.311089 0.646725 5 5.0
# A boolean mask also works for assignment: negate every positive value
df2 = df.copy()
df2[df2>0] = -df2
df2
A B C D F
2016-01-01 0.000000 0.000000 -1.013311 -5 NaN
2016-01-02 -1.966543 -0.468294 -0.168445 -5 -1.0
2016-01-03 -1.308454 -0.625522 -2.465547 -5 -2.0
2016-01-04 -1.430586 -0.732160 -0.034836 -5 -3.0
2016-01-05 -0.519748 -0.386824 -2.775289 -5 -4.0
2016-01-06 -1.027911 -0.311089 -0.646725 -5 -5.0

## Handling missing data

pandas uses np.nan to represent missing data. By default, missing values are excluded from computations.

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
A B C D F E
2016-01-01 0.000000 0.000000 1.013311 5 NaN 1.0
2016-01-02 1.966543 0.468294 0.168445 5 1.0 1.0
2016-01-03 -1.308454 0.625522 -2.465547 5 2.0 NaN
2016-01-04 -1.430586 -0.732160 -0.034836 5 3.0 NaN

Option 1: drop every row that contains missing data

df1.dropna(how='any')
A B C D F E
2016-01-02 1.966543 0.468294 0.168445 5 1.0 1.0

Option 2: fill in the missing values

df1.fillna(value=5)
A B C D F E
2016-01-01 0.000000 0.000000 1.013311 5 5.0 1.0
2016-01-02 1.966543 0.468294 0.168445 5 1.0 1.0
2016-01-03 -1.308454 0.625522 -2.465547 5 2.0 5.0
2016-01-04 -1.430586 -0.732160 -0.034836 5 3.0 5.0
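
Both methods return a new object and accept finer-grained arguments. As a sketch on a small made-up frame, `dropna` can be limited to certain columns with `subset`, and `fillna` can take a different fill value per column:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"F": [np.nan, 1.0, 2.0], "E": [1.0, np.nan, np.nan]})

only_f = df1.dropna(subset=["F"])           # drop rows only where F is missing
filled = df1.fillna({"F": 0.0, "E": -1.0})  # one fill value per column

print(only_f)
print(filled)
```

The original `df1` is left untouched in both cases.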

Get a boolean mask of the missing entries:

df.isnull()
# True marks a position where data is missing
A B C D F
2016-01-01 False False False False True
2016-01-02 False False False False False
2016-01-03 False False False False False
2016-01-04 False False False False False
2016-01-05 False False False False False
2016-01-06 False False False False False

## Operations

Operations generally exclude missing values.

### Statistics

df.mean()

A -0.044056
B 0.072898
C -0.574532
D 5.000000
F 3.000000
dtype: float64

# The same statistic along the other axis (one value per row)
df.mean(1)

2016-01-01 1.503328
2016-01-02 1.720656
2016-01-03 0.770304
2016-01-04 1.160484
2016-01-05 1.218357
2016-01-06 2.272709
Freq: D, dtype: float64
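
The skipping of missing values is controlled by the `skipna` argument. A quick sketch on a three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

skipped = s.mean()                 # the NaN is ignored
propagated = s.mean(skipna=False)  # the NaN propagates into the result

print(skipped, propagated)
```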

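Operating on objects of different dimensionality requires alignment: pandas aligns the labels and then broadcasts automatically along the specified axis.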

s = pd.Series([1,3,5,np.nan, 6, 8], index=dates).shift(2)
s

2016-01-01 NaN
2016-01-02 NaN
2016-01-03 1.0
2016-01-04 3.0
2016-01-05 5.0
2016-01-06 NaN
Freq: D, dtype: float64

df.sub(s, axis='index')
A B C D F
2016-01-01 NaN NaN NaN NaN NaN
2016-01-02 NaN NaN NaN NaN NaN
2016-01-03 -2.308454 -0.374478 -3.465547 4.0 1.0
2016-01-04 -4.430586 -3.732160 -3.034836 2.0 0.0
2016-01-05 -5.519748 -4.613176 -7.775289 0.0 -1.0
2016-01-06 NaN NaN NaN NaN NaN
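
The same behaviour can be seen on a smaller scale. In this sketch (made-up labels and values) the Series is subtracted from every column, and rows without a matching label become NaN:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
                  index=list("xyz"))
s = pd.Series([1.0, 2.0], index=["x", "y"])  # no entry for "z"

result = df.sub(s, axis="index")  # align on row labels, broadcast across columns

print(result)
```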

### Applying functions

Apply a function to the data:

df.apply(np.cumsum, axis=0)
A B C D F
2016-01-01 0.000000 0.000000 1.013311 5 NaN
2016-01-02 1.966543 0.468294 1.181756 10 1.0
2016-01-03 0.658089 1.093815 -1.283791 15 3.0
2016-01-04 -0.772497 0.361655 -1.318627 20 6.0
2016-01-05 -1.292245 0.748479 -4.093916 25 10.0
2016-01-06 -0.264334 0.437390 -3.447191 30 15.0
df.apply(lambda x: x.max() - x.min())

A 3.397129
B 1.357682
C 3.788600
D 0.000000
F 4.000000
dtype: float64

### Histogramming

s = pd.Series(np.random.randint(0,7,size=10))
s

0 1
1 5
2 6
3 5
4 6
5 4
6 0
7 3
8 6
9 5
dtype: int64

s.value_counts()

6 3
5 3
4 1
3 1
1 1
0 1
dtype: int64
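
When relative frequencies are more useful than raw counts, `value_counts` accepts `normalize=True`. A sketch with fixed values so the result is reproducible:

```python
import pandas as pd

s = pd.Series([6, 5, 6, 5, 6, 4])

counts = s.value_counts()                # occurrences, most frequent first
shares = s.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```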

### String methods

s = pd.Series(['A', 'B', 'C', 'Aaba', 'BAcd', np.nan, 'CBA', 'dog', 'CAT'])
s.str.lower()

0 a
1 b
2 c
3 aaba
4 bacd
5 NaN
6 cba
7 dog
8 cat
dtype: object

## Merging data

### Concatenation with concat

# concat joins pandas objects together along an axis
df = pd.DataFrame(np.random.randn(10,4))
df
0 1 2 3
0 -0.859307 -0.723708 -1.121663 1.438285
1 -0.168126 -0.343567 0.678940 0.394126
2 -0.541090 1.908998 -0.543378 -0.109371
3 -1.108110 0.332687 -1.320752 1.022476
4 0.591171 -1.259859 0.930266 0.688108
5 -0.065470 -0.957394 1.423691 -0.295647
6 1.728151 0.162709 0.836916 -0.573260
7 -0.025487 0.307945 -0.414787 -0.045495
8 -0.601439 -0.167967 -1.198304 0.242739
9 0.495473 -0.348495 1.599757 0.184015
pieces = [df[:3], df[5:]]
pd.concat(pieces)
0 1 2 3
0 -0.859307 -0.723708 -1.121663 1.438285
1 -0.168126 -0.343567 0.678940 0.394126
2 -0.541090 1.908998 -0.543378 -0.109371
5 -0.065470 -0.957394 1.423691 -0.295647
6 1.728151 0.162709 0.836916 -0.573260
7 -0.025487 0.307945 -0.414787 -0.045495
8 -0.601439 -0.167967 -1.198304 0.242739
9 0.495473 -0.348495 1.599757 0.184015
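
Note that the concatenated result keeps the original row labels, so the gap left by the omitted rows is still visible. If a fresh 0..n-1 index is wanted, `ignore_index=True` renumbers the rows. A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"v": range(5)})
pieces = [df[:2], df[3:]]

kept = pd.concat(pieces)                           # original labels preserved
renumbered = pd.concat(pieces, ignore_index=True)  # new consecutive labels

print(kept.index.tolist(), renumbered.index.tolist())
```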

### SQL-style joins with merge

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1,2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4,5]})
left
key lval
0 foo 1
1 foo 2
right
key rval
0 foo 4
1 foo 5
pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
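
Because both frames repeat the key `foo`, the result above contains every pairing of matching rows. The join type is chosen with the `how` argument, which defaults to an inner join. A sketch with made-up keys that only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key")               # keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # all keys, NaN where absent

print(inner)
print(outer)
```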

### Append

Append rows to the end of a DataFrame:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
A B C D
0 -0.535803 -0.319896 -0.313776 -0.401106
1 -0.231405 2.058233 0.771222 0.170204
2 -1.699222 -0.098205 0.465100 0.295165
3 -0.273538 -0.902247 -0.328348 0.771312
4 0.080118 0.796800 0.564468 0.526290
5 0.485221 0.478245 -0.943854 -0.097568
6 -0.440915 0.134749 -0.840602 -0.836712
7 -0.283432 -0.029233 1.725972 -0.878117
s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)
A B C D
0 -0.535803 -0.319896 -0.313776 -0.401106
1 -0.231405 2.058233 0.771222 0.170204
2 -1.699222 -0.098205 0.465100 0.295165
3 -0.273538 -0.902247 -0.328348 0.771312
4 0.080118 0.796800 0.564468 0.526290
5 0.485221 0.478245 -0.943854 -0.097568
6 -0.440915 0.134749 -0.840602 -0.836712
7 -0.283432 -0.029233 1.725972 -0.878117
8 -0.273538 -0.902247 -0.328348 0.771312

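### Grouping with groupby

A "group by" operation generally involves the following steps:
- Split the data into groups according to some criteria
- Apply a function to each group independently
- Combine the results into a single data structure

df = pd.DataFrame({
        'A': ['foo','bar','foo','bar','foo','bar','foo','foo'],
        'B': ['one','one','two','three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    })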
df
A B C D
0 foo one 0.996471 0.659993
1 bar one 0.990690 -1.102114
2 foo two -0.138965 0.236194
3 bar three 0.033469 0.253152
4 foo two -0.574320 0.081216
5 bar two 1.992456 0.939238
6 foo one -0.514013 -1.610422
7 foo three -0.640462 -1.606399

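Group by column A, then apply sum to each group:

# Select the numeric columns explicitly; pandas 2.0 and later would otherwise also "sum" the strings in B
df.groupby('A')[['C', 'D']].sum()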
C D
A
bar 3.016615 0.090276
foo -0.871289 -2.239418

Group by multiple columns, then apply sum:

df.groupby(['A', 'B']).sum()
C D
A B
bar one 0.990690 -1.102114
three 0.033469 0.253152
two 1.992456 0.939238
foo one 0.482458 -0.950429
three -0.640462 -1.606399
two -0.713285 0.317410
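
The split, apply, and combine steps can also produce several statistics at once through `agg`. This is a sketch on a small made-up frame, not the random data above:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Split on A, apply three aggregations to C, combine into one table.
summary = df.groupby("A")["C"].agg(["sum", "mean", "count"])

print(summary)
```

Each row of `summary` corresponds to one group and each column to one of the requested aggregations.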

## Reshaping and pivoting

### Stacking and unstacking

tuples = list(zip(*[
            ['bar','bar', 'baz', 'baz', 'foo','foo', 'qux', 'qux'],
            ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two' ]
        ]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A','B'])
df2 = df[:4]
df2
A B
first second
bar one -0.084595 1.495368
two -0.801703 -0.663997
baz one -0.108681 -0.986022
two -0.524829 0.983664
# stack pivots the columns into the rows, adding an inner index level
stacked = df2.stack()
stacked

first second
bar one A -0.084595
B 1.495368
two A -0.801703
B -0.663997
baz one A -0.108681
B -0.986022
two A -0.524829
B 0.983664
dtype: float64

# unstack pivots an index level back into columns
stacked.unstack()
A B
first second
bar one -0.084595 1.495368
two -0.801703 -0.663997
baz one -0.108681 -0.986022
two -0.524829 0.983664
# By default unstack operates on the innermost level; a different level can be specified
stacked.unstack(1)
second one two
first
bar A -0.084595 -0.801703
B 1.495368 -0.663997
baz A -0.108681 -0.524829
B -0.986022 0.983664
stacked.unstack(0)
first bar baz
second
one A -0.084595 -0.108681
B 1.495368 -0.986022
two A -0.801703 -0.524829
B -0.663997 0.983664
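
On a frame without missing values, `stack` and `unstack` are inverse operations. A minimal round trip, using a made-up two-by-two frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

stacked = df.stack()          # Series indexed by (row label, column label)
restored = stacked.unstack()  # back to one column per original label

print(stacked)
print(restored)
```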

### Pivot tables

df = pd.DataFrame({
        'A': ['one', 'two', 'three', 'four'] * 3,
        'B': ['A', 'B', 'C'] * 4,
        'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
        'D': np.random.randn(12),
        'E': np.random.randn(12)
    })
df
A B C D E
0 one A foo 0.319799 -1.264188
1 two B foo 0.929552 -0.092799
2 three C foo -2.510099 0.979121
3 four A bar 1.727211 0.083378
4 one B bar 0.636672 -0.167700
5 two C bar 0.337749 0.782511
6 three A foo 0.429180 -2.415025
7 four B foo 0.334974 -1.997174
8 one C foo 0.248257 -1.003121
9 two A bar 0.465319 1.133168
10 three B bar 0.111670 -0.730784
11 four C bar -1.903981 -0.089501
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
four A 1.727211 NaN
B NaN 0.334974
C -1.903981 NaN
one A NaN 0.319799
B 0.636672 NaN
C NaN 0.248257
three A NaN 0.429180
B 0.111670 NaN
C NaN -2.510099
two A 0.465319 NaN
B NaN 0.929552
C 0.337749 NaN
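
By default `pivot_table` averages the values that fall into the same cell and leaves empty combinations as NaN. Both behaviours can be changed with `aggfunc` and `fill_value`, as in this sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "foo", "foo", "bar"],
    "D": [1.0, 3.0, 5.0, 7.0],
})

# Sum instead of the default mean, and show 0 for empty cells.
table = pd.pivot_table(df, values="D", index="A", columns="C",
                       aggfunc="sum", fill_value=0)

print(table)
```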

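## Time series

pandas provides simple, efficient functions for resampling during frequency conversion, for example turning per-second data into 10-second totals.

# pandas 2.2 and later prefer the lowercase alias 's'; the uppercase 'S' is deprecated there
rng = pd.date_range('1/1/2016', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample('10S').sum()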

2016-01-01 00:00:00 2910
2016-01-01 00:00:10 2506
2016-01-01 00:00:20 2812
2016-01-01 00:00:30 2923
2016-01-01 00:00:40 2510
2016-01-01 00:00:50 2817
2016-01-01 00:01:00 2672
2016-01-01 00:01:10 2486
2016-01-01 00:01:20 3243
2016-01-01 00:01:30 2865
Freq: 10S, dtype: int64
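
Resampling is easier to follow with fixed numbers. In this sketch, six one-minute observations (made-up values) are grouped into three-minute bins and summed:

```python
import pandas as pd

rng = pd.date_range("2016-01-01", periods=6, freq="min")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=rng)

totals = ts.resample("3min").sum()  # bins start at 00:00 and 00:03

print(totals)
```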

Time zone handling: first localize the naive timestamps to UTC

rng = pd.date_range('2/2/2016 00:00', periods=5 , freq='D')

ts = pd.Series(np.random.randn(len(rng)), rng)

ts

2016-02-02 -0.662500
2016-02-03 -0.762211
2016-02-04 0.954675
2016-02-05 -0.411404
2016-02-06 0.237898
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')

ts_utc

2016-02-02 00:00:00+00:00 -0.662500
2016-02-03 00:00:00+00:00 -0.762211
2016-02-04 00:00:00+00:00 0.954675
2016-02-05 00:00:00+00:00 -0.411404
2016-02-06 00:00:00+00:00 0.237898
Freq: D, dtype: float64

Convert to another time zone:

ts_utc.tz_convert('Asia/Shanghai')

2016-02-02 08:00:00+08:00 -0.662500
2016-02-03 08:00:00+08:00 -0.762211
2016-02-04 08:00:00+08:00 0.954675
2016-02-05 08:00:00+08:00 -0.411404
2016-02-06 08:00:00+08:00 0.237898
Freq: D, dtype: float64
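
The two steps do different things: `tz_localize` attaches a time zone to naive timestamps, while `tz_convert` re-expresses the same instants in another zone. A sketch with a single made-up timestamp:

```python
import pandas as pd

ts = pd.Series([1.0], index=pd.DatetimeIndex(["2016-02-02 00:00"]))

utc = ts.tz_localize("UTC")                 # naive -> aware, wall time unchanged
shanghai = utc.tz_convert("Asia/Shanghai")  # same instant, shown at UTC+8

print(utc.index[0], shanghai.index[0])
```

The two index entries compare equal because they refer to the same moment; only the displayed wall-clock time differs.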

Converting between timestamps and periods

# Month-end frequency (pandas 2.2 and later spell this alias 'ME')
rng = pd.date_range('1/1/2016', periods=5, freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts

2016-01-31 -2.143138
2016-02-29 1.683414
2016-03-31 -0.427250
2016-04-30 -0.900378
2016-05-31 -1.039857
Freq: M, dtype: float64

ps = ts.to_period()
ps

2016-01 -2.143138
2016-02 1.683414
2016-03 -0.427250
2016-04 -0.900378
2016-05 -1.039857
Freq: M, dtype: float64

ps.to_timestamp()

2016-01-01 -2.143138
2016-02-01 1.683414
2016-03-01 -0.427250
2016-04-01 -0.900378
2016-05-01 -1.039857
Freq: MS, dtype: float64

## Categoricals

df = pd.DataFrame({
        "id": [1,2,3,4,5,6],
        "raw_grade": ['a','b','b','a','a','e']
    })

# Convert the raw grades to a categorical data type
df['grade'] = df['raw_grade'].astype('category')
df[‘grade‘]

0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

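# Rename the categories to more meaningful names
# (assigning to .cat.categories no longer works in pandas 2.0 and later)
df['grade'] = df['grade'].cat.rename_categories(['very good', 'good', 'bad'])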
df.grade

0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: grade, dtype: category
Categories (3, object): [very good, good, bad]

# Reorder the categories and add the missing ones at the same time
df['grade'] = df['grade'].cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])
df.grade

0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

# Sorting follows the category order, not lexical order
df.sort_values(by='grade')
id raw_grade grade
5 6 e bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
# observed=False keeps the empty categories in the result
df.groupby('grade', observed=False).size()
grade
very bad     0
bad          1
medium       0
good         2
very good    3
dtype: int64
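
When the category order carries meaning, it can be declared explicitly with an ordered `CategoricalDtype`, which also enables `min` and `max`. A sketch with made-up grades:

```python
import pandas as pd

grade_type = pd.CategoricalDtype(["bad", "good", "very good"], ordered=True)
grades = pd.Series(["good", "bad", "very good"]).astype(grade_type)

ordered = grades.sort_values().tolist()  # follows the declared order
lowest, highest = grades.min(), grades.max()

print(ordered, lowest, highest)
```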

## Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot(grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7ffa41fa9908>

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()

plt.figure()
df.plot(grid=True)
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x7ffa41de6e48>

<matplotlib.figure.Figure at 0x7ffa41f1e198>

Date: 2024-10-05 23:26:10
