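Pandas Cookbook -- 07 Grouping: Aggregation, Filtering, and Transformation

This is based on SeanCheney's translation published on Jianshu. I adjusted the formatting and reorganized the section structure so that it is easier for me to read and to search through later.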

import pandas as pd
import numpy as np

Set the maximum number of columns and rows to display

pd.set_option('display.max_columns', 8, 'display.max_rows', 8)

1 Aggregation

Read the flights dataset and look at the first rows

flights = pd.read_csv('data/flights.csv')
flights.head()

MONTH DAY WEEKDAY AIRLINE ... SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN ... 1905 65.0 0 0
1 1 1 4 UA ... 1333 -13.0 0 0
2 1 1 4 MQ ... 1453 35.0 0 0
3 1 1 4 AA ... 1935 -7.0 0 0
4 1 1 4 WN ... 2225 39.0 0 0

5 rows × 14 columns

1.1 Aggregating a single column

Group by AIRLINE and call the agg method, passing a dictionary that maps the column to aggregate to the aggregation function

flights.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'}).head()

ARR_DELAY
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580

Alternatively, select the column with the indexing operator and pass the aggregation function to agg as a string

flights.groupby('AIRLINE')['ARR_DELAY'].agg('mean').head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

You can also pass NumPy's mean function to agg

flights.groupby('AIRLINE')['ARR_DELAY'].agg(np.mean).head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

Or call the mean method directly

flights.groupby('AIRLINE')['ARR_DELAY'].mean().head()
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64
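
As a small check of the equivalence described above, the following sketch runs three of the spellings on a made-up table. The toy DataFrame and its values are assumptions standing in for flights.csv, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for flights.csv (values are made up).
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'DL', 'DL', 'DL'],
    'ARR_DELAY': [10.0, 20.0, -5.0, 5.0, np.nan],
})

# Dictionary form returns a DataFrame, so pull the column back out.
by_dict = toy.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'})['ARR_DELAY']
# Column selection plus a string name.
by_string = toy.groupby('AIRLINE')['ARR_DELAY'].agg('mean')
# Calling the method directly.
by_method = toy.groupby('AIRLINE')['ARR_DELAY'].mean()

# All three hold the same values; missing delays are skipped by default.
print(by_method)
```

The dictionary form is the only one of the three that keeps a DataFrame, which matters once several columns are aggregated at the same time.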

1.2 Aggregating multiple columns

Number of cancelled flights for each airline on each weekday

flights.groupby(['AIRLINE', 'WEEKDAY'])['CANCELLED'].agg('sum').head(7)
AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
         6          21
         7          29
Name: CANCELLED, dtype: int64

Grouping, column selection, and aggregation can each take more than one item: several grouping columns, several selected columns, and several aggregation functions.

Total number and proportion of cancelled or diverted flights for each airline on each weekday (select several columns with a list, that is, double brackets)

flights.groupby(['AIRLINE', 'WEEKDAY'])[['CANCELLED', 'DIVERTED']].agg(['sum', 'mean']).head(7)

                CANCELLED           DIVERTED
                      sum      mean      sum      mean
AIRLINE WEEKDAY
AA      1              41  0.032106        6  0.004699
        2               9  0.007341        2  0.001631
        3              16  0.011949        2  0.001494
        4              20  0.015004        5  0.003751
        5              18  0.014151        1  0.000786
        6              21  0.018667        9  0.008000
        7              29  0.021837        1  0.000753

Use a list of grouping columns and a dictionary of lists to group by and aggregate several columns

For each route, find the total number of flights, the number and proportion of cancelled flights, and the mean and variance of the air time

group_cols = ['ORG_AIR', 'DEST_AIR']
agg_dict = {'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']}
flights.groupby(group_cols).agg(agg_dict).head()

                 CANCELLED               AIR_TIME
                       sum mean size        mean        var
ORG_AIR DEST_AIR
ATL     ABE              0  0.0   31   96.387097  45.778495
        ABQ              0  0.0   16  170.500000  87.866667
        ABY              0  0.0   19   28.578947   6.590643
        ACY              0  0.0    6   91.333333  11.466667
        AEX              0  0.0   40   78.725000  47.332692
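
The dictionary-of-lists pattern can be tried on a few rows to see where each number comes from. This is only a sketch: the toy routes and values below are invented for illustration and are not taken from flights.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy flights (made-up values); one air time is missing.
toy = pd.DataFrame({
    'ORG_AIR': ['ATL', 'ATL', 'ATL', 'DEN'],
    'DEST_AIR': ['ABQ', 'ABQ', 'ABQ', 'ABQ'],
    'CANCELLED': [0, 1, 0, 0],
    'AIR_TIME': [170.0, np.nan, 172.0, 60.0],
})

agg_dict = {'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']}
result = toy.groupby(['ORG_AIR', 'DEST_AIR']).agg(agg_dict)

# The columns form a two-level index: (original column, function name).
print(result.columns.tolist())
print(result)
```

Here 'size' counts every row of the group, including the cancelled flight whose AIR_TIME is missing, while 'mean' and 'var' on AIR_TIME skip the missing value.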

1.3 The DataFrameGroupBy object

The groupby method returns a DataFrameGroupBy object

college = pd.read_csv('data/college.csv')
grouped = college.groupby(['STABBR', 'RELAFFIL'])

Check the type of the grouped object

type(grouped)
pandas.core.groupby.groupby.DataFrameGroupBy

Use the dir function to list all public attributes and methods of the object

print([attr for attr in dir(grouped) if not attr.startswith('_')])
['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM', 'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL', 'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN', 'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN', 'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']

Use the ngroups attribute to get the number of groups

grouped.ngroups
112

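Look at the unique label that identifies each group

The groups attribute is a dictionary that maps each distinct group to the index labels of its rows

groups = list(grouped.groups.keys())
groups[:6]
[('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]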

Call get_group with a tuple containing the group label

For example, get all religiously affiliated schools in Florida

grouped.get_group(('FL', 1)).head()

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
712 The Baptist College of Florida Graceville FL 0.0 ... 0.5602 0.3531 30800 20052
713 Barry University Miami FL 0.0 ... 0.6733 0.4361 44100 28250
714 Gooding Institute of Nurse Anesthesia Panama City FL 0.0 ... NaN NaN NaN PrivacySuppressed
715 Bethune-Cookman University Daytona Beach FL 1.0 ... 0.8867 0.0647 29400 36250
724 Johnson University Florida Kissimmee FL 0.0 ... 0.7384 0.2185 26300 20199

5 rows × 27 columns
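
The three inspection tools above (ngroups, groups, and get_group) can be exercised on a few rows. This is a minimal sketch; the toy table is an assumption that only mimics the STABBR and RELAFFIL columns of college.csv.

```python
import pandas as pd

# Hypothetical toy data mimicking two columns of college.csv (made-up values).
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL', 'AL'],
    'RELAFFIL': [0, 1, 0, 0, 1],
    'UGDS': [100, 200, 300, 400, 500],
})
grouped = toy.groupby(['STABBR', 'RELAFFIL'])

n_groups = grouped.ngroups                  # number of distinct (state, affiliation) pairs
keys = sorted(grouped.groups.keys())        # one tuple per group
al_secular = grouped.get_group(('AL', 0))   # rows of a single group as a DataFrame

print(n_groups)
print(keys)
print(al_secular)
```

With two grouping columns the label passed to get_group has to be a tuple with one element per grouping column, in the same order as in the groupby call.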

A groupby object is iterable, so you can look at the groups one by one

from IPython.display import display  # available by default only inside Jupyter

i = 0
for name, group in grouped:
    print(name)
    display(group.head(2))
    i += 1
    if i == 5:
        break
('AK', 0)

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
60 University of Alaska Anchorage Anchorage AK 0.0 ... 0.2647 0.4386 42500 19449.5
62 University of Alaska Fairbanks Fairbanks AK 0.0 ... 0.2550 0.4519 36200 19355

2 rows × 27 columns

('AK', 1)

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
61 Alaska Bible College Palmer AK 0.0 ... 0.2857 0.4286 NaN PrivacySuppressed
64 Alaska Pacific University Anchorage AK 0.0 ... 0.5297 0.4910 47000 23250

2 rows × 27 columns

('AL', 0)

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5

2 rows × 27 columns

('AL', 1)

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
10 Birmingham Southern College Birmingham AL 0.0 ... 0.4809 0.0152 44200 27000

2 rows × 27 columns

('AR', 0)

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
128 University of Arkansas at Little Rock Little Rock AR 0.0 ... 0.4775 0.4062 33900 21736
129 University of Arkansas for Medical Sciences Little Rock AR 0.0 ... 0.6144 0.5133 61400 12500

2 rows × 27 columns

Calling the head method on a groupby object shows the first rows of every group in a single DataFrame

grouped.head(2).head(6)

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
10 Birmingham Southern College Birmingham AL 0.0 ... 0.4809 0.0152 44200 27000
43 Prince Institute-Southeast Elmhurst IL 0.0 ... 0.9375 0.6569 PrivacySuppressed 20992
60 University of Alaska Anchorage Anchorage AK 0.0 ... 0.2647 0.4386 42500 19449.5

6 rows × 27 columns

The nth method selects rows from each group by position. Positions start at 0, so the call below selects the second row (position 1) and the last row (position -1) of each group

grouped.nth([1, -1]).head(8)

                       CITY  CURROPER  DISTANCEONLY GRAD_DEBT_MDN_SUPP  ...  UGDS_NRA  UGDS_UNKN  UGDS_WHITE  WOMENONLY
STABBR RELAFFIL
AK     0          Fairbanks         1           0.0              19355  ...    0.0110     0.3060      0.4259        0.0
       0             Barrow         1           0.0  PrivacySuppressed  ...    0.0183     0.0000      0.1376        0.0
       1          Anchorage         1           0.0              23250  ...    0.0000     0.0873      0.5309        0.0
       1           Soldotna         1           0.0  PrivacySuppressed  ...    0.0000     0.1324      0.0588        0.0
AL     0         Birmingham         1           0.0            21941.5  ...    0.0179     0.0100      0.5922        0.0
       0             Dothan         1           0.0  PrivacySuppressed  ...       NaN        NaN         NaN        0.0
       1         Birmingham         1           0.0              27000  ...    0.0000     0.0051      0.7983        0.0
       1         Huntsville         1           NaN            36173.5  ...       NaN        NaN         NaN        NaN

8 rows × 25 columns
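
Because nth counts positions from 0, a tiny example makes it easier to see which rows come back. The sketch below uses an invented two-group table; the index of the nth result is known to differ between pandas versions, so only the selected values are inspected.

```python
import pandas as pd

# Hypothetical toy data: two groups of three rows each (made-up values).
toy = pd.DataFrame({
    'g': ['a', 'a', 'a', 'b', 'b', 'b'],
    'v': [1, 2, 3, 4, 5, 6],
})
grouped = toy.groupby('g')

# Position 0 is the first row of each group.
first_rows = grouped.nth(0)['v'].tolist()
# Position 1 is the second row, position -1 is the last row.
second_and_last = grouped.nth([1, -1])['v'].tolist()

print(first_rows)
print(second_and_last)
```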

2 Aggregation functions

college = pd.read_csv('data/college.csv')
college.head()

INSTNM CITY STABBR HBCU ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama A & M University Normal AL 1.0 ... 0.8284 0.1049 30300 33888
1 University of Alabama at Birmingham Birmingham AL 0.0 ... 0.5214 0.2422 39700 21941.5
2 Amridge University Montgomery AL 0.0 ... 0.7795 0.8540 40100 23370
3 University of Alabama in Huntsville Huntsville AL 0.0 ... 0.4596 0.2640 45500 24097
4 Alabama State University Montgomery AL 1.0 ... 0.7554 0.1270 26600 33118.5

5 rows × 27 columns

2.1 Custom aggregation functions

Find the mean and standard deviation of the undergraduate population in each state

college.groupby('STABBR')['UGDS'].agg(['mean', 'std']).round(0).head()

mean std
STABBR
AK 2493.0 4052.0
AL 2790.0 4658.0
AR 1644.0 3143.0
AS 1276.0 NaN
AZ 4130.0 14894.0

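Write a custom function that returns the largest number of standard deviations by which any value in the group lies away from the group mean

def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

Pass the custom function itself (its name, without parentheses) to agg

college.groupby('STABBR')['UGDS'].agg(max_deviation).round(1).head()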
STABBR
AK    2.6
AL    5.8
AR    6.3
AS    NaN
AZ    9.9
Name: UGDS, dtype: float64
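
To see what max_deviation actually computes, it helps to work one group by hand. In the sketch below the toy enrollment numbers are invented; the single-row group shows why a state such as AS, which contributes only one value, ends up as NaN.

```python
import numpy as np
import pandas as pd

def max_deviation(s):
    # Standardize each value, then keep the largest absolute z-score.
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

# Hypothetical toy data (made-up values); 'AS' has a single school.
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AK', 'AK', 'AS'],
    'UGDS': [1.0, 2.0, 3.0, 10.0, 7.0],
})
result = toy.groupby('STABBR')['UGDS'].agg(max_deviation)

# By hand for AK: mean = 4, squared deviations sum to 9 + 4 + 1 + 36 = 50,
# sample std = sqrt(50 / 3), and the largest gap is 10 - 4 = 6.
expected_ak = 6 / np.sqrt(50 / 3)

print(result)
print(expected_ak)
```

The sample standard deviation of a single value is undefined, so the division yields NaN for the one-row group.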

A custom aggregation function also works on several numeric columns at once

college.groupby('STABBR')[['UGDS', 'SATVRMID', 'SATMTMID']].agg(max_deviation).round(1).head()

UGDS SATVRMID SATMTMID
STABBR
AK 2.6 NaN NaN
AL 5.8 1.6 1.8
AR 6.3 2.2 2.3
AS NaN NaN NaN
AZ 9.9 1.9 1.4

Custom aggregation functions can also be combined with the built-in ones

college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']].agg([max_deviation, 'mean', 'std']).round(1).head()

                          UGDS                      SATVRMID              SATMTMID
                 max_deviation    mean     std max_deviation  ...   std max_deviation   mean   std
STABBR RELAFFIL
AK     0                   2.1  3508.9  4539.5           NaN  ...   NaN           NaN    NaN   NaN
       1                   1.1   123.3   132.9           NaN  ...   NaN           NaN  503.0   NaN
AL     0                   5.2  3248.8  5102.4           1.6  ...  56.5           1.7  515.8  56.7
       1                   2.4   979.7   870.8           1.5  ...  53.0           1.4  485.6  61.4
AR     0                   5.8  1793.7  3401.6           1.9  ...  37.9           2.0  503.6  39.0

5 rows × 9 columns

Pandas uses the function's name as the name of the returned column; you can change it afterwards with the rename method, or beforehand by setting the function's __name__ attribute

max_deviation.__name__
'max_deviation'
max_deviation.__name__ = 'Max Deviation'
(college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATVRMID', 'SATMTMID']]
        .agg([max_deviation, 'mean', 'std'])
        .round(1)
        .head())

                          UGDS                      SATVRMID              SATMTMID
                 Max Deviation    mean     std Max Deviation  ...   std Max Deviation   mean   std
STABBR RELAFFIL
AK     0                   2.1  3508.9  4539.5           NaN  ...   NaN           NaN    NaN   NaN
       1                   1.1   123.3   132.9           NaN  ...   NaN           NaN  503.0   NaN
AL     0                   5.2  3248.8  5102.4           1.6  ...  56.5           1.7  515.8  56.7
       1                   2.4   979.7   870.8           1.5  ...  53.0           1.4  485.6  61.4
AR     0                   5.8  1793.7  3401.6           1.9  ...  37.9           2.0  503.6  39.0

5 rows × 9 columns

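2.2 Customizing aggregation functions with *args and **kwargs

Define a function that returns the proportion of schools whose undergraduate population is between 1,000 and 3,000

def pct_between_1_3k(s):
    return s.between(1000, 3000).mean()

Group by state and religious affiliation, then aggregate

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between_1_3k).head(9)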
STABBR  RELAFFIL
AK      0           0.142857
        1           0.000000
AL      0           0.236111
        1           0.333333
                      ...
AR      1           0.111111
AS      0           1.000000
AZ      0           0.096774
        1           0.000000
Name: UGDS, Length: 9, dtype: float64

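This function does not let the caller choose the lower and upper bounds, so write a new one that takes them as parameters

def pct_between(s, low, high):
    return s.between(low, high).mean()

Use this custom aggregation function and pass the lower and upper bounds as positional arguments

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, 10000).head(9)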
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, Length: 9, dtype: float64

Pass the lower and upper bounds explicitly as keyword arguments

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, high=10000, low=1000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, Length: 9, dtype: float64

Positional and keyword arguments can also be mixed, as long as the keyword arguments come last

college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg(pct_between, 1000, high=10000).head(9)
STABBR  RELAFFIL
AK      0           0.428571
        1           0.000000
AL      0           0.458333
        1           0.375000
                      ...
AR      1           0.166667
AS      0           1.000000
AZ      0           0.233871
        1           0.111111
Name: UGDS, Length: 9, dtype: float64
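
The three calling styles above forward their extra arguments to the custom function in the same way, which a small made-up table can confirm. The toy data and the column name g are assumptions for illustration only.

```python
import pandas as pd

def pct_between(s, low, high):
    # Share of values in the closed interval [low, high].
    return s.between(low, high).mean()

# Hypothetical toy enrollment numbers (made-up values).
toy = pd.DataFrame({
    'g': ['a', 'a', 'a', 'a', 'b', 'b'],
    'UGDS': [500, 1500, 2500, 20000, 800, 900],
})
ugds = toy.groupby('g')['UGDS']

# Extra arguments after the function are passed through to it.
positional = ugds.agg(pct_between, 1000, 3000)
keyword = ugds.agg(pct_between, low=1000, high=3000)
mixed = ugds.agg(pct_between, 1000, high=3000)

print(positional)
```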

Pandas does not support passing extra arguments when several aggregation functions are given at once

Use a closure to build customized aggregation functions instead

def make_agg_func(func, name, *args, **kwargs):
    # Bind the extra arguments now; pandas later calls wrapper with the group only.
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name  # pandas uses this as the column name
    return wrapper

my_agg1 = make_agg_func(pct_between, 'pct_1_3k', low=1000, high=3000)
my_agg2 = make_agg_func(pct_between, 'pct_10_30k', 10000, 30000)
college.groupby(['STABBR', 'RELAFFIL'])['UGDS'].agg([my_agg1, my_agg2])

                 pct_1_3k  pct_10_30k
STABBR RELAFFIL
AK     0         0.142857    0.142857
       1         0.000000    0.000000
AL     0         0.236111    0.083333
       1         0.333333    0.000000
...                   ...         ...
WI     1         0.360000    0.000000
WV     0         0.246154    0.015385
       1         0.375000    0.000000
WY     0         0.545455    0.000000

112 rows × 2 columns
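
As a possible alternative to the closure, reasonably recent pandas versions accept named aggregation on a grouped column, where each keyword becomes a result column. This is a sketch under that assumption, run on invented data; it is not the approach used in the original text.

```python
import pandas as pd

def pct_between(s, low, high):
    return s.between(low, high).mean()

# Hypothetical toy enrollment numbers (made-up values).
toy = pd.DataFrame({
    'g': ['a', 'a', 'a', 'a', 'b', 'b'],
    'UGDS': [500, 1500, 2500, 20000, 800, 900],
})

# Each keyword names an output column; the lambdas fix the bounds.
result = toy.groupby('g')['UGDS'].agg(
    pct_1_3k=lambda s: pct_between(s, 1000, 3000),
    pct_10_30k=lambda s: pct_between(s, 10000, 30000),
)

print(result)
```

The closure has the advantage of working on older pandas versions and of producing reusable, named function objects; the keyword form is shorter when the functions are only needed once.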

3 Removing the MultiIndex after aggregation

Read the data

flights = pd.read_csv('data/flights.csv')
flights.head()

MONTH DAY WEEKDAY AIRLINE ... SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN ... 1905 65.0 0 0
1 1 1 4 UA ... 1333 -13.0 0 0
2 1 1 4 MQ ... 1453 35.0 0 0
3 1 1 4 AA ... 1935 -7.0 0 0
4 1 1 4 WN ... 2225 39.0 0 0

5 rows × 14 columns

Group by 'AIRLINE' and 'WEEKDAY', then aggregate DIST and ARR_DELAY separately

airline_info = (flights.groupby(['AIRLINE', 'WEEKDAY'])
                       .agg({'DIST': ['sum', 'mean'],
                             'ARR_DELAY': ['min', 'max']})
                       .astype(int))
airline_info.head()

                    DIST       ARR_DELAY
                     sum  mean       min  max
AIRLINE WEEKDAY
AA      1        1455386  1139       -60  551
        2        1358256  1107       -52  725
        3        1496665  1117       -45  473
        4        1452394  1089       -46  349
        5        1427749  1122       -41  732

Both the rows and the columns now have a two-level index

3.1 Joining the column levels

get_level_values(0) returns the first level of the column index

level0 = airline_info.columns.get_level_values(0)

get_level_values(1) returns the second level

level1 = airline_info.columns.get_level_values(1)

Join the first and second levels into a new, single-level column index

airline_info.columns = level0 + '_' + level1
airline_info.head(7)

                 DIST_sum  DIST_mean  ARR_DELAY_min  ARR_DELAY_max
AIRLINE WEEKDAY
AA      1         1455386       1139            -60            551
        2         1358256       1107            -52            725
        3         1496665       1117            -45            473
        4         1452394       1089            -46            349
        5         1427749       1122            -41            732
        6         1265340       1124            -50            858
        7         1461906       1100            -49            626
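
The same flattening can be written as a single list comprehension, since every label of a two-level column index is a tuple. The sketch below assumes a small invented table in place of flights.csv.

```python
import pandas as pd

# Hypothetical toy flights (made-up values).
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'DL'],
    'DIST': [100, 300, 500],
    'ARR_DELAY': [5.0, -3.0, 12.0],
})
info = toy.groupby('AIRLINE').agg({'DIST': ['sum', 'mean'],
                                   'ARR_DELAY': ['min', 'max']})

# Each column label is a tuple such as ('DIST', 'sum'); join its parts.
info.columns = ['_'.join(col) for col in info.columns]

print(info)
```

This form also extends to column indexes with more than two levels, because join accepts tuples of any length as long as every part is a string.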

3.2 Resetting the row index

reset_index() moves the levels of the row index into ordinary columns, leaving a single-level index

airline_info.reset_index().head(7)

AIRLINE WEEKDAY DIST_sum DIST_mean ARR_DELAY_min ARR_DELAY_max
0 AA 1 1455386 1139 -60 551
1 AA 2 1358256 1107 -52 725
2 AA 3 1496665 1117 -45 473
3 AA 4 1452394 1089 -46 349
4 AA 5 1427749 1122 -41 732
5 AA 6 1265340 1124 -50 858
6 AA 7 1461906 1100 -49 626

By default, pandas puts all of the grouping columns into the index after a groupby operation; setting as_index to False prevents this.

Calling reset_index after grouping gives the same result

flights.groupby(['AIRLINE'], as_index=False)['DIST'].agg('mean').round(0)

AIRLINE DIST
0 AA 1114.0
1 AS 1066.0
2 B6 1772.0
3 DL 866.0
... ... ...
10 UA 1231.0
11 US 1181.0
12 VX 1240.0
13 WN 810.0

14 rows × 2 columns
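
The two routes to a flat result, as_index=False and reset_index, can be compared side by side on a few rows. The toy table below is an assumption used only to keep the sketch self-contained.

```python
import pandas as pd

# Hypothetical toy flights (made-up values).
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'DL'],
    'DIST': [100, 300, 500],
})

# Keep the grouping column as a regular column from the start...
flat_a = toy.groupby('AIRLINE', as_index=False)['DIST'].agg('mean')
# ...or aggregate first and move the index back into a column afterwards.
flat_b = toy.groupby('AIRLINE')['DIST'].agg('mean').reset_index()

print(flat_a)
print(flat_b)
```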

4 Filtering groups

college = pd.read_csv('data/college.csv', index_col='INSTNM')
grouped = college.groupby('STABBR')
grouped.ngroups
59

This equals the number of distinct states, which nunique() also returns

college['STABBR'].nunique()
59

Define a function that computes the overall proportion of minority (non-white) students in a group and returns True if that proportion exceeds the threshold

def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    total_minority_pct = total_minority / total_ugds
    return total_minority_pct > threshold

The grouped object has a filter method that accepts a custom function; the function's True or False result decides whether each group is kept

college_filtered = grouped.filter(check_minority, threshold=.5)
college_filtered.head()

CITY STABBR HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM
Everest College-Phoenix Phoenix AZ 0.0 0.0 ... 0.7151 0.6700 28600 9500
Collins College Phoenix AZ 0.0 0.0 ... 0.8228 0.4764 25700 47000
Empire Beauty School-Paradise Valley Phoenix AZ 0.0 0.0 ... 0.5873 0.4651 17800 9588
Empire Beauty School-Tucson Tucson AZ 0.0 0.0 ... 0.6615 0.4229 18200 9833
Thunderbird School of Global Management Glendale AZ 0.0 0.0 ... 0.0000 0.0000 118900 PrivacySuppressed

5 rows × 26 columns

Comparing the shapes shows that about 60% of the rows were filtered out; minority students form the majority in only 20 states

college.shape
(7535, 26)
college_filtered.shape
(3028, 26)
college_filtered['STABBR'].nunique()
20

Try a few other thresholds and check the shape and the number of distinct states

college_filtered_20 = grouped.filter(check_minority, threshold=.2)
college_filtered_20.shape, college_filtered_20['STABBR'].nunique()
((7461, 26), 57)
college_filtered_70 = grouped.filter(check_minority, threshold=.7)
college_filtered_70.shape, college_filtered_70['STABBR'].nunique()
((957, 26), 10)
college_filtered_95 = grouped.filter(check_minority, threshold=.95)
college_filtered_95.shape, college_filtered_95['STABBR'].nunique()
((156, 26), 7)
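
How filter keeps or drops whole groups is easiest to follow with numbers small enough to check by hand. In this sketch the two states and their values are invented, and mostly_minority is a hypothetical, condensed version of check_minority.

```python
import pandas as pd

def mostly_minority(df, threshold):
    # Enrollment-weighted share of non-white students in the group.
    minority = (df['UGDS'] * (1 - df['UGDS_WHITE'])).sum()
    return minority / df['UGDS'].sum() > threshold

# Hypothetical toy data (made-up values).
toy = pd.DataFrame({
    'STABBR': ['AZ', 'AZ', 'VT', 'VT'],
    'UGDS': [1000, 3000, 2000, 2000],
    'UGDS_WHITE': [0.2, 0.4, 0.9, 0.7],
})

# AZ: (800 + 1800) / 4000 = 0.65, kept; VT: (200 + 600) / 4000 = 0.20, dropped.
kept = toy.groupby('STABBR').filter(mostly_minority, threshold=0.5)

print(kept)
```

The result contains the original rows of the surviving groups, with the original index and all columns, rather than one row per group.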

5 The apply method

The apply method is the most flexible of all pandas methods

Read college and drop every row that has a missing value in any of the three columns 'UGDS', 'SATMTMID', and 'SATVRMID'

college = pd.read_csv('data/college.csv')
subset = ['UGDS', 'SATMTMID', 'SATVRMID']
college2 = college.dropna(subset=subset)
college.shape, college2.shape
((7535, 27), (1184, 27))

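5.1 apply versus agg

Define a function that computes the average SAT math score weighted by undergraduate population

def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

5.1.1 Applying the aggregation function with apply

Group by state and call the apply method, passing the custom function

college2.groupby('STABBR').apply(weighted_math_average).head()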
STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
dtype: int64

5.1.2 Applying the aggregation function with agg

college2.groupby('STABBR').agg(weighted_math_average).head()

INSTNM CITY HBCU MENONLY ... PCTFLOAN UG25ABV MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
STABBR
AK 503 503 503 503 ... 503 503 503 503
AL 536 536 536 536 ... 536 536 536 536
AR 529 529 529 529 ... 529 529 529 529
AZ 569 569 569 569 ... 569 569 569 569
CA 564 564 564 564 ... 564 564 564 564

5 rows × 26 columns

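agg is meant to return one value per column, so the call above just repeats the same weighted average in every column. Restricting the selection to the SATMTMID column raises an error, because the function can then no longer access UGDS.

# college2.groupby('STABBR')['SATMTMID'].agg(weighted_math_average)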
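
The reason apply works here is that it hands the function the whole sub-DataFrame of each group, so two columns can be combined. A minimal sketch on invented numbers, selecting only the two columns the function needs, might look like this:

```python
import pandas as pd

def weighted_math_average(df):
    # Needs two columns at once, which is why apply (not agg) is used.
    return (df['UGDS'] * df['SATMTMID']).sum() / df['UGDS'].sum()

# Hypothetical toy data (made-up values).
toy = pd.DataFrame({
    'STABBR': ['AK', 'AL', 'AL'],
    'UGDS': [100, 100, 300],
    'SATMTMID': [500, 400, 600],
})

# Each call receives a DataFrame holding the selected columns of one group.
result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(weighted_math_average)

# AL by hand: (100 * 400 + 300 * 600) / 400 = 550
print(result)
```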

5.2 Creating new columns with apply

A useful feature of apply is that a function returning a Series produces several new columns at once

from collections import OrderedDict

def weighted_average(df):
    data = OrderedDict()
    weight_m = df['UGDS'] * df['SATMTMID']
    weight_v = df['UGDS'] * df['SATVRMID']

    data['weighted_math_avg'] = weight_m.sum() / df['UGDS'].sum()
    data['weighted_verbal_avg'] = weight_v.sum() / df['UGDS'].sum()
    data['math_avg'] = df['SATMTMID'].mean()
    data['verbal_avg'] = df['SATVRMID'].mean()
    data['count'] = len(df)
    # astype(int) truncates the floats; passing dtype='int' to the constructor
    # may be rejected for non-integer floats in newer pandas versions
    return pd.Series(data).astype(int)

college2.groupby('STABBR').apply(weighted_average).head(10)

weighted_math_avg weighted_verbal_avg math_avg verbal_avg count
STABBR
AK 503 555 503 555 1
AL 536 533 504 508 21
AR 529 504 515 491 16
AZ 569 557 536 538 6
... ... ... ... ... ...
CT 545 533 522 517 14
DC 621 623 588 589 6
DE 569 553 495 486 3
FL 565 565 521 529 38

10 rows × 5 columns
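
The mapping from the returned Series to result columns can be checked on a small example: each label of the Series becomes a column, and each group becomes a row. The toy data and the helper name summarize below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def summarize(df):
    # The labels of the returned Series become the result's column names.
    return pd.Series({
        'weighted_avg': np.average(df['SATMTMID'], weights=df['UGDS']),
        'plain_avg': df['SATMTMID'].mean(),
        'count': len(df),
    })

# Hypothetical toy data (made-up values).
toy = pd.DataFrame({
    'STABBR': ['AK', 'AL', 'AL'],
    'UGDS': [100, 100, 300],
    'SATMTMID': [500, 400, 600],
})

result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(summarize)

print(result)
```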

5.3 Creating a DataFrame with apply

Define a function that returns a DataFrame

Use NumPy's average function for the weighted mean and SciPy's gmean and hmean for the geometric and harmonic means

from scipy.stats import gmean, hmean

def calculate_means(df):
    df_means = pd.DataFrame(index=['Arithmetic', 'Weighted', 'Geometric', 'Harmonic'])
    cols = ['SATMTMID', 'SATVRMID']
    for col in cols:
        arithmetic = df[col].mean()
        weighted = np.average(df[col], weights=df['UGDS'])
        geometric = gmean(df[col])
        harmonic = hmean(df[col])
        df_means[col] = [arithmetic, weighted, geometric, harmonic]
    df_means['count'] = len(df)
    return df_means.astype(int)

# Drop states with a single school, then compute the means for each remaining state
(college2.groupby('STABBR')
         .filter(lambda x: len(x) != 1)
         .groupby('STABBR')
         .apply(calculate_means)
         .head(10))

                   SATMTMID  SATVRMID  count
STABBR
AL     Arithmetic       504       508     21
       Weighted         536       533     21
       Geometric        500       505     21
       Harmonic         497       502     21
...                     ...       ...    ...
AR     Geometric        514       489     16
       Harmonic         513       487     16
AZ     Arithmetic       536       538      6
       Weighted         569       557      6

10 rows × 3 columns
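
When the function returns a DataFrame, pandas stacks the pieces and puts the group label in front of each piece's own index, which is where the two-level row index above comes from. The sketch below shows this with only two kinds of mean, on invented data, and without the SciPy dependency; two_means is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def two_means(df):
    # One row per kind of mean; the row labels become the inner index level.
    out = pd.DataFrame(index=['Arithmetic', 'Weighted'])
    out['SATMTMID'] = [
        df['SATMTMID'].mean(),
        np.average(df['SATMTMID'], weights=df['UGDS']),
    ]
    return out

# Hypothetical toy data (made-up values).
toy = pd.DataFrame({
    'STABBR': ['AK', 'AK', 'AL', 'AL'],
    'UGDS': [100, 100, 100, 300],
    'SATMTMID': [500, 520, 400, 600],
})

result = toy.groupby('STABBR')[['UGDS', 'SATMTMID']].apply(two_means)

# Rows are addressed by (state, kind of mean).
print(result)
```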

Original article: https://www.cnblogs.com/shiyushiyu/p/9757766.html

Time: 2024-11-05 23:26:18
