python 数据分析--数据处理工具Pandas(2) / 憋错料

在前面的学习中主要了解了Pandas如何构造序列和数据框，如何读取和写入各种格式的数据，以及如何对数据进行初步描述，本文将进一步了解Pandas如何处理字符串和日期数据，数据清洗，获取数据子集，透视表，分组聚合操作等内容。

4. Pandas处理字符串和日期数据

待处理的数据表

数据处理要求:

更改出生日期birthday和手机号tel两个字段的数据类型。

根据出生日期birthday和开始工作日期start_work两个字段新增年龄和工龄两个字段。

将手机号tel的中间四位隐藏起来。

根据邮箱信息新增邮箱域名字段。

基于other字段取出每个人员的专业信息。

import pandas as pd

#读入数据
employee_info = pd.read_excel(r"E:/Data/3/data_test03.xlsx",header=0)
employee_info.dtypes

name                  object
gender                object
birthday              object
start_work    datetime64[ns]
income                 int64
tel                    int64
email                 object
other                 object
dtype: object

# 更改数据类型
employee_info.birthday = pd.to_datetime(employee_info.birthday, format="%Y/%m/%d")
employee_info.tel = employee_info.tel.astype(‘str‘)
employee_info.dtypes

name                  object
gender                object
birthday      datetime64[ns]
start_work    datetime64[ns]
income                 int64
tel                   object
email                 object
other                 object
dtype: object

# 新增年龄和工龄字段
# 年龄 = 当天日期的年份 - 生日那一天的年份
# 工龄 = 当天日期的年份 - 开始工作那一天的年份
employee_info[‘age‘] = pd.datetime.today().year - employee_info.birthday.dt.year
employee_info[‘workage‘] = pd.datetime.today().year - employee_info.start_work.dt.year

# 新增邮箱域名字段
# 字符串分割、巧用了匿名函数 lambda
# split分出来的数据有两部分[邮箱名,域名]，域名的索引为1

employee_info[‘email_domain‘] = employee_info.email.apply(func = lambda x: x.split(‘@‘)[1])
employee_info

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	gender	birthday	start_work	income	tel	email	other	age	workage	email_domain
0	赵一	男	1989-08-10	2012-09-08	15000	13611011234	[email protected]	{教育：本科，专业：电子商务，爱好：运动}	31	8	qq.com
1	王二	男	1990-10-02	2014-03-06	12500	13500012234	[email protected]	{教育：大专，专业：汽修，爱好：}	30	6	163.com
2	张三	女	1987-03-12	2009-01-08	18500	13515273330	[email protected]	{教育：本科，专业：数学，爱好：打篮球}	33	11	qq.com
3	李四	女	1991-08-16	2014-06-04	13000	13923673388	[email protected]	{教育：硕士，专业：统计学，爱好：唱歌}	29	6	gmail.com
4	刘五	女	1992-05-24	2014-08-10	8500	17823117890	[email protected]	{教育：本科，专业：美术，爱好：}	28	6	qq.com
5	雷六	女	1986-12-10	2010-03-10	15000	13712345612	[email protected]	{教育：本科，专业：化学，爱好：钓鱼}	34	10	126.com
6	贾七	男	1993-04-10	2015-08-01	9000	13178734511	[email protected]	{教育：硕士，专业：物理，爱好：健身}	27	5	136.com
7	吴八	女	1988-07-19	2014-10-12	13500	17822335317	[email protected]	{教育：本科，专业：政治学，爱好：读书}	32	6	qq.com

# 隐藏电话号码中间四位数
# 字符串替换，巧用了匿名函数lambda
employee_info.tel = employee_info.tel.apply(func = lambda x: x.replace(x[3:7],‘****‘))
employee_info

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	gender	birthday	start_work	income	tel	email	other	age	workage	email_domain
0	赵一	男	1989-08-10	2012-09-08	15000	136****1234	[email protected]	{教育：本科，专业：电子商务，爱好：运动}	31	8	qq.com
1	王二	男	1990-10-02	2014-03-06	12500	135****2234	[email protected]	{教育：大专，专业：汽修，爱好：}	30	6	163.com
2	张三	女	1987-03-12	2009-01-08	18500	135****3330	[email protected]	{教育：本科，专业：数学，爱好：打篮球}	33	11	qq.com
3	李四	女	1991-08-16	2014-06-04	13000	139****3388	[email protected]	{教育：硕士，专业：统计学，爱好：唱歌}	29	6	gmail.com
4	刘五	女	1992-05-24	2014-08-10	8500	178****7890	[email protected]	{教育：本科，专业：美术，爱好：}	28	6	qq.com
5	雷六	女	1986-12-10	2010-03-10	15000	137****5612	[email protected]	{教育：本科，专业：化学，爱好：钓鱼}	34	10	126.com
6	贾七	男	1993-04-10	2015-08-01	9000	131****4511	[email protected]	{教育：硕士，专业：物理，爱好：健身}	27	5	136.com
7	吴八	女	1988-07-19	2014-10-12	13500	178****5317	[email protected]	{教育：本科，专业：政治学，爱好：读书}	32	6	qq.com

# 根据other 字段提取每个人的专业信息
# 用正则表达式匹配专业字段，主要在匹配时 ： 和 ， 均为中文输入法，英文无法匹配的
employee_info[‘profession‘] = employee_info.other.str.findall(‘专业：(.*?)，‘)

# findall 提取出来的数据带[] 去除[]
employee_info.profession = employee_info.profession.astype(‘str‘)
employee_info.profession = employee_info.profession.apply(func = lambda x: x.replace(x[:],x[1:-1]))
employee_info.profession = employee_info.profession.apply(func = lambda x: x.replace(x[0],‘ ‘))
employee_info.head()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	gender	birthday	start_work	income	tel	email	other	age	workage	email_domain	profession
0	赵一	男	1989-08-10	2012-09-08	15000	136****1234	[email protected]	{教育：本科，专业：电子商务，爱好：运动}	31	8	qq.com	电子商务
1	王二	男	1990-10-02	2014-03-06	12500	135****2234	[email protected]	{教育：大专，专业：汽修，爱好：}	30	6	163.com	汽修
2	张三	女	1987-03-12	2009-01-08	18500	135****3330	[email protected]	{教育：本科，专业：数学，爱好：打篮球}	33	11	qq.com	数学
3	李四	女	1991-08-16	2014-06-04	13000	139****3388	[email protected]	{教育：硕士，专业：统计学，爱好：唱歌}	29	6	gmail.com	统计学
4	刘五	女	1992-05-24	2014-08-10	8500	178****7890	[email protected]	{教育：本科，专业：美术，爱好：}	28	6	qq.com	美术

# 剔除birthday,start_work和other变量
# 需要将axis参数设置为1，因为默认drop方法是用来删除数据框中的行记录。
employee_info.drop([‘birthday‘,‘start_work‘,‘other‘], axis=1,inplace=True)
employee_info

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	gender	income	tel	email	age	workage	email_domain	profession
0	赵一	男	15000	136****1234	[email protected]	31	8	qq.com	电子商务
1	王二	男	12500	135****2234	[email protected]	30	6	163.com	汽修
2	张三	女	18500	135****3330	[email protected]	33	11	qq.com	数学
3	李四	女	13000	139****3388	[email protected]	29	6	gmail.com	统计学
4	刘五	女	8500	178****7890	[email protected]	28	6	qq.com	美术
5	雷六	女	15000	137****5612	[email protected]	34	10	126.com	化学
6	贾七	男	9000	131****4511	[email protected]	27	5	136.com	物理
7	吴八	女	13500	178****5317	[email protected]	32	6	qq.com	政治学

5. Pandas 数据清洗

在数据处理过程中，一般都需要进行数据的清洗工作，数据清洗过程主要负责看数据集是否存在重复、是否存在缺失、数据是否具有完整性和一致性、数据中是否存在异常值等。这些问题都不利于数据分析，需要加以处理。

5.1 重复观测处理

在搜集数据过程中，可能会存在重复观测的出现，例如通过网络爬虫，就比较容易产生重复数据。

上面的数据就是通过爬虫获得某APP市场中电商类APP的下载量数据（部分），通过肉眼，是能够发现这10行数据中的重复项的，例如，唯品会出现了两次、当当出现了三次。如果搜集上来的数据不是10行，而是10万行，甚至更多时，就无法通过肉眼的方式检测数据是否存在重复项了。

import pandas as pd 

dsapp = pd.read_excel(r"E:/Data/3/data_test04.xlsx")

# 重复观测检测
#使用duplicated方法进行验证，但是该方法返回的是数据集每一行的检验结果，即10行数据会返回10个bool值。
# 加any进行判断，只要有一个检测到，就代表有

print("是否存在重复观测:\n",any(dsapp.duplicated()))

是否存在重复观测:
 True

#删除重复数据

dsapp.drop_duplicates(inplace=True)
dsapp

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	appcategory	appname	comments	install	love	size	update
0	网上购物-商城-团购-优惠-快递	每日优鲜	1297	204.7万	89.00%	15.16MB	2017年10月11日
1	网上购物-商城	苏宁易购	577	7996.8万	73.00%	58.9MB	2017年09月21日
2	网上购物-商城-优惠	唯品会	2543	7090.1万	86.00%	41.43MB	2017年10月13日
4	网上购物-商城	拼多多	1921	3841.9万	95.00%	13.35MB	2017年10月11日
5	网上购物-商城-优惠	寺库奢侈品	1964	175.4万	100.00%	17.21MB	2017年09月30日
6	网上购物-商城	淘宝	14244	4.6亿	68.00%	73.78MB	2017年10月13日
7	网上购物-商城-团购-优惠	当当	134	1615.3万	61.00%	37.01MB	2017年10月17日

5.2 缺失值处理

缺失值是指数据集中的某些观测存在遗漏的指标值，缺失值的存在同样会影响到数据分析和挖掘的结果。导致观测的缺失可能有两方面原因，一方面是人为原因（如记录过程中的遗漏、个人隐私而不愿透露等），另一方面是机器或设备的故障所导致（如断电或设备老化等原因）。一般而言，当遇到缺失值（Python中用NaN表示）时，可以采用三种方法处置，分别是删除法、替换法和插补法。

上面的数据来自于某游戏公司的用户注册信息（仅以10行记录为例，该数据集中存在4条红色标注的缺失观测。

import pandas as pd 

Game_user = pd.read_excel(r"E:\Data\3\data_test05.xlsx")
print("数据集是否存在缺失值:\n",any(Game_user.isnull()))

数据集是否存在缺失值:
 True

5.2.1 删除法

# 1. 删除法处理
# 在副本上删除所有缺失的行，Game_user 不变
Game_user.dropna()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	uid	regit_date	gender	age	income
0	81200457	2016-10-30	M	23.0	6500.0
1	81201135	2016-11-08	M	27.0	10300.0
3	84639281	2017-04-17	M	26.0	6000.0
6	63881943	2015-10-07	M	21.0	10000.0
8	77638351	2016-07-12	M	25.0	18000.0

# 删除缺失值最多的那一列变量
Game_user.drop(‘age‘,axis=1)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	uid	regit_date	gender	income
0	81200457	2016-10-30	M	6500.0
1	81201135	2016-11-08	M	10300.0
2	80043782	2016-10-13	F	13500.0
3	84639281	2017-04-17	M	6000.0
4	73499801	2016-03-21	NaN	4500.0
5	72399510	2016-01-18	M	NaN
6	63881943	2015-10-07	M	10000.0
7	35442690	2015-04-10	F	5800.0
8	77638351	2016-07-12	M	18000.0
9	85200189	2017-05-18	M	NaN

5.2.2 替换法

# 2. 替换法
# 缺失值用前一行值填充
Game_user.fillna(method=‘ffill‘)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	uid	regit_date	gender	age	income
0	81200457	2016-10-30	M	23.0	6500.0
1	81201135	2016-11-08	M	27.0	10300.0
2	80043782	2016-10-13	F	27.0	13500.0
3	84639281	2017-04-17	M	26.0	6000.0
4	73499801	2016-03-21	M	26.0	4500.0
5	72399510	2016-01-18	M	19.0	4500.0
6	63881943	2015-10-07	M	21.0	10000.0
7	35442690	2015-04-10	F	21.0	5800.0
8	77638351	2016-07-12	M	25.0	18000.0
9	85200189	2017-05-18	M	22.0	18000.0

# 缺失值用后一行填充
Game_user.fillna(method=‘bfill‘)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	uid	regit_date	gender	age	income
0	81200457	2016-10-30	M	23.0	6500.0
1	81201135	2016-11-08	M	27.0	10300.0
2	80043782	2016-10-13	F	26.0	13500.0
3	84639281	2017-04-17	M	26.0	6000.0
4	73499801	2016-03-21	M	19.0	4500.0
5	72399510	2016-01-18	M	19.0	10000.0
6	63881943	2015-10-07	M	21.0	10000.0
7	35442690	2015-04-10	F	25.0	5800.0
8	77638351	2016-07-12	M	25.0	18000.0
9	85200189	2017-05-18	M	22.0	NaN

method参数可以接受‘ffill‘和‘bfill‘两种值，分别代表前向填充和后向填充。前向填充是指用缺失值的前一个值替换（如左表所示），而后向填充则表示用缺失值的后一个值替换（如右表所示）。右表中的最后一个记录仍包含缺失值，是因为后向填充法找不到该缺失值的后一个值用于替换。缺失值的前向填充或后向填充一般适用于时间序列型的数据集，因为这样的数据前后具有连贯性，而一般的独立性样本并不适用该方法。

# 常数替换
Game_user.fillna(value = 0)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	uid	regit_date	gender	age	income
0	81200457	2016-10-30	M	23.0	6500.0
1	81201135	2016-11-08	M	27.0	10300.0
2	80043782	2016-10-13	F	0.0	13500.0
3	84639281	2017-04-17	M	26.0	6000.0
4	73499801	2016-03-21	0	0.0	4500.0
5	72399510	2016-01-18	M	19.0	0.0
6	63881943	2015-10-07	M	21.0	10000.0
7	35442690	2015-04-10	F	0.0	5800.0
8	77638351	2016-07-12	M	25.0	18000.0
9	85200189	2017-05-18	M	22.0	0.0

# 统计值替换
Game_user.fillna(value= {‘gender‘:Game_user.gender.mode()[0],
                         ‘age‘:Game_user.age.median(),
                        ‘income‘:Game_user.income.mean()})

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	uid	regit_date	gender	age	income
0	81200457	2016-10-30	M	23.0	6500.0
1	81201135	2016-11-08	M	27.0	10300.0
2	80043782	2016-10-13	F	23.0	13500.0
3	84639281	2017-04-17	M	26.0	6000.0
4	73499801	2016-03-21	M	23.0	4500.0
5	72399510	2016-01-18	M	19.0	9325.0
6	63881943	2015-10-07	M	21.0	10000.0
7	35442690	2015-04-10	F	23.0	5800.0
8	77638351	2016-07-12	M	25.0	18000.0
9	85200189	2017-05-18	M	22.0	9325.0

如上代码并没有实际改变df数据框的结果，因为dropna、drop和fillna方法并没有使inplace参数设置为True。可以在实际的学习和工作中挑选一个适当的缺失值处理方法，然后将该方法中的inplace参数设置为True，进而可以真正地改变你所处理的数据集。

5.3 异常值处理

异常值是指那些远离正常值的观测，即“不合群”观测。导致异常值的出现一般是人为的记录错误或者是设备的故障等，异常值的出现会对模型的创建和预测产生严重的后果。当然异常值也不一定都是坏事，有些情况下，通过寻找异常值就能够给业务带来良好的发展，如销毁“钓鱼”网站、关闭“薅羊毛”用户的权限等。

这两种方法的选择标准如下，如果数据近似服从正态分布时，优先选择n个标准差法，因为数据的分布相对比较对称；否则优先选择箱线图法，因为分位数并不会受到极端值的影响。当数据存在异常时，一般可以使用删除法将异常值删除（前提是异常观测的比例不能太大）、替换法（可以考虑使用低于判别上限的最大值或高于判别下限的最小值替换、使用均值或中位数替换等）。下面将以年为单位的太阳黑子个数为例（时间范围：1700—1988），识别并处理异常值：

# 1. 看两种方法是否都存在异常值

import pandas as pd

#数据读入
sunspots = pd.read_table(r"E:\Data\3\sunspots.csv", sep = ‘,‘)

# 标准差法检测异常值
xmean = sunspots.counts.mean()
xstd = sunspots.counts.std()
print("标准差法异常上限检测:\n",any(sunspots > xmean + 2 * xstd))
print("标准差法异常下限检测:\n",any(sunspots < xmean - 2 * xstd))

# 箱线法检测异常值
Q1 = sunspots.counts.quantile(q = 0.25)
Q3 = sunspots.counts.quantile(q = 0.75)
IQR = Q3 - Q1
print("箱线法异常上限检测:\n",any(sunspots  > Q3 + 1.5 * IQR ))
print("箱线法异常下限检测:\n",any(sunspots < Q1 - 1.5 * IQR ))

标准差法异常上限检测:
 True
标准差法异常下限检测:
 True
箱线法异常上限检测:
 True
箱线法异常下限检测:
 True

不管是标准差检验法还是箱线图检验法，都发现太阳黑子数据中存在异常值，而且异常值都是超过上限临界值的。接下来，通过绘制太阳黑子数量的直方图和核密度曲线图，用于检验数据是否近似服从正态分布，进而选择一个最终的异常值判别方法：

# 2. 绘制直方图和核密度曲线图，确定采用何种异常判别法。
import matplotlib.pyplot as plt

plt.style.use(‘ggplot‘)
sunspots.counts.plot(kind = ‘hist‘, bins =30, normed = True)
sunspots.counts.plot(kind = ‘kde‘)
plt.show()

很明显，不管是直方图还是核密度曲线，都不服从正太分布，所以采用箱线法进行异常检测，此处也选用替换法来处理异常值。

# 替换法处理异常值：
print("异常值替换前的数据统计特征:\n", sunspots.counts.describe())
print("\n")

#箱线图中的异常值判别上限
UL = Q3 + 1.5 * IQR
print("判别异常值的上限临界值:\n", UL)

# 从数据中找出低于判别上限的最大值
replace_value = sunspots.counts[sunspots.counts < UL].max()
sunspots.counts[sunspots.counts > UL] = replace_value
print("\n")
print("异常值替换后的数据统计特征:\n", sunspots.counts.describe())

异常值替换前的数据统计特征:
 count    289.000000
mean      48.613495
std       39.474103
min        0.000000
25%       15.600000
50%       39.000000
75%       68.900000
max      190.200000
Name: counts, dtype: float64

判别异常值的上限临界值:
 148.85000000000002

异常值替换后的数据统计特征:
 count    289.000000
mean      48.066090
std       37.918895
min        0.000000
25%       15.600000
50%       39.000000
75%       68.900000
max      141.700000
Name: counts, dtype: float64

D:\Anaconda\lib\site-packages\ipykernel_launcher.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()

6. 获取数据子集

有时数据读入后并不是对整体数据进行分析，而是数据中的部分子集，例如，对于地铁乘客量可能只关心某些时间段的流量、对于商品的交易可能只需要分析某些颜色的价格变动、对于医疗诊断数据可能只对某个年龄段的人群感兴趣等。

在Pandas模块中实现数据框子集的获取可以使用iloc、loc和ix三种“方法”，这三种方法既可以对数据行进行筛选，也可以实现变量的挑选它们的语法可以表示成[rows_select,cols_select]。

df1 = pd.DataFrame({‘name‘:[‘张三‘,‘李四‘,‘王二‘,‘丁一‘,‘李五‘],
                    ‘gender‘:[‘男‘,‘女‘,‘女‘,‘女‘,‘男‘],
                    ‘age‘:[23,26,22,25,27]},
                    columns = [‘name‘,‘gender‘,‘age‘])
df1

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	gender	age
0	张三	男	23
1	李四	女	26
2	王二	女	22
3	丁一	女	25
4	李五	男	27

# 取出所有女性的姓名和年龄
df1.iloc[1:4,[0,2]]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	age
1	李四	26
2	王二	22
3	丁一	25

df1.loc[1:3,[‘name‘,‘age‘]]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	age
1	李四	26
2	王二	22
3	丁一	25

df1.ix[1:3,[0,2]]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	age
1	李四	26
2	王二	22
3	丁一	25

# 假如数据没有行号
df2 = df1.set_index(‘name‘)
df2

df2.iloc[1:4,:]
df2.loc[[‘李四‘,‘王二‘,‘丁一‘],:]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	gender	age
name
李四	女	26
王二	女	22
丁一	女	25

# df2.ix[1:4,:]

在上面的df1数据集中，如何返回所有男性的姓名和年龄,如果是基于条件的记录筛选，只能使用loc和ix两种方法。正如代码所示，对iloc方法的那行代码做注释，是因为iloc不允许使用条件筛选。

# 使用筛选条件，取出所有男性的姓名和年龄
# df1.iloc[df1.gender == ‘男‘,]
df1.loc[df1.gender == ‘男‘,[‘name‘,‘age‘]]
df1.ix[df1.gender == ‘男‘,[‘name‘,‘age‘]]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	name	age
0	张三	23
4	李五	27

7. 透视表、合并与连接、分组聚合

7.1 透视表

Pandas模块提供了实现透视表功能的pivot_table函数，*该功能的主要目的就是实现数据的汇总统计。例如，按照某个分组变量统计商品的平均价格、销售数量、最大利润等，或者按照某两个分组变量构成统计学中的列联表（计数统计），甚至是基于多个分组变量统计各组合下的均值、中位数、总和等。

data：指定需要构造透视表的数据集。
values：指定需要拉入“数值”框的字段列表。
index：指定需要拉入“行标签”框的字段列表。
columns：指定需要拉入“列标签”框的字段列表。
aggfunc：指定数值的统计函数，默认为统计均值，也可以指定numpy模块中的其他统计函数。
fill_value：指定一个标量，用于填充缺失值。
margins：bool类型参数，是否需要显示行或列的总计值，默认为False。
dropna：bool类型参数，是否需要删除整列为缺失的字段，默认为True。
margins_name：指定行或列的总计名称，默认为All。

# 数据读取
diamonds = pd.read_table(r"E:/Data/3/diamonds.csv", sep = ‘,‘)

# 单个分组变量的均值统计
pd.pivot_table(data= diamonds, index = ‘color‘,values = ‘price‘,
               margins=True, margins_name=‘总计‘)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	price
color
D	3169.954096
E	3076.752475
F	3724.886397
G	3999.135671
H	4486.669196
I	5091.874954
J	5323.818020
总计	3932.799722

import numpy as np
pd.pivot_table(data=diamonds, index=‘clarity‘, columns=‘cut‘, values = ‘carat‘,
               aggfunc= np.size, margins=True, margins_name=‘总计‘)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

cut	Fair	Good	Ideal	Premium	Very Good	总计
clarity
I1	210.0	96.0	146.0	205.0	84.0	741.0
IF	9.0	71.0	1212.0	230.0	268.0	1790.0
SI1	408.0	1560.0	4282.0	3575.0	3240.0	13065.0
SI2	466.0	1081.0	2598.0	2949.0	2100.0	9194.0
VS1	170.0	648.0	3589.0	1989.0	1775.0	8171.0
VS2	261.0	978.0	5071.0	3357.0	2591.0	12258.0
VVS1	17.0	186.0	2047.0	616.0	789.0	3655.0
VVS2	69.0	286.0	2606.0	870.0	1235.0	5066.0
总计	1610.0	4906.0	21551.0	13791.0	12082.0	53940.0

7.2 多表合并

Pandas模块同样提供了关于多表之间的合并操作concat函数:

objs：指定需要合并的对象，可以是序列、数据框或面板数据构成的列表。
axis：指定数据合并的轴，默认为0，表示合并多个数据的行，如果为1，就表示合并多个数据的列。
join：指定合并的方式，默认为outer，表示合并所有数据，如果改为inner，表示合并公共部分的数据。
join_axes：合并数据后，指定保留的数据轴。
ignore_index：bool类型的参数，表示是否忽略原数据集的索引，默认为False，如果设True，就表示忽略原索引并生成新索引。
keys：为合并后的数据添加新索引，用于区分各个数据部分。

df1 = pd.DataFrame({‘name‘:[‘张三‘,‘李四‘,‘王二‘],
                    ‘gender‘:[‘男‘,‘女‘,‘女‘],
                    ‘age‘:[23,26,22,]} )
df2 = pd.DataFrame({‘name‘:[‘丁一‘,‘李五‘],
                    ‘gender‘:[‘女‘,‘男‘],
                    ‘age‘:[25,27]} )
pd.concat([df1,df2],keys = [‘df1‘,‘df2‘], axis=0)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

		age	gender	name
df1	0	23	男	张三
	1	26	女	李四
	2	22	女	王二
df2	0	25	女	丁一
df2	1	27	男	李五

df2 = pd.DataFrame({‘Name‘:[‘丁一‘,‘李五‘],
                    ‘gender‘:[‘女‘,‘男‘],
                    ‘age‘:[25,27]} )
pd.concat([df1,df2],keys = [‘df1‘,‘df2‘])

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

		Name	age	gender	name
df1	0	NaN	23	男	张三
	1	NaN	26	女	李四
	2	NaN	22	女	王二
df2	0	丁一	25	女	NaN
df2	1	李五	27	男	NaN

7.3 多表连接

Pandas模块同样提供了关于多表之间的连接操作merge函数，函数的最大缺点是，每次只能操作两张数据表的连接，如果有n张表需要连接，则必须经过n-1次的merge函数使用。

left：指定需要连接的主表。
right：指定需要连接的辅表。
how：指定连接方式，默认为inner内连，还有其他选项，如左连left、右连right和外连
outer。on：指定连接两张表的共同字段。
left_on：指定主表中需要连接的共同字段。
right_on：指定辅表中需要连接的共同字段。
left_index：bool类型参数，是否将主表中的行索引用作表连接的共同字段，默认为False。
right_index：bool类型参数，是否将辅表中的行索引用作表连接的共同字段，默认为False。
sort：bool类型参数，是否对连接后的数据按照共同字段排序，默认为False。
suffixes：如果数据连接的结果中存在重叠的变量名，则使用各自的前缀进行区分。

# 构造数据集
df3 = pd.DataFrame({‘id‘:[1,2,3,4,5],‘name‘:[‘张三‘,‘李四‘,‘王二‘,‘丁一‘,‘赵五‘],‘age‘:[27,24,25,23,25],‘gender‘:[‘男‘,‘男‘,‘男‘,‘女‘,‘女‘]})
df4 = pd.DataFrame({‘Id‘:[1,2,2,4,4,4,5],‘kemu‘:[‘科目1‘,‘科目1‘,‘科目2‘,‘科目1‘,‘科目2‘,‘科目3‘,‘科目1‘],‘score‘:[83,81,87,75,86,74,88]})
df5 = pd.DataFrame({‘id‘:[1,3,5],‘name‘:[‘张三‘,‘王二‘,‘赵五‘],‘income‘:[13500,18000,15000]})

# 三表的数据连接
# 首先df3和df4连接
merge1 = pd.merge(left = df3, right = df4, how = ‘left‘, left_on=‘id‘, right_on=‘Id‘)
merge1

# 再将连接结果与df5连接
merge2 = pd.merge(left = merge1, right = df5, how = ‘left‘)
merge2

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

	age	gender	id	name	Id	kemu	score	income
0	27	男	1	张三	1.0	科目1	83.0	13500.0
1	24	男	2	李四	2.0	科目1	81.0	NaN
2	24	男	2	李四	2.0	科目2	87.0	NaN
3	25	男	3	王二	NaN	NaN	NaN	18000.0
4	23	女	4	丁一	4.0	科目1	75.0	NaN
5	23	女	4	丁一	4.0	科目2	86.0	NaN
6	23	女	4	丁一	4.0	科目3	74.0	NaN
7	25	女	5	赵五	5.0	科目1	88.0	15000.0

7.4 分组聚合

# 通过groupby方法，指定分组变量
grouped = diamonds.groupby(by = [‘color‘,‘cut‘])

# 对分组变量进行统计汇总
result = grouped.aggregate({‘color‘:np.size, ‘carat‘:np.min, ‘price‘:np.mean})

# 调整变量名的顺序
result = pd.DataFrame(result, columns=[‘color‘,‘carat‘,‘price‘])

# 数据集重命名
result.rename(columns={‘color‘:‘counts‘,‘carat‘:‘min_weight‘,‘price‘:‘avg_price‘}, inplace=True)

# 将行索引变量数据框的变量
# result.reset_index(inplace=True)
result

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

		counts	min_weight	avg_price
color	cut
D	Fair	163	0.25	4291.061350
	Good	662	0.23	3405.382175
	Ideal	2834	0.20	2629.094566
	Premium	1603	0.20	3631.292576
	Very Good	1513	0.23	3470.467284
E	Fair	224	0.22	3682.312500
	Good	933	0.23	3423.644159
	Ideal	3903	0.20	2597.550090
	Premium	2337	0.20	3538.914420
	Very Good	2400	0.20	3214.652083
F	Fair	312	0.25	3827.003205
	Good	909	0.23	3495.750275
	Ideal	3826	0.23	3374.939362
	Premium	2331	0.20	4324.890176
	Very Good	2164	0.23	3778.820240
G	Fair	314	0.23	4239.254777
	Good	871	0.23	4123.482204
	Ideal	4884	0.23	3720.706388
	Premium	2924	0.23	4500.742134
	Very Good	2299	0.23	3872.753806
H	Fair	303	0.33	5135.683168
	Good	702	0.25	4276.254986
	Ideal	3115	0.23	3889.334831
	Premium	2360	0.23	5216.706780
	Very Good	1824	0.23	4535.390351
I	Fair	175	0.41	4685.445714
	Good	522	0.30	5078.532567
	Ideal	2093	0.23	4451.970377
	Premium	1428	0.23	5946.180672
	Very Good	1204	0.24	5255.879568
J	Fair	119	0.30	4975.655462
	Good	307	0.28	4574.172638
	Ideal	896	0.23	4918.186384
	Premium	808	0.30	6294.591584
	Very Good	678	0.24	5103.513274

原文地址：https://www.cnblogs.com/sinlearn/p/12665753.html

时间： 2024-08-27 00:37:16

python 数据分析--数据处理工具Pandas(2)

4. Pandas处理字符串和日期数据

5. Pandas 数据清洗

5.1 重复观测处理

5.2 缺失值处理

5.2.1 删除法

5.2.2 替换法

5.3 异常值处理

6. 获取数据子集

7. 透视表、合并与连接、分组聚合

7.1 透视表

7.2 多表合并

7.3 多表连接

7.4 分组聚合

python 数据分析--数据处理工具Pandas(2)的相关文章

python 数据分析--数据处理工具Pandas（1）

Python大数据处理模块Pandas

Python——数据分析，Numpy,Pandas,matplotlib

《Python 数据分析》笔记——pandas

Python 数据分析包：pandas 基础

快速学习 Python 数据分析包之 pandas

python数据分析之numpy+pandas+scipy+matplotlib+scikit-learn安装

学习Python大数据处理模块Pandas

python数据分析-数据处理