Pandas Cookbook -- 06 Index Alignment

Index Alignment

Based on SeanCheney's translation on Jianshu; I have made some formatting adjustments and reorganized the table of contents so that it is easier to read and to look things up later.

import pandas as pd
import numpy as np

1 Index methods

college = pd.read_csv('data/college.csv')
college.iloc[:5, :5]


INSTNM CITY STABBR HBCU MENONLY
0 Alabama A & M University Normal AL 1.0 0.0
1 University of Alabama at Birmingham Birmingham AL 0.0 0.0
2 Amridge University Montgomery AL 0.0 0.0
3 University of Alabama in Huntsville Huntsville AL 0.0 0.0
4 Alabama State University Montgomery AL 1.0 0.0

1.1 Basic methods

Extract the column index (all of the column labels)

columns = college.columns
columns
Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

Use the values attribute to access the underlying NumPy array

columns.values
array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

Select the sixth value of the index (integer position 5)

columns[5]
'WOMENONLY'

Select the 2nd, 9th, and 11th values (integer positions 1, 8, and 10)

columns[[1,8,10]]
Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')

Select a slice using negative integer positions

columns[-7:-4]
Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')

An Index shares many methods with Series and DataFrame

columns.min(), columns.max(), columns.isnull().sum()
('CITY', 'WOMENONLY', 0)

An Index can be modified with vectorized operations such as string concatenation, which return a new copy. The Index object itself is immutable; trying to modify it in place raises an error (see the sketch after the output below).

columns + '_A'
Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')
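
The immutability claim can be checked directly. A minimal sketch (assuming the columns Index loaded earlier); in current pandas versions an in-place assignment raises a TypeError:

try:
    columns[1] = 'city'        # Index objects do not support item assignment
except TypeError as e:
    print('TypeError:', e)     # message is roughly "Index does not support mutable operations"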

Comparison operators also work on an Index, producing a boolean array

columns > 'G'
array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

1.2 Set operations

Index objects support set operations: union, intersection, difference, and symmetric difference. Union and symmetric difference are demonstrated below; a short sketch of intersection and difference follows in 1.2.3.

c1 = columns[:4]
c2 = columns[2:5]

1.2.1 Union

c1.union(c2)
Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')
c1 | c2
Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')

1.2.2 Symmetric difference

c1.symmetric_difference(c2)
Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
c1 ^ c2
Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
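
1.2.3 Intersection and difference

Intersection and difference are listed above but not shown in the original; a hedged sketch using the same c1 and c2 (the exact ordering of the returned Index may vary by pandas version):

c1.intersection(c2)    # labels present in both: 'STABBR', 'HBCU'
c1.difference(c2)      # labels in c1 but not in c2: 'INSTNM', 'CITY'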

2 Cartesian products and exploding indexes

Create two Series with different indexes that share some of the same labels

s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s2 = pd.Series(index=list('cababb'), data=np.arange(6))

2.1 Producing a Cartesian product

A Cartesian product is avoided only when the two indexes contain exactly the same labels in exactly the same order; in every other case, alignment produces a Cartesian product of the matching labels

s1 + s2
a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64
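
The 10-row result can be accounted for by counting label occurrences: a label shared by both Series contributes the product of its two counts, and a label present on only one side contributes its own count. A minimal sketch of that bookkeeping, using the s1 and s2 defined above:

vc1 = s1.index.value_counts()        # a: 3, b: 1
vc2 = s2.index.value_counts()        # b: 3, a: 2, c: 1
vc1.mul(vc2, fill_value=1).sum()     # 3*2 + 1*3 + 1*1 = 10, matching len(s1 + s2)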

2.2 Avoiding the Cartesian product

When the indexes are identical, the data align one-to-one by label. In the example below the two Series are exactly the same, and the result keeps the integer dtype

s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('aaabb'), data=np.arange(5))
s1 + s2
a    0
a    2
a    4
b    6
b    8
dtype: int64

If the indexes contain the same labels but in a different order, a Cartesian product is still produced

s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))
s1 + s2
a    2
a    3
a    4
a    3
a    4
a    5
a    4
a    5
a    6
b    3
b    4
b    4
b    5
dtype: int64

2.2.1 Exploding indexes

Read the employee dataset, setting RACE as the row index

employee = pd.read_csv('data/employee.csv', index_col='RACE')

2.2.1.1 Preparing the data

The copy method -- copy the data rather than create a reference to it

salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']
salary1 is salary2
True

The result is True, meaning both names point to the same object: modifying one would change the other. To get an entirely new copy of the data, use the copy method

salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2
False

2.2.1.2 Indexes in a different order

salary1 = salary1.sort_index()
salary_add1 = salary1 + salary2
salary_add1.shape
(1175424,)

2.2.1.3 Identical indexes

salary_add2 = salary1 + salary1

2.2.1.4 Comparing the results

len(salary1), len(salary2), len(salary_add1), len(salary_add2)
(2000, 2000, 1175424, 2000)

Comparing the lengths shows that the misaligned addition exploded from 2,000 rows to 1,175,424

2.2.1.5 Verifying the result

Because the Cartesian product is formed over identical index labels, the result length can be verified by summing the squared counts of each label

index_vc = salary2.index.value_counts(dropna=False)
index_vc
Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64
index_vc.pow(2).sum()
1175424

3 Filling values when indexes are unequal

3.1 Series with different indexes

Read three baseball datasets, setting playerID as the row index

baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')
baseball_14.iloc[:5, :]


yearID stint teamID lgID G AB R H 2B 3B ... RBI SB CS BB SO IBB HBP SH SF GIDP
playerID
altuvjo01 2014 1 HOU AL 158 660 85 225 47 3 ... 59.0 56.0 9.0 36 53.0 7.0 5.0 1.0 5.0 20.0
cartech02 2014 1 HOU AL 145 507 68 115 21 1 ... 88.0 5.0 2.0 56 182.0 6.0 5.0 0.0 4.0 12.0
castrja01 2014 1 HOU AL 126 465 43 103 21 2 ... 56.0 1.0 0.0 34 151.0 1.0 9.0 1.0 3.0 11.0
corpoca01 2014 1 HOU AL 55 170 22 40 6 0 ... 19.0 0.0 0.0 14 37.0 0.0 3.0 1.0 2.0 3.0
dominma01 2014 1 HOU AL 157 564 51 121 17 0 ... 57.0 0.0 1.0 29 125.0 2.0 5.0 2.0 7.0 23.0

5 rows × 21 columns

3.1.1 Analyzing the indexes

Use the Index method difference to find which index labels are in baseball_14 but not in baseball_15 or baseball_16

baseball_14.index.difference(baseball_15.index)
Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')
baseball_14.index.difference(baseball_16.index)
Index(['cartech02', 'corpoca01', 'dominma01', 'fowlede01', 'grossro01',
       'guzmaje01', 'hoeslj01', 'krausma01', 'preslal01', 'singljo02',
       'villajo01'],
      dtype='object', name='playerID')

3.1.2 Analyzing the data

Find each player's hits over the last three seasons; the H column holds this data

hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']

Add hits_14 and hits_15 together

(hits_14 + hits_15).head()
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

3.1.3 Missing values

3.1.3.1 Missing values that can be repaired

congeha01 and corpoca01 each have a record in only one of the two seasons (congeha01 in 2015, corpoca01 in 2014), so their sums come out missing. Use the add method with the fill_value parameter to avoid producing missing values

hits_14.add(hits_15, fill_value=0).head()
playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

Now add the 2016 data as well

hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()
playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64
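
As a quick sanity check (sketched here, not part of the original output), filling with 0 at every step should leave no missing values in the combined Series:

hits_total.hasnans    # expected to be False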

3.1.3.2 Missing values that cannot be repaired

If an element is missing in both Series at the same label, the result of the addition is still missing even when fill_value is used; fill_value is only applied when the value is missing on one side

s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])
s.add(s1, fill_value=5)
a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64
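
If a numeric result is needed even for labels such as 'a' that are missing on both sides, one hedged workaround (not part of the original recipe) is to fill each Series before adding; note that this treats a doubly-missing value as 0 rather than leaving it missing, and the other sums change too because the per-side fill is now 0 instead of 5:

s.fillna(0).add(s1.fillna(0), fill_value=0)    # 'a' becomes 0.0 instead of NaN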

3.2 DataFrames with different indexes

Select a few columns from baseball_14

df_14 = baseball_14[['G', 'AB', 'R', 'H']]
df_14.head()


G AB R H
playerID
altuvjo01 158 660 85 225
cartech02 145 507 68 115
castrja01 126 465 43 103
corpoca01 55 170 22 40
dominma01 157 564 51 121

Also select a few columns from baseball_15, some shared with df_14 and some not

df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
df_15.head()


AB R H HR
playerID
altuvjo01 638 86 200 15
cartech02 391 50 78 24
castrja01 337 38 71 11
congeha01 201 25 46 11
correca01 387 52 108 22

When the two DataFrames are added, any row or column that does not align in both produces missing values. The highlight_null method of the style attribute can highlight them

(df_14 + df_15).head(10).style.highlight_null('yellow')


AB G H HR R
playerID
altuvjo01 1298 nan 425 nan 171
cartech02 898 nan 193 nan 118
castrja01 802 nan 174 nan 81
congeha01 nan nan nan nan nan
corpoca01 nan nan nan nan nan
correca01 nan nan nan nan nan
dominma01 nan nan nan nan nan
fowlede01 nan nan nan nan nan
gattiev01 nan nan nan nan nan
gomezca01 nan nan nan nan nan

Even with fill_value=0 some values are still missing, because certain row and column combinations simply do not exist in either input

df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')


AB G H HR R
playerID
altuvjo01 1298 158 425 15 171
cartech02 898 145 193 24 118
castrja01 802 126 174 11 81
congeha01 201 nan 46 11 25
corpoca01 170 55 40 nan 22
correca01 387 nan 108 22 52
dominma01 564 157 121 nan 51
fowlede01 434 116 120 nan 61
gattiev01 566 nan 139 27 66
gomezca01 149 nan 36 4 19

4 Appending columns from a different DataFrame

Read the employee data and select the DEPARTMENT and BASE_SALARY columns

employee = pd.read_csv('data/employee.csv')
dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]

4.1 Adding a Series with a unique index

Sort by DEPARTMENT and, within each department, by BASE_SALARY in descending order

dept_sal = dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False])

Use drop_duplicates to keep only the first row of each department; because of the sort above, that row holds the department's maximum salary (a groupby cross-check is sketched after the table below)

max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')
max_dept_sal.head()


DEPARTMENT BASE_SALARY
1494 Admn. & Regulatory Affairs 140416.0
149 City Controller's Office 64251.0
236 City Council 100000.0
647 Convention and Entertainment 38397.0
1500 Dept of Neighborhoods (DON) 89221.0
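
As a hedged cross-check of the sort/drop_duplicates approach (not in the original recipe), the per-department maximum can also be computed directly with groupby; the two results should agree:

dept_sal.groupby('DEPARTMENT')['BASE_SALARY'].max().head()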

Set DEPARTMENT as the row index on both DataFrames

max_dept_sal = max_dept_sal.set_index('DEPARTMENT')
employee = employee.set_index('DEPARTMENT')

Now that the row indexes contain matching values, a new column can be added to the employee DataFrame

employee['MAX_DEPT_SALARY'] = max_dept_sal['BASE_SALARY']
employee.head()


UNIQUE_ID POSITION_TITLE BASE_SALARY RACE EMPLOYMENT_TYPE GENDER EMPLOYMENT_STATUS HIRE_DATE JOB_DATE MAX_DEPT_SALARY
DEPARTMENT
Municipal Courts Department 0 ASSISTANT DIRECTOR (EX LVL) 121862.0 Hispanic/Latino Full Time Female Active 2006-06-12 2012-10-13 121862.0
Library 1 LIBRARY ASSISTANT 26125.0 Hispanic/Latino Full Time Female Active 2000-07-19 2010-09-18 107763.0
Houston Police Department-HPD 2 POLICE OFFICER 45279.0 White Full Time Male Active 2015-02-03 2015-02-03 199596.0
Houston Fire Department (HFD) 3 ENGINEER/OPERATOR 63166.0 White Full Time Male Active 1982-02-08 1991-05-25 210588.0
General Services Department 4 ELECTRICIAN 56347.0 White Full Time Male Active 1989-06-19 1994-10-22 89194.0

Now query can be used to check whether any BASE_SALARY exceeds MAX_DEPT_SALARY; the empty result below confirms that none does

employee.query('BASE_SALARY > MAX_DEPT_SALARY')


UNIQUE_ID POSITION_TITLE BASE_SALARY RACE EMPLOYMENT_TYPE GENDER EMPLOYMENT_STATUS HIRE_DATE JOB_DATE MAX_DEPT_SALARY
DEPARTMENT

4.2 Adding a Series with duplicate index labels

Use the sample method to draw 10 random rows from dept_sal without replacement

random_salary = dept_sal.sample(n=10).set_index('DEPARTMENT')
random_salary


BASE_SALARY
DEPARTMENT
Houston Police Department-HPD 86534.0
Fleet Management Department 49088.0
Houston Airport System (HAS) 76097.0
Houston Police Department-HPD 66614.0
Library 59748.0
Houston Airport System (HAS) 29286.0
Houston Police Department-HPD 61643.0
Houston Fire Department (HFD) 52644.0
Solid Waste Management 36712.0
Houston Police Department-HPD NaN

random_salary has duplicate index labels, so a label in the employee DataFrame would have to align with multiple labels in random_salary, and pandas cannot decide which value to use

Adding a RANDOM_SALARY column would therefore raise an error; the assignment is left commented out below, and a sketch of catching the exception follows

# employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
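
A hedged sketch of what triggering it looks like (the exact exception message varies by pandas version, but it is typically a ValueError about reindexing on a duplicate axis):

try:
    employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']
except ValueError as e:
    print('ValueError:', e)    # e.g. "cannot reindex from a duplicate axis"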

4.3 Adding a Series that covers only part of the index

Select the first three rows of max_dept_sal['BASE_SALARY'] and assign them to employee['MAX_SALARY2']

employee['MAX_SALARY2'] = max_dept_sal['BASE_SALARY'].head(3)
employee.MAX_SALARY2.value_counts()
140416.0    29
100000.0    11
64251.0      5
Name: MAX_SALARY2, dtype: int64

Because only three departments were given values, all other departments are missing in the result

employee.MAX_SALARY2.isnull().mean()
0.9775

Original source: https://www.cnblogs.com/shiyushiyu/p/9745648.html
