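Pandas Cookbook: Chapter 5, Boolean Indexing

Boolean Indexing

These notes are based on the translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the section structure to suit my own reading and to make things easier to look up later.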

import pandas as pd
import numpy as np

Set the maximum number of columns and rows to display

pd.set_option('display.max_columns', 5, 'display.max_rows', 5)

1 Boolean summary statistics

movie = pd.read_csv('data/movie.csv', index_col='movie_title')
movie.head()

                                            color  director_name      ...  aspect_ratio  movie_facebook_likes
movie_title
Avatar                                      Color  James Cameron      ...  1.78          33000
Pirates of the Caribbean: At World's End    Color  Gore Verbinski     ...  2.35          0
Spectre                                     Color  Sam Mendes         ...  2.35          85000
The Dark Knight Rises                       Color  Christopher Nolan  ...  2.35          164000
Star Wars: Episode VII - The Force Awakens  NaN    Doug Walker        ...  NaN           0

5 rows × 27 columns

1.1 Basic approach

Check whether each movie runs longer than two hours

movie_2_hours = movie['duration'] > 120
movie_2_hours.head(10)
movie_title
Avatar                                      True
Pirates of the Caribbean: At World's End    True
                                            ...
Avengers: Age of Ultron                     True
Harry Potter and the Half-Blood Prince      True
Name: duration, Length: 10, dtype: bool

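How many movies run longer than two hours

movie_2_hours.sum()
1039

Proportion of movies longer than two hours

movie_2_hours.mean()
0.2113506916192026

The duration column actually contains missing values, and comparing NaN with 120 yields False, so those rows are silently counted as "not longer than two hours". To get the true proportion, drop the missing values first

movie['duration'].dropna().gt(120).mean()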
0.21199755152009794
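
To see why the missing values matter, here is a minimal sketch on a made-up Series (the values are hypothetical and not taken from movie.csv). The NaN row compares as False, so it inflates the denominator unless it is dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; one value is missing
duration = pd.Series([100, 130, np.nan, 150])

# NaN > 120 evaluates to False, so the missing row counts as "not long"
naive_share = (duration > 120).mean()

# Dropping NaN first leaves only rows with a known duration
true_share = duration.dropna().gt(120).mean()

print(naive_share, true_share)
```

The first figure divides by four rows, the second by the three rows that actually have a duration.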

1.2 Summary statistics

Use describe() to summarize the boolean Series

movie_2_hours.describe()
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object

Proportions of False and True values

movie_2_hours.value_counts(normalize=True)
False    0.788649
True     0.211351
Name: duration, dtype: float64

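2 Boolean indexing

2.1 Boolean conditions

In Python, the bitwise operators (&, |, ~) have higher precedence than the comparison operators, so each comparison must be wrapped in parentheses before the conditions are combined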
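
The precedence rule can be checked on a tiny, hypothetical Series. This is only a sketch: it shows the parenthesized form working and the unparenthesized form failing because Python evaluates `2 & s` first.

```python
import pandas as pd

s = pd.Series([1, 5, 9])  # hypothetical data

# Parenthesized comparisons combine element-wise as intended
ok = (s > 2) & (s < 8)

# Without parentheses Python evaluates 2 & s first, leaving the chained
# comparison s > (2 & s) < 8, which needs the truth value of a whole Series
try:
    s > 2 & s < 8
    raised = False
except ValueError:
    raised = True

print(ok.tolist(), raised)
```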

2.1.1 Creating multiple boolean conditions

criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)
criteria3.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: title_year, dtype: bool

2.1.2 Combining the boolean conditions into one

criteria_final = criteria1 & criteria2 & criteria3
criteria_final.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

2.2 Boolean filtering

Create the first set of boolean conditions

crit_a1 = movie.imdb_score > 8
crit_a2 = movie.content_rating == 'PG-13'
crit_a3 = (movie.title_year < 2000) | (movie.title_year > 2009)
final_crit_a = crit_a1 & crit_a2 & crit_a3

Create the second set of boolean conditions

crit_b1 = movie.imdb_score < 5
crit_b2 = movie.content_rating == 'R'
crit_b3 = (movie.title_year >= 2000) & (movie.title_year <= 2010)
final_crit_b = crit_b1 & crit_b2 & crit_b3

Combine the two sets of conditions

final_crit_all = final_crit_a | final_crit_b
final_crit_all.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

Filter the data

movie[final_crit_all].head()

                            color  director_name      ...  aspect_ratio  movie_facebook_likes
movie_title
The Dark Knight Rises       Color  Christopher Nolan  ...  2.35          164000
The Avengers                Color  Joss Whedon        ...  1.85          123000
Captain America: Civil War  Color  Anthony Russo      ...  2.35          72000
Guardians of the Galaxy     Color  James Gunn         ...  2.35          96000
Interstellar                Color  Christopher Nolan  ...  2.35          349000

5 rows × 27 columns

Verify the filter

cols = ['imdb_score', 'content_rating', 'title_year']
movie_filtered = movie.loc[final_crit_all, cols]
movie_filtered.head(10)

                       imdb_score  content_rating  title_year
movie_title
The Dark Knight Rises  8.5         PG-13           2012.0
The Avengers           8.1         PG-13           2012.0
...                    ...         ...             ...
Sex and the City 2     4.3         R               2010.0
Rollerball             3.0         R               2002.0

10 rows × 3 columns

2.3 Comparison with label-based selection

college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')

2.3.1 A single label

In college2, STABBR is the row index, so rows can be selected with loc

college2.loc['TX'].head()

        INSTNM                        CITY        ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
STABBR
TX      Abilene Christian University  Abilene     ...  40200            25985
TX      Alvin Community College       Alvin       ...  34500            6750
TX      Amarillo College              Amarillo    ...  31700            10950
TX      Angelina College              Lufkin      ...  26900            PrivacySuppressed
TX      Angelo State University       San Angelo  ...  37700            21319.5

5 rows × 26 columns

In college, use boolean indexing to select all schools in Texas

college[college['STABBR'] == 'TX'].head()

      INSTNM                        CITY        ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
3610  Abilene Christian University  Abilene     ...  40200            25985
3611  Alvin Community College       Alvin       ...  34500            6750
3612  Amarillo College              Amarillo    ...  31700            10950
3613  Angelina College              Lufkin      ...  26900            PrivacySuppressed
3614  Angelo State University       San Angelo  ...  37700            21319.5

5 rows × 27 columns

Compare the speed of the two approaches

Method 1

%timeit college[college['STABBR'] == 'TX']
937 μs ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2

%timeit college2.loc['TX']
520 μs ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Setting the index has a cost of its own, so the label lookup only pays off when the index is reused for several selections

%timeit college2 = college.set_index('STABBR')
2.11 ms ± 185 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2.3.2 Multiple labels

Select several states, first with boolean indexing and then with labels

states = ['TX', 'CA', 'NY']
college[college['STABBR'].isin(states)]

      INSTNM                                             CITY            ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
192   Academy of Art University                          San Francisco   ...  36000            35093
193   ITT Technical Institute-Rancho Cordova             Rancho Cordova  ...  38800            25827.5
...   ...                                                ...             ...  ...              ...
7533  Bay Area Medical Academy - San Jose Satellite ...  San Jose        ...  NaN              PrivacySuppressed
7534  Excel Learning Center-San Antonio South            San Antonio     ...  NaN              12125

1704 rows × 27 columns

college2.loc[states].head()

        INSTNM                        CITY        ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
STABBR
TX      Abilene Christian University  Abilene     ...  40200            25985
TX      Alvin Community College       Alvin       ...  34500            6750
TX      Amarillo College              Amarillo    ...  31700            10950
TX      Angelina College              Lufkin      ...  26900            PrivacySuppressed
TX      Angelo State University       San Angelo  ...  37700            21319.5

5 rows × 26 columns
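
The two outputs above start with different rows because the approaches order their results differently. The sketch below reproduces that on a hypothetical five-row frame (`college_small` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical miniature version of the college data
college_small = pd.DataFrame({
    'INSTNM': ['A', 'B', 'C', 'D', 'E'],
    'STABBR': ['CA', 'TX', 'NY', 'FL', 'TX'],
})
states = ['TX', 'CA', 'NY']

# Boolean indexing keeps the original row order
by_bool = college_small[college_small['STABBR'].isin(states)]

# Label selection returns rows grouped in the order of the requested labels
by_label = college_small.set_index('STABBR').loc[states]

print(by_bool['INSTNM'].tolist())
print(by_label['INSTNM'].tolist())
```

Both results contain the same schools; only the row order differs.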

3 The query method

Use the query method to make boolean indexing more readable

# Read the employee data and decide which departments and columns to select
employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
# Build the query string and run the query method
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()

     UNIQUE_ID  DEPARTMENT                     GENDER  BASE_SALARY
61   61         Houston Fire Department (HFD)  Female  96668.0
136  136        Houston Police Department-HPD  Female  81239.0
367  367        Houston Police Department-HPD  Female  86534.0
474  474        Houston Police Department-HPD  Female  91181.0
513  513        Houston Police Department-HPD  Female  81239.0
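
For comparison, here is a sketch of the same kind of filter written both ways on a hypothetical four-row frame (`emp` and its department names are invented for illustration). It is meant to show that query is a readability aid rather than a different selection mechanism.

```python
import pandas as pd

# Hypothetical miniature version of the employee data
emp = pd.DataFrame({
    'DEPARTMENT': ['Police', 'Fire', 'Library', 'Police'],
    'GENDER': ['Female', 'Female', 'Female', 'Male'],
    'BASE_SALARY': [90000.0, 70000.0, 95000.0, 100000.0],
})
depts = ['Police', 'Fire']

# query() version: @depts refers to the Python variable defined above
via_query = emp.query(
    "DEPARTMENT in @depts and GENDER == 'Female' "
    "and 80000 <= BASE_SALARY <= 120000"
)

# The same filter written as explicit boolean conditions
via_bool = emp[
    emp['DEPARTMENT'].isin(depts)
    & (emp['GENDER'] == 'Female')
    & emp['BASE_SALARY'].between(80000, 120000)
]

print(via_query.equals(via_bool))
```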

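4 Unique and sorted indexes

4.1 Single-column index

college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')
college2.index.is_monotonic_increasing
False

Sort college2, store the result as another object, and check whether its index is sorted

college3 = college2.sort_index()
college3.index.is_monotonic_increasing
True

Use INSTNM as the row index and check whether the index is unique

college_unique = college.set_index('INSTNM')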
college_unique.index.is_unique
True
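
Sortedness and uniqueness are independent properties of an index. A minimal sketch on a hypothetical index of state codes:

```python
import pandas as pd

idx_unsorted = pd.Index(['TX', 'CA', 'TX', 'AL'])  # hypothetical labels
idx_sorted = idx_unsorted.sort_values()

print(idx_unsorted.is_monotonic_increasing)  # order check on the raw index
print(idx_sorted.is_monotonic_increasing)    # order check after sorting
print(idx_sorted.is_unique)                  # duplicates survive sorting
```

Sorting makes the index monotonic, but the repeated 'TX' label means it is still not unique.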

4.2 Composite index

Concatenate the CITY and STABBR columns into the row index, then sort it

college.index = college['CITY'] + ', ' + college['STABBR']
college = college.sort_index()
college.head()

              INSTNM                     CITY      ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
ARTESIA, CA   Angeles Institute          ARTESIA   ...  NaN              16850
Aberdeen, SD  Presentation College       Aberdeen  ...  35900            25000
Aberdeen, SD  Northern State University  Aberdeen  ...  33600            24847
Aberdeen, WA  Grays Harbor College       Aberdeen  ...  27000            11490
Abilene, TX   Hardin-Simmons University  Abilene   ...  38700            25864

5 rows × 27 columns

college.index.is_unique
False

Select all colleges in Miami, FL

Method 1

college.loc['Miami, FL'].head()

           INSTNM                                     CITY   ...  MD_EARN_WNE_P10    GRAD_DEBT_MDN_SUPP
Miami, FL  New Professions Technical Institute        Miami  ...  18700              8682
Miami, FL  Management Resources College               Miami  ...  PrivacySuppressed  12182
Miami, FL  Strayer University-Doral                   Miami  ...  49200              36173.5
Miami, FL  Keiser University- Miami                   Miami  ...  29700              26063
Miami, FL  George T Baker Aviation Technical College  Miami  ...  38600              PrivacySuppressed

5 rows × 27 columns

Method 2

crit1 = college['CITY'] == 'Miami'
crit2 = college['STABBR'] == 'FL'
college[crit1 & crit2]

           INSTNM                               CITY   ...  MD_EARN_WNE_P10    GRAD_DEBT_MDN_SUPP
Miami, FL  New Professions Technical Institute  Miami  ...  18700              8682
Miami, FL  Management Resources College         Miami  ...  PrivacySuppressed  12182
...        ...                                  ...    ...  ...                ...
Miami, FL  Advanced Technical Centers           Miami  ...  PrivacySuppressed  PrivacySuppressed
Miami, FL  Lindsey Hopkins Technical College    Miami  ...  29800              PrivacySuppressed

50 rows × 27 columns

5 Using booleans in loc/iloc

movie = pd.read_csv('data/movie.csv', index_col='movie_title')

5.1 Rows

c1 = movie['content_rating'] == 'G'
c2 = movie['imdb_score'] < 4
criteria = c1 & c2
bool_movie = movie[criteria]
bool_movie

                                color  director_name      ...  aspect_ratio  movie_facebook_likes
movie_title
The True Story of Puss'N Boots  Color  Jérôme Deschamps   ...  NaN           90
Doogal                          Color  Dave Borthwick     ...  1.85          346
...                             ...    ...                ...  ...           ...
Justin Bieber: Never Say Never  Color  Jon M. Chu         ...  1.85          62000
Sunday School Musical           Color  Rachel Goldenberg  ...  1.85          777

6 rows × 27 columns

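Using booleans with loc

Method 1

movie_loc = movie.loc[criteria]

Check that the DataFrames produced by loc and by plain boolean indexing are identical

movie_loc.equals(movie[criteria])
True

Method 2

movie_loc2 = movie.loc[criteria.values]
movie_loc2.equals(movie[criteria])
True

Using booleans with iloc

criteria is a Series that carries a row index, and iloc does not accept it directly; pass the underlying ndarray instead

movie_iloc = movie.iloc[criteria.values]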
movie_iloc.equals(movie_loc)
True
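
The difference between loc and iloc can be seen on a small, hypothetical frame. This sketch assumes a string index; the exact exception type raised by iloc depends on the index type and pandas version, so both likely types are caught.

```python
import pandas as pd

# Hypothetical three-row frame with string labels
df = pd.DataFrame({'score': [3.5, 8.1, 2.0]}, index=['a', 'b', 'c'])
criteria = df['score'] < 4

# loc aligns the boolean Series on the index labels, so it accepts it directly
via_loc = df.loc[criteria]

# iloc is purely positional and rejects a labeled boolean Series
try:
    df.iloc[criteria]
    rejected = False
except (ValueError, NotImplementedError):
    rejected = True

# Passing the underlying array works
via_iloc = df.iloc[criteria.to_numpy()]
print(rejected, via_iloc.equals(via_loc))
```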

5.2 Columns

Boolean indexing can also be used to select columns

criteria_col = movie.dtypes == np.int64
criteria_col.head()
color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
dtype: bool
movie.loc[:, criteria_col].head()

                                            num_voted_users  cast_total_facebook_likes  movie_facebook_likes
movie_title
Avatar                                      886204           4834                       33000
Pirates of the Caribbean: At World's End    471220           48350                      0
Spectre                                     275868           11700                      85000
The Dark Knight Rises                       1144337          106759                     164000
Star Wars: Episode VII - The Force Awakens  8                143                        0

movie.iloc[:, criteria_col.values].head()

                                            num_voted_users  cast_total_facebook_likes  movie_facebook_likes
movie_title
Avatar                                      886204           4834                       33000
Pirates of the Caribbean: At World's End    471220           48350                      0
Spectre                                     275868           11700                      85000
The Dark Knight Rises                       1144337          106759                     164000
Star Wars: Episode VII - The Force Awakens  8                143                        0
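
As a side note, pandas also offers select_dtypes for picking columns by type. The sketch below, on a hypothetical three-column frame, suggests it gives the same result as the boolean mask on dtypes; it is an alternative, not the approach used in the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one int64, one float and one string column
df = pd.DataFrame({
    'votes': np.array([10, 20], dtype='int64'),
    'score': [7.5, 8.0],
    'title': ['x', 'y'],
})

# Boolean mask over the columns, as in the text
by_mask = df.loc[:, df.dtypes == np.int64]

# Built-in selection by dtype
by_dtype = df.select_dtypes(include=['int64'])

print(by_mask.columns.tolist(), by_mask.equals(by_dtype))
```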

6 Using booleans - where/mask

mask() is the inverse boolean operation of where().

DataFrame.where(cond, other=nan, inplace=False, **kwargs)

Parameters:

  • cond : boolean NDFrame, array-like, or callable

    • Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn't check it).
    • In other words, cond is a boolean DataFrame with the same shape as df: wherever cond is True the original value is kept, otherwise the value is replaced with the corresponding value from other.
  • other : scalar, NDFrame, or callable
  • inplace : boolean, default False
    • Whether to perform the operation in place on the data
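
The inverse relationship between where and mask can be sketched on a three-element, hypothetical Series:

```python
import pandas as pd

s = pd.Series([1, 5, 9])  # hypothetical data
cond = s > 4

kept = s.where(cond)    # keeps values where cond is True, NaN elsewhere
hidden = s.mask(cond)   # replaces values where cond is True with NaN

print(kept.tolist())
print(hidden.tolist())
print(kept.equals(s.mask(~cond)))  # where(cond) matches mask(~cond)
```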

6.1 Using where on a Series

movie = pd.read_csv('data/movie.csv', index_col='movie_title')
fb_likes = movie['actor_1_facebook_likes'].dropna()
fb_likes.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Use describe to get a feel for the data

fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count      4909
mean       6494
          ...
90%       18000
max      640000
Name: actor_1_facebook_likes, Length: 10, dtype: int64

Compute the proportion of values with fewer than 20000 likes

criteria_high = fb_likes < 20000
criteria_high.mean().round(2)
0.91

where returns a Series of the same size, but every position where the condition is False is replaced with a missing value

fb_likes.where(criteria_high).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

The second parameter, other, lets you control the replacement value

fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Chain two where calls to set both an upper and a lower bound

criteria_low = fb_likes > 300
fb_likes_cap = fb_likes.where(criteria_high, other=20000).where(criteria_low, 300)
fb_likes_cap.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64
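
Capping values from both sides is what Series.clip does, so the chained where calls can probably be replaced by a single call. A sketch on hypothetical values (not the real fb_likes data):

```python
import pandas as pd

likes = pd.Series([131.0, 1000.0, 11000.0, 40000.0])  # hypothetical values

# Two chained where calls: cap at 20000 from above, then at 300 from below
capped_where = (
    likes.where(likes < 20000, other=20000)
         .where(likes > 300, other=300)
)

# The same bounds expressed with clip
capped_clip = likes.clip(lower=300, upper=20000)

print(capped_clip.tolist())
print(capped_where.equals(capped_clip))
```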

The original Series and the modified Series have the same length

len(fb_likes), len(fb_likes_cap)
(4909, 4909)

6.2 Using where on a DataFrame

df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'], 'ids2': ['a', 'n', 'c', 'n']})
print(df)
print(df < 2)
df.where(df < 2, 1000)
   vals ids ids2
0     1   a    a
1     2   b    n
2     3   f    c
3     4   n    n
    vals   ids  ids2
0   True  True  True
1  False  True  True
2  False  True  True
3  False  True  True

   vals ids ids2
0     1   a    a
1  1000   b    n
2  1000   f    c
3  1000   n    n

The code below gives the same values as df.where(df < 2, 1000), except that the numeric column comes back as float because of the intermediate NaN values.

print(df[df < 2])
df[df < 2].fillna(1000)
   vals ids ids2
0   1.0   a    a
1   NaN   b    n
2   NaN   f    c
3   NaN   n    n

     vals ids ids2
0     1.0   a    a
1  1000.0   b    n
2  1000.0   f    c
3  1000.0   n    n
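
The float conversion seen above can be reproduced in isolation. The sketch below uses a hypothetical numeric-only frame (mixing strings and numbers in a `<` comparison may not work in every pandas version) and compares the two routes.

```python
import pandas as pd

# Hypothetical numeric-only frame
nums = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 5, 1]})

# One step: replace directly where the condition is False
direct = nums.where(nums < 2, 1000)

# Two steps: introduce NaN first, then fill it
two_step = nums[nums < 2].fillna(1000)

print(direct.dtypes.tolist(), two_step.dtypes.tolist())
print((direct == two_step).all().all())
```

The values agree, but the two-step route passes through NaN and therefore ends up with float columns.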

Original article: https://www.cnblogs.com/shiyushiyu/p/9742808.html

Time: 2024-09-30 13:51:52
