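Missing data is a constant problem in real-world datasets. In fields such as machine learning and data mining, poor data quality caused by missing values seriously undermines the accuracy of model predictions, so handling missing values is a key step in making models more accurate and effective.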
The examples below use reindexing to create a DataFrame with missing values. In the output, NaN stands for "Not a Number" and marks a missing value.
1. Checking for Missing Values
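To make missing values easier to detect (and to do so consistently across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.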
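Example 1
import pandas as pd
import numpy as np

# Build a 5x3 frame of random values, then reindex to introduce all-NaN rows b, d and g
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print('\n')
# True where column 'one' is missing
print(df['one'].isnull())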
Output:
        one       two     three
a  0.036297 -0.615260 -1.341327
b       NaN       NaN       NaN
c -1.908168 -0.779304  0.212467
d       NaN       NaN       NaN
e  0.527409 -2.432343  0.190436
f  1.428975 -0.364970  1.084148
g       NaN       NaN       NaN
h  0.763328 -0.818729  0.240498

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
Example 2
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True where column 'one' holds a valid (non-missing) value
print(df['one'].notnull())
Output:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
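The boolean masks returned by isnull() can also be aggregated to summarize where data is missing. The following is a minimal sketch on a small hand-built DataFrame; the column names and values are illustrative assumptions, not taken from the examples above.

```python
import numpy as np
import pandas as pd

# Small deterministic frame; the values are illustrative only.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# isnull() returns a boolean mask; True counts as 1 when summed,
# so this gives the number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# notnull() is the element-wise complement of isnull().
print(df.notnull().equals(~df.isnull()))
```

Summing the mask is a quick way to decide, column by column, whether filling or dropping is the more reasonable strategy.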
2. Calculations with Missing Data
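- When summing data, NA is treated as 0
- If all of the data is NA, the sum is also 0 (pandas 0.22 and later; pass min_count=1 to get NaN instead)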
Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print('\n')
# NaN entries are skipped, so only rows a, c, e, f and h contribute to the sum
print(df['one'].sum())
Output:
        one       two     three
a -1.191036  0.945107 -0.806292
b       NaN       NaN       NaN
c  0.127794 -1.812588 -0.466076
d       NaN       NaN       NaN
e  2.358568  0.559081  1.486490
f -0.242589  0.574916 -0.831853
g       NaN       NaN       NaN
h -0.328030  1.815404 -1.706736

0.7247067964060545
Example 2
import pandas as pd

# No data is supplied, so every cell is NaN
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
print(df)
print('\n')
print(df['one'].sum())
Output:
   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN

0
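How sum() treats missing values can be controlled with its skipna and min_count parameters. The sketch below uses two small illustrative Series (the variable names and values are assumptions made for this example) to show the default behavior next to the stricter alternatives.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_na = pd.Series([np.nan, np.nan])

# NaN is skipped by default (skipna=True), so it behaves like 0 in a sum.
total = s.sum()

# An all-NaN Series sums to 0 by default ...
empty_total = all_na.sum()

# ... unless min_count demands at least one valid value, which yields NaN.
strict_total = all_na.sum(min_count=1)

# With skipna=False a single NaN propagates to the result.
propagated = s.sum(skipna=False)

print(total, empty_total, strict_total, propagated)
```

Passing min_count=1 is useful when a result of 0 would be indistinguishable from a genuine sum of zero.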
3. Filling Missing Data
Pandas provides several ways to clean up missing values. The fillna() function can "fill" NA values with non-null data in a number of ways.
Replacing NaN with a Scalar Value
The following program shows how to replace NaN with 0.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one', 'two', 'three'])
# Reindexing drops row e and introduces an all-NaN row b
df = df.reindex(['a', 'b', 'c'])
print(df)
print('\n')
print("NaN replaced with '0':")
# fillna returns a new DataFrame; df itself is left unchanged
print(df.fillna(0))
Output:
        one       two     three
a -0.479425 -1.711840 -1.453384
b       NaN       NaN       NaN
c -0.733606 -0.813315  0.476788

NaN replaced with '0':
        one       two     three
a -0.479425 -1.711840 -1.453384
b  0.000000  0.000000  0.000000
c -0.733606 -0.813315  0.476788
Here the missing values are filled with zero; any other value could be used instead.
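The fill value does not have to be a single scalar for the whole frame. As a minimal sketch, assuming a small illustrative DataFrame (names and values are made up for this example), fillna() also accepts a dict of per-column values or a Series such as the column means:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, 4.0, 8.0]})

# A dict maps each column name to its own fill value.
by_column = df.fillna({"one": 0.0, "two": -1.0})

# A Series (here the column means, computed with NaN skipped)
# is aligned on column labels before filling.
by_mean = df.fillna(df.mean())

print(by_column)
print(by_mean)
```

Filling with the column mean keeps each column's average unchanged, whereas filling with 0 can shift it noticeably.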
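Replacing Missing or Generic Values
Often a generic placeholder value has to be replaced with a specific one. This can be done with the replace() method. Replacing NA with a scalar value through replace() is equivalent to what fillna() does.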
Example
import pandas as pd

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000], 'two': [1000, 0, 30, 40, 50, 60]})
print(df)
print('\n')
# The dict maps old values to new ones: 1000 becomes 10 and 2000 becomes 60
print(df.replace({1000: 10, 2000: 60}))
Output:
    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
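The stated equivalence between replace() and fillna() for scalar replacement can be checked directly. This is a small sketch on an illustrative float DataFrame; the data is an assumption made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [10.0, np.nan, 30.0], "two": [np.nan, 0.0, 60.0]})

# Treat NaN as the "old" value and 0.0 as the "new" value.
via_replace = df.replace(np.nan, 0.0)

# The same substitution expressed with fillna().
via_fillna = df.fillna(0.0)

# equals() compares values and dtypes element by element.
print(via_replace.equals(via_fillna))
```

In practice fillna() states the intent more clearly for missing data, while replace() is the more general tool for swapping arbitrary values.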
Filling NA Forward and Backward
The fill concepts discussed in the reindexing chapter can also be used to fill in missing values.
Method | Action |
---|---|
pad/ffill | Fill values forward |
bfill/backfill | Fill values backward |
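Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
print('\n')
# Forward fill: each NaN takes the last valid value above it.
# Equivalent to the older df.fillna(method='pad'), deprecated since pandas 2.1.
print(df.ffill())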
Output:
        one       two     three
a -0.023243  1.671621 -1.687063
b       NaN       NaN       NaN
c -0.933355  0.609602 -0.620189
d       NaN       NaN       NaN
e  0.151455 -1.324563 -0.598897
f  0.605670 -0.924828 -1.050643
g       NaN       NaN       NaN
h  0.892414 -0.137194 -1.101791

        one       two     three
a -0.023243  1.671621 -1.687063
b -0.023243  1.671621 -1.687063
c -0.933355  0.609602 -0.620189
d -0.933355  0.609602 -0.620189
e  0.151455 -1.324563 -0.598897
f  0.605670 -0.924828 -1.050643
g  0.605670 -0.924828 -1.050643
h  0.892414 -0.137194 -1.101791
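Example 2
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Backward fill: each NaN takes the next valid value below it.
# Equivalent to the older df.fillna(method='backfill'), deprecated since pandas 2.1.
print(df.bfill())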
Output:
one two three
a 2.278454 1.550483 -2.103731
b -0.779530 0.408493 1.247796
c -0.779530 0.408493 1.247796
d 0.262713 -1.073215 0.129808
e 0.262713 -1.073215 0.129808
f -0.600729 1.310515 -0.877586
g 0.395212 0.219146 -0.175024
h 0.395212 0.219146 -0.175024
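Forward and backward filling are easier to follow on a short deterministic Series. The sketch below uses illustrative values (an assumption for this example) and also shows the limit parameter, which caps how many consecutive NaN entries are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0], index=list("abcd"))

# Forward fill: b and c both take the value of a.
forward = s.ffill()

# Backward fill: b and c both take the value of d.
backward = s.bfill()

# limit=1 fills at most one consecutive NaN, so c stays missing.
limited = s.ffill(limit=1)

print(forward)
print(backward)
print(limited)
```

A limit is worth considering for time-ordered data, where carrying a value across a long gap may be misleading.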
4. Dropping Missing Values
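To simply exclude missing values, use the dropna() function together with its axis parameter. By default axis=0, meaning it operates on rows: if any value in a row is NA, the entire row is dropped.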
Example 1
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Drop every row that contains at least one NaN (rows b, d and g)
print(df.dropna())
Output:
one two three
a -0.719623 0.028103 -1.093178
c 0.040312 1.729596 0.451805
e -1.029418 1.920933 1.289485
f 1.217967 1.368064 0.527406
h 0.667855 0.147989 -1.035978
Example 2
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# axis=1 drops columns instead; every column contains a NaN, so none remain
print(df.dropna(axis=1))
Output:
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
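Dropping on "any NA" can be too aggressive, as the empty result above shows. As a hedged sketch on a small illustrative DataFrame (the data is an assumption for this example), dropna() also accepts the how, thresh and subset parameters to relax the rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan],
        "two": [2.0, 5.0, np.nan],
        "three": [3.0, 6.0, np.nan],
    },
    index=["a", "b", "c"],
)

# Default (how="any"): drop rows with at least one NaN, leaving only row a.
any_na = df.dropna()

# how="all": drop only rows where every value is NaN (row c).
all_na = df.dropna(how="all")

# thresh=2: keep rows that have at least two non-NaN values.
at_least_two = df.dropna(thresh=2)

# subset: only look at column "two" when deciding which rows to drop.
subset_only = df.dropna(subset=["two"])

print(any_na, all_na, at_least_two, subset_only, sep="\n\n")
```

Each call returns a new DataFrame and leaves df unchanged, so the variants can be compared side by side.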
Original article: https://www.cnblogs.com/Summer-skr--blog/p/11705887.html
Time: 2024-10-09 19:24:58