python data analysis | Python data preprocessing (based on the scikit-learn module)

Original: http://www.jianshu.com/p/94516a58314d

  • Dataset transformations
  • Combining estimators
  • Feature extraction
  • Preprocessing data


1 Dataset transformations


scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

scikit-learn provides a set of transformer modules for data cleaning, dimensionality reduction, expansion, and feature extraction.

Like other estimators, these are represented by classes with fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.

scikit-learn transformers share four common methods: fit(X, y=None), transform(X), fit_transform(X), and inverse_transform(newX). fit learns the model parameters from the data; transform applies the learned transformation (e.g. dimensionality reduction) to data after fitting; fit_transform fits the model and returns the transformed X in one step; inverse_transform maps transformed data back to the original representation. A minimal sketch follows.
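
A minimal sketch of these four methods, using MinMaxScaler as a stand-in transformer:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[1., 2.], [3., 4.], [5., 6.]])
    scaler = MinMaxScaler()
    scaler.fit(X_train)                          # learn per-feature min and max from the training set
    X_scaled = scaler.transform(X_train)         # apply the learned scaling
    X_scaled2 = scaler.fit_transform(X_train)    # fit and transform in one step
    X_back = scaler.inverse_transform(X_scaled)  # map the scaled data back to the original values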

1.1 Combining estimators

  • 1.1.1 Pipeline: chaining estimators

    The Pipeline module chains a sequence of estimators into a single one. This is convenient for a fixed sequence of steps, e.g. combining feature selection, standardization, and classification.

    • Usage
      Code:

      from sklearn.pipeline import Pipeline
      from sklearn.svm import SVC
      from sklearn.decomposition import PCA
      from sklearn.pipeline import make_pipeline
      # define the estimators
      # the argument is a list of (key, value) pairs, where the key is the string naming
      # the step and the value is an estimator object
      estimators = [('reduce_dim', PCA()), ('svm', SVC())]
      # combine the estimators
      clf1 = Pipeline(estimators)
      clf2 = make_pipeline(PCA(), SVC())  # make_pipeline() does the same, naming the steps automatically
      print(clf1, '\n', clf2)

      Output:

      Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False))])
      Pipeline(steps=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False))])

      Parameters of the estimators in the pipeline can be set with the set_params() method, using names of the form <estimator>__<parameter> (note the double underscore):

      clf1.set_params(svm__C=10)

      This is particularly important when doing a grid search:

      from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older scikit-learn versions
      params = dict(reduce_dim__n_components=[2, 5, 10], svm__C=[0.1, 10, 100])
      grid_search = GridSearchCV(clf1, param_grid=params)

      In this example the estimator produced by the pipeline is treated like any ordinary estimator, and its parameters are addressed in the <estimator>__<parameter> form.
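
      As an end-to-end illustration (my own toy run, using the iris dataset and a smaller parameter grid so that PCA never asks for more components than iris has features), the grid search tunes the whole pipeline as a single estimator:

      from sklearn.datasets import load_iris
      iris = load_iris()
      params = dict(reduce_dim__n_components=[2, 3], svm__C=[0.1, 10, 100])
      grid_search = GridSearchCV(clf1, param_grid=params)
      grid_search.fit(iris.data, iris.target)   # fits and cross-validates every parameter combination
      print(grid_search.best_params_)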

    • Note
      1. The dir() function lists all attributes and methods of clf1; for example, the steps attribute holds the (name, estimator) pair of each step.

      >>> clf1.steps[0]
      ('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

      2. Calling fit on the pipeline is equivalent to calling fit on each contained estimator in turn, transforming the input and passing the result on to the next step. The pipeline exposes all the methods of its last estimator: if the last step is a classifier, the pipeline acts as a classifier; if it is a transformer, the pipeline acts as a transformer, and so on. Steps can also be looked up by name, as the sketch below shows.
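
      As a complement to note 1, each step can also be reached by name through the pipeline's named_steps attribute; a minimal sketch, reusing clf1 from above:

      print(clf1.steps[0])                   # ('reduce_dim', PCA(...)), by position
      print(clf1.named_steps['reduce_dim'])  # the same PCA object, by name
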
  • 1.1.2 FeatureUnion: composite feature spaces

    Unlike Pipeline, FeatureUnion combines only transformers; the two can also be nested to build more complex models.

    FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

    • Usage
      Code:

      from sklearn.pipeline import FeatureUnion
      from sklearn.decomposition import PCA
      from sklearn.decomposition import KernelPCA
      from sklearn.pipeline import make_union
      # define the transformers
      # the argument is a list of (key, value) pairs, where the key is the string naming
      # the step and the value is a transformer object
      estimators = [('linear_pca', PCA()), ('Kernel_pca', KernelPCA())]
      # combine the transformers
      clf1 = FeatureUnion(estimators)
      clf2 = make_union(PCA(), KernelPCA())  # make_union() does the same, naming the steps automatically
      print(clf1, '\n', clf2)
      print(dir(clf1))

      Output:

      FeatureUnion(n_jobs=1,
         transformer_list=[('linear_pca', PCA(copy=True, n_components=None, whiten=False)), ('Kernel_pca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
       fit_inverse_transform=False, gamma=None, kernel='linear',
       kernel_params=None, max_iter=None, n_components=None,
       remove_zero_eig=False, tol=0))],
         transformer_weights=None)
      FeatureUnion(n_jobs=1,
         transformer_list=[('pca', PCA(copy=True, n_components=None, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, degree=3, eigen_solver='auto',
       fit_inverse_transform=False, gamma=None, kernel='linear',
       kernel_params=None, max_iter=None, n_components=None,
       remove_zero_eig=False, tol=0))],
         transformer_weights=None)

      As the output shows, FeatureUnion is used in the same way as Pipeline; a quick sanity check of the combined output follows.
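
      A quick sanity check on toy data of my own that the union concatenates the transformer outputs column-wise:

      import numpy as np
      union = FeatureUnion([('linear_pca', PCA(n_components=2)),
                            ('kernel_pca', KernelPCA(n_components=3))])
      X = np.random.rand(10, 5)
      print(union.fit_transform(X).shape)   # (10, 5): 2 PCA columns + 3 KernelPCA columns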

    • Note

      (A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

      Here is an example Python source file: feature_stacker.py

1.2 Feature extraction

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

The sklearn.feature_extraction module extracts features from data such as text and images into a format supported by machine learning algorithms.
Note:
Feature extraction is different from feature selection: the former transforms non-numeric data such as text and images into numeric features, while the latter is a machine-learning technique applied to those features (e.g. dimensionality reduction with PCA).

  • 1.2.1 Loading features from dicts

    The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict
    objects to the NumPy/SciPy representation used by scikit-learn estimators.
    The DictVectorizer class converts Python's built-in dict type into numeric arrays. A benefit of the dict representation is that for sparse data, absent values need not be stored at all.

    Code:

    from sklearn.feature_extraction import DictVectorizer
    measurements = [{'city': 'Dubai', 'temperature': 33.},
                    {'city': 'London', 'temperature': 12.},
                    {'city': 'San Francisco', 'temperature': 18.}]
    vec = DictVectorizer()
    x = vec.fit_transform(measurements).toarray()
    print(x)
    print(vec.get_feature_names())

    Output:

    [[  1.   0.   0.  33.]
    [  0.   1.   0.  12.]
    [  0.   0.   1.  18.]]
    ['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
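
    As a follow-up sketch (my own example, not from the original post): applying the fitted vectorizer to new records maps known categories to their one-hot columns, while a category unseen during fit ('Tokyo' here) is silently dropped:

    new = [{'city': 'London', 'temperature': 20.}, {'city': 'Tokyo', 'temperature': 25.}]
    print(vec.transform(new).toarray())
    # [[  0.   1.   0.  20.]
    #  [  0.   0.   0.  25.]]
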
  • 1.2.2 Feature hashing
  • 1.2.3 Text feature extraction
  • 1.2.4 Image feature extraction

    The three subsections above are not covered here (they involve natural-language and image processing); see the official documentation.

1.3 Preprocessing data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

The sklearn.preprocessing module provides several common data transformations, such as standardization and normalization.

  • 1.3.1 Standardization, or mean removal and variance scaling

    Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

    Many learning algorithms require the data to be standardized beforehand; if the individual features are not roughly standard normal (zero mean, unit variance), the estimators may behave badly.

    • Usage

    Code:

    from sklearn import preprocessing
    import numpy as np
    X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
    Y = X
    Y_scaled = preprocessing.scale(Y)
    y_mean = Y_scaled.mean(axis=0)  # axis=0: per-feature statistics; axis=1: per-sample statistics
    y_std = Y_scaled.std(axis=0)
    print(Y_scaled)
    scaler = preprocessing.StandardScaler().fit(Y)  # the StandardScaler class does the same job
    print(scaler.transform(Y))

    Output:

    [[ 0.         -1.22474487  1.33630621]
    [ 1.22474487  0.         -0.26726124]
    [-1.22474487  1.22474487 -1.06904497]]
    [[ 0.         -1.22474487  1.33630621]
    [ 1.22474487  0.         -0.26726124]
    [-1.22474487  1.22474487 -1.06904497]]
    • Note
      1. The scale function standardizes an array in a single call.
      2. The StandardScaler class does the same through the fit/transform API.
      3. StandardScaler is a transformer, so it can be used inside a pipeline.
      The MinMaxScaler class (min-max scaling to [0, 1]) and the MaxAbsScaler class (scaling to [-1, 1]) are two other scalers, used just like StandardScaler.
      4. For sparse data, MaxAbsScaler is the appropriate choice (shifting the data, as MinMaxScaler does, would destroy sparsity).
      5. A robust standardization method (suited to data with many outliers):

      the median and the interquartile range often give better results

    That is, center with the median instead of the mean, and scale with the interquartile range (upper quartile minus lower quartile) instead of the standard deviation; the RobustScaler class implements this scheme, as sketched below.
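
      A minimal sketch of these alternative scalers, reusing X from the code above:

      from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler
      print(MinMaxScaler().fit_transform(X))   # each feature scaled to [0, 1]
      print(MaxAbsScaler().fit_transform(X))   # each feature scaled to [-1, 1] by its max absolute value
      print(RobustScaler().fit_transform(X))   # centered on the median, scaled by the IQR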

  • 1.3.2 Imputation of missing values
    • Usage
      Code:

      import scipy.sparse as sp
      from sklearn.preprocessing import Imputer
      X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
      # 0 marks a missing value; each column's mean of the non-missing entries fills it in
      imp = Imputer(missing_values=0, strategy='mean', axis=0)
      imp.fit(X)
      X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
      print(X_test)
      print(imp.transform(X_test))

      Output:

      (1, 0)    6
      (2, 0)    7
      (0, 1)    2
      (2, 1)    6
      [[ 4.          2.        ]
      [ 6.          3.66666675]
      [ 7.          6.        ]]
    • Note
      1. scipy.sparse stores sparse matrices.
      2. Imputer can operate on scipy.sparse sparse matrices directly; a dense-array variant is sketched below.
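
      A dense-array variant as a sketch (my own example; note that in later scikit-learn releases Imputer was replaced by sklearn.impute.SimpleImputer):

      import numpy as np
      X_dense = np.array([[np.nan, 2.], [6., np.nan], [7., 6.]])
      imp2 = Imputer(missing_values='NaN', strategy='median', axis=0)
      print(imp2.fit_transform(X_dense))   # NaNs replaced by the column medians 6.5 and 4.0
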
  • 1.3.3 Generating polynomial features
    • Usage
      Code:

      import numpy as np
      from sklearn.preprocessing import PolynomialFeatures
      X = np.arange(6).reshape(3, 2)
      print(X)
      poly = PolynomialFeatures(2)  # degree-2 features: 1, x1, x2, x1^2, x1*x2, x2^2
      print(poly.fit_transform(X))

      Output:

      [[0 1]
      [2 3]
      [4 5]]
      [[  1.   0.   1.   0.   0.   1.]
      [  1.   2.   3.   4.   6.   9.]
      [  1.   4.   5.  16.  20.  25.]]
    • Note
      Polynomial feature generation is used in polynomial regression and in polynomial kernel methods; a sketch of the former follows this note.
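
      A hedged sketch of polynomial regression on toy data of my own: expand the features, then fit an ordinary linear model on them:

      from sklearn.linear_model import LinearRegression
      from sklearn.pipeline import make_pipeline
      x = np.arange(10).reshape(-1, 1)
      y = 1.0 + 2.0 * x.ravel() + 3.0 * x.ravel() ** 2
      model = make_pipeline(PolynomialFeatures(2), LinearRegression())
      model.fit(x, y)
      print(model.predict([[10]]))   # close to 1 + 2*10 + 3*100 = 321
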
  • 1.3.4 Custom transformers

    FunctionTransformer builds a transformer out of an arbitrary function.

    • Usage:
      Code:

      import numpy as np
      from sklearn.preprocessing import FunctionTransformer
      transformer = FunctionTransformer(np.log1p)  # wraps log(1 + x) as a transformer
      x = np.array([[0, 1], [2, 3]])
      print(transformer.transform(x))

      Output:

      [[ 0.          0.69314718]
      [ 1.09861229  1.38629436]]
    • Note

      For a full code example that demonstrates using a FunctionTransformer to do custom feature selection, see Using FunctionTransformer to select columns
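
      A minimal column-selection sketch in the same spirit (my own toy function, not the linked example):

      def first_column(X):
          return X[:, [0]]   # keep only the first column, preserving the 2-D shape

      selector = FunctionTransformer(first_column)
      print(selector.transform(np.array([[0, 1], [2, 3]])))   # [[0], [2]]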

By houhzize (Jianshu author)
Original link: http://www.jianshu.com/p/94516a58314d
Copyright belongs to the author; contact the author for permission before reposting, and credit the "Jianshu author".



< python for data analysis >一书的第十章例程, 主要介绍时间序列(time series)数据的处理.label:1. datetime object.timestamp object.period object2. pandas的Series和DataFrame object的两种特殊索引:DatetimeIndex 和 PeriodIndex3. 时区的表达与处理4. imestamp object.period object的频率概念,及其频率转换5. 两种频