This article is based on: http://scikit-learn.org/stable/data_transforms.html
This post is mainly about data preprocessing, which covers four parts:
data cleaning, dimensionality reduction (the PCA family), dimensionality expansion (the kernel-approximation family), and extracting your own custom features.
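As a loose illustration (my own sketch, not part of the original post; the specific transformer classes are just representative choices), each of the four parts maps onto a family of scikit-learn transformers:

import numpy as np
from sklearn.preprocessing import StandardScaler              # clean: standardization
from sklearn.decomposition import PCA                         # reduce: dimensionality reduction
from sklearn.kernel_approximation import Nystroem             # expand: kernel approximation
from sklearn.feature_extraction.text import CountVectorizer   # generate: feature extraction

X = np.random.RandomState(0).rand(10, 5)

X_clean = StandardScaler().fit_transform(X)              # zero mean, unit variance per column
X_reduced = PCA(n_components=2).fit_transform(X)         # project onto 2 principal components
X_expanded = Nystroem(n_components=8).fit_transform(X)   # approximate RBF kernel feature map

docs = ["hello world", "hello scikit-learn"]
X_text = CountVectorizer().fit_transform(docs)           # bag-of-words sparse matrix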
Haha. Focusing on preprocessing is still the more reliable approach.
The key passage is kept in the original English: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
The difference between fit, transform and fit_transform:
fit: learns the model's parameters from the training set (for example a mean and variance, a median, or a vocabulary).
transform: maps the training/test data into the representation defined by the parameters learned in fit (e.g. scaling the test set with the training statistics, or vectorizing the test set against the vocabulary obtained in fit).
fit_transform: performs fit and transform in a single step on the same data.
Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
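A minimal sketch of the difference, using StandardScaler (my own example, not taken from the post or the docs):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # fit: learn mean and std from the training set only
X_train_scaled = scaler.transform(X_train)   # transform: apply the learned parameters
X_test_scaled = scaler.transform(X_test)     # the test set is scaled with the *training* statistics

# fit_transform: fit followed by transform on the same data, in one call
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
assert np.allclose(X_train_scaled, X_train_scaled_2)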
There are eight major sections; translations will be added gradually:
4.1. Pipeline and FeatureUnion: combining estimators
4.1.1. Pipeline: chaining estimators
4.1.2. FeatureUnion: composite feature spaces
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46991465
4.2.3. Text feature extraction
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46997379
4.2.4. Image feature extraction
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46992105
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47016313
4.3.1. Standardization, or mean removal and variance scaling
4.3.4. Encoding categorical features
4.3.5. Imputation of missing values
4.4. Unsupervised dimensionality reduction
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47066239
4.4.1. PCA: principal component analysis
4.4.3. Feature agglomeration
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47067003
4.5.1. The Johnson-Lindenstrauss lemma
4.5.2. Gaussian random projection
4.5.3. Sparse random projection
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47068223
4.6.1. Nystroem Method for Kernel Approximation
4.6.2. Radial Basis Function Kernel
4.6.3. Additive Chi Squared Kernel
4.6.4. Skewed Chi Squared Kernel
4.7. Pairwise metrics, Affinities and Kernels
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47068895
4.8. Transforming the prediction target (y)
For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47069869