scikit-learn: 4. Dataset preprocessing (clean the data, reduce dimensions, expand dimensions, generate features)

Reference: http://scikit-learn.org/stable/data_transforms.html

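This post covers data preprocessing, in four parts:

data cleaning, dimensionality reduction (PCA and related methods), dimensionality expansion (kernel methods), and extraction of custom features.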

Ha, focusing on preprocessing still turns out to be the more reliable bet.


The key passage, quoted from the documentation: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.
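As a rough illustration of the four families named in that passage, the sketch below applies one transformer of each kind to toy data. The data, the variable names and the choice of transformers (StandardScaler, PCA, RBFSampler, CountVectorizer) are assumptions made for this example; the documentation lists many alternatives for each family.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler
from sklearn.feature_extraction.text import CountVectorizer

# Toy numeric data (assumed): 6 samples, 4 features, arbitrary values.
rng = np.random.RandomState(0)
X = rng.rand(6, 4)

# clean: rescale every feature to zero mean and unit variance
X_clean = StandardScaler().fit_transform(X)

# reduce: project the 4 features onto 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)

# expand: map the 4 features to 10 approximate RBF-kernel features
X_expanded = RBFSampler(n_components=10, random_state=0).fit_transform(X)

# generate: turn raw text into a bag-of-words count matrix
docs = ["the cat sat", "the dog sat", "the cat ran"]
X_text = CountVectorizer().fit_transform(docs)

print(X_clean.shape, X_reduced.shape, X_expanded.shape, X_text.shape)
```

Every object here exposes the same fit / transform / fit_transform interface, which is what lets them be swapped or chained.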

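Differences between fit, transform and fit_transform:

fit: learns the model's parameters from the training set (for example a variance or a median, or a vocabulary in the case of text).

transform: applies the parameters learned by fit to the training or test data (for example, rescaling the test set with the statistics learned from the training set, or encoding test documents as vectors over the vocabulary obtained by fit).

fit_transform: performs fit and transform in a single call.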
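The vocabulary case can be sketched as follows. The documents are made up for this example, and CountVectorizer is used here as one transformer that learns a vocabulary; other text vectorizers behave similarly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed toy corpus; "durian" never appears in the training documents.
train_docs = ["apple banana", "banana cherry"]
test_docs = ["apple durian"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)  # learn the vocabulary from the training set only

# vocabulary_ maps each term to its column index; sorting the terms
# gives them in column order.
vocab = sorted(vectorizer.vocabulary_)

# transform reuses the learned vocabulary: unseen words are ignored.
X_test = vectorizer.transform(test_docs).toarray()

print(vocab)
print(X_test)
```

The test document is encoded only over the three terms seen during fit, so the unseen word contributes nothing to the resulting vector.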

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
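The mean and standard deviation case from the quote can be sketched with StandardScaler. The numbers are toy values chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed toy data: one feature, three training samples, one unseen sample.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from the training set

# transform applies the training statistics to unseen data;
# the test set's own mean and variance are never computed.
X_test_scaled = scaler.transform(X_test)

# fit_transform fits and transforms the training data in one call.
X_train_scaled = StandardScaler().fit_transform(X_train)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```

Calling fit on the test set instead would leak test statistics into the preprocessing, which is why only transform is applied to it.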

The chapter has eight parts; translations will be added gradually:

4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

4.1.2. FeatureUnion: composite feature spaces

Translated article: http://blog.csdn.net/mmc2015/article/details/46991465

4.2. Feature extraction

4.2.3. Text feature extraction

Translated article: http://blog.csdn.net/mmc2015/article/details/46997379

4.2.4. Image feature extraction

Translated article: http://blog.csdn.net/mmc2015/article/details/46992105

4.3. Preprocessing data

Translated article: http://blog.csdn.net/mmc2015/article/details/47016313

4.3.1. Standardization, or mean removal and variance scaling

4.3.2. Normalization

4.3.3. Binarization

4.3.4. Encoding categorical features

4.3.5. Imputation of missing values

4.4. Unsupervised dimensionality reduction

Translated article: http://blog.csdn.net/mmc2015/article/details/47066239

4.4.1. PCA: principal component analysis

4.4.2. Random projections

4.4.3. Feature agglomeration

4.5. Random Projection

Translated article: http://blog.csdn.net/mmc2015/article/details/47067003

4.5.1. The Johnson-Lindenstrauss lemma

4.5.2. Gaussian random projection

4.5.3. Sparse random projection

4.6. Kernel Approximation

Translated article: http://blog.csdn.net/mmc2015/article/details/47068223

4.6.1. Nystroem Method for Kernel Approximation

4.6.2. Radial Basis Function Kernel

4.6.3. Additive Chi Squared Kernel

4.6.4. Skewed Chi Squared Kernel

4.6.5. Mathematical Details

4.7. Pairwise metrics, Affinities and Kernels

Translated article: http://blog.csdn.net/mmc2015/article/details/47068895

4.7.1. Cosine similarity

4.7.2. Linear kernel

4.7.3. Polynomial kernel

4.7.4. Sigmoid kernel

4.7.5. RBF kernel

4.7.6. Chi-squared kernel

4.8. Transforming the prediction target (y)

Translated article: http://blog.csdn.net/mmc2015/article/details/47069869

4.8.1. Label binarization

4.8.2. Label encoding
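To make the outline a little more concrete, here is a minimal sketch of how several of the listed parts might fit together: a Pipeline (4.1.1) that scales the data (4.3.1), builds a FeatureUnion (4.1.2) of PCA components (4.4.1) and approximate RBF kernel features (4.6.2), and then fits a classifier. The toy data, the step names and the choice of LogisticRegression are assumptions made for this example, not something taken from the original chapter.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression

# Assumed toy data: 40 samples, 5 features, labels derived from feature 0.
rng = np.random.RandomState(0)
X = rng.rand(40, 5)
y = (X[:, 0] > 0.5).astype(int)

# FeatureUnion concatenates the outputs of its transformers column-wise.
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("rbf", RBFSampler(n_components=8, random_state=0)),
])

# Pipeline chains the steps; fit calls fit_transform on each transformer
# in turn and finally fits the classifier on the combined features.
model = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression()),
])
model.fit(X, y)

# Inspect the intermediate representation produced by the fitted steps.
X_scaled = model.named_steps["scale"].transform(X)
combined = model.named_steps["features"].transform(X_scaled)

print(combined.shape)
print(model.predict(X[:5]))
```

Wrapping the steps in a single Pipeline means every transformer is fitted on the training data only, and the same fitted transformations are applied at prediction time.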

Time: 2024-10-25 11:41:20
