scikit-learn (models used relatively often in real projects): 1.13. Feature selection

Reference: http://scikit-learn.org/stable/modules/feature_selection.html

The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.

1、Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection: it removes all features whose variance does not reach a given threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

Suppose we want to remove all features that are either zero or one in more than 80% of the samples (assuming boolean features). Since boolean features are Bernoulli random variables, their variance is Var[X] = p(1-p), so we can use the threshold 0.8 * (1 - 0.8):

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

The first column was removed because the fraction of zeros in it is p = 5/6 > 0.8, so its variance p(1-p) = 5/36 ≈ 0.14 falls below the 0.16 threshold.
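
To see why, here is a small verification sketch using numpy (np.var with its default ddof=0 computes the same population variance that VarianceThreshold uses):

import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
print(np.var(X, axis=0))   # approx. [0.139, 0.222, 0.25]
# Only the first column falls below the 0.8 * (1 - 0.8) = 0.16 threshold, so it is dropped.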

2、Univariate
feature selection(单变量特征选择)(我用这个非常多)

Univariate feature selection is based on univariate statistical tests. It comes in several variants:

  • SelectKBest removes all but the k highest scoring features
  • SelectPercentile removes all but a user-specified highest scoring percentage of features
  • SelectFpr, SelectFdr and SelectFwe use common univariate statistical tests for each feature: false positive rate, false discovery rate, and family-wise error, respectively
  • GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator

For example, we can perform a chi-squared test on the samples and keep only the two best features:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

A few things to note:

1) These objects take as input a scoring function that returns univariate p-values. For regression: f_regression; for classification: chi2 or f_classif.

Beware not to use a regression scoring function with a classification problem; you will get useless results.

2) Feature selection with sparse data: if you use sparse data (i.e. data represented as sparse matrices), only chi2 will deal with the data without making it dense.
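
As a small illustration of the percentile-based variant mentioned in the list above, here is a sketch on the same iris data (the f_classif scoring function and the 50% percentile are illustrative choices, not from the original text):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

iris = load_iris()
X, y = iris.data, iris.target
selector = SelectPercentile(f_classif, percentile=50)   # keep the top 50% of features
X_new = selector.fit_transform(X, y)
print(selector.scores_)   # ANOVA F-values, one per original feature
print(X_new.shape)        # (150, 2): half of the 4 iris features are kept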

Example: Univariate Feature Selection

3、Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features: starting from the full feature set, the features whose absolute weights are the smallest are pruned from the current set at each step, until the desired number of features is reached.

RFECV performs RFE in a cross-validation loop to find the optimal number of features.
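
Neither class is shown with code in the original section, so here is a minimal RFE sketch on the same iris data (the linear SVC as the external estimator and n_features_to_select=2 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
estimator = SVC(kernel="linear")   # a linear kernel exposes coef_ weights for ranking
selector = RFE(estimator, n_features_to_select=2, step=1)   # drop one feature per iteration
X_new = selector.fit_transform(X, y)
print(selector.ranking_)   # rank 1 marks the selected features
print(X_new.shape)         # (150, 2)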

4、L1-based feature selection

The sparsity-inducing effect of the L1 penalty needs no further introduction here:

>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)
>>> X_new.shape
(150, 3)
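
In recent scikit-learn releases the transform() shortcut used above was removed from the estimators themselves; a hedged sketch of the equivalent selection using SelectFromModel:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)   # keep features with non-zero coefficients
X_new = model.transform(X)
print(X_new.shape)   # e.g. (150, 3), as in the snippet above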

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.


5、Tree-based feature selection (this one is also used quite a lot)

Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features:


>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> X_new = clf.fit(X, y).transform(X)
>>> clf.feature_importances_
array([ 0.04...,  0.05...,  0.4...,  0.4...])
>>> X_new.shape
(150, 2)
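
Again, newer scikit-learn versions drop the transform() call on the fitted classifier; a sketch of the same selection via SelectFromModel (by default it keeps features whose importance exceeds the mean importance):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
clf = ExtraTreesClassifier(n_estimators=100).fit(X, y)
model = SelectFromModel(clf, prefit=True)   # keeps features above the mean importance
X_new = model.transform(X)
print(clf.feature_importances_)
print(X_new.shape)   # typically (150, 2) on iris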

6、Feature selection as part of a pipeline

Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to do this in scikit-learn is to use a sklearn.pipeline.Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
  ('feature_selection', LinearSVC(penalty="l1", dual=False)),  # the L1 penalty requires dual=False
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

In this snippet we make use of a sklearn.svm.LinearSVC to evaluate feature importances and select the most relevant features. Then, a sklearn.ensemble.RandomForestClassifier is trained on the transformed output, i.e. using only the relevant features. You can of course perform similar operations with the other feature selection methods, and also with classifiers that provide a way to evaluate feature importances. See the sklearn.pipeline.Pipeline examples for more details.
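
In newer scikit-learn versions an intermediate pipeline step must expose transform(), so a bare LinearSVC no longer works there; a hedged sketch of the equivalent pipeline, wrapping the selector step in SelectFromModel:

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Same idea as above, but the L1 model is wrapped in SelectFromModel so that
# the feature_selection step exposes a transform() method.
clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)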
