The steps that may be taken to solve a feature selection problem (a feature selection checklist)

Reference: the JMLR paper "An Introduction to Variable and Feature Selection" (Guyon and Elisseeff, 2003).

We summarize the steps that may be taken to solve a feature selection problem in a checklist:

1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.

2. Are your features commensurate (i.e., measurable on the same scale)? If no, consider normalizing them.
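
A minimal sketch of step 2 with NumPy, using a hypothetical toy matrix whose two features live on very different scales; z-score normalization is one common choice:

```python
import numpy as np

# Toy design matrix: 4 examples, 2 features on very different scales
# (hypothetical data, just to illustrate the idea).
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 1500.0],
              [4.0, 2500.0]])

# Z-score normalization: subtract the mean and divide by the
# standard deviation of each column (feature).
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# After this, every feature has mean 0 and unit variance,
# so the two features are now commensurate.
print(X_norm.mean(axis=0))
print(X_norm.std(axis=0))
```

Other normalizations (min-max scaling, robust scaling by the median and IQR) follow the same column-wise pattern.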

3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features (i.e., treating combinations of variables as a single feature, or higher-order features), as much as your computer resources allow you (see example of use in Section 4.4).
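
One simple way to realize step 3 is to append all pairwise products of features, a sketch of which (on hypothetical data) is:

```python
import numpy as np
from itertools import combinations

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# For every pair of columns (i, j), add x_i * x_j as a new feature.
# This is one basic form of "products of features"; it grows
# quadratically in the number of features, hence the caveat about
# computer resources.
pairs = list(combinations(range(X.shape[1]), 2))
products = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
X_expanded = np.hstack([X, products])

print(X_expanded.shape)  # (2, 6): 3 original + 3 product features
```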

4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features (i.e., new features that each combine several variables), e.g. by clustering or matrix factorization (see Section 5).
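
One way to build weighted sums of features, as step 4 suggests, is matrix factorization. A sketch using a truncated SVD on random synthetic data (the choice of k = 2 components is arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # 20 examples, 5 input variables

# Center the data, then factor it with the SVD. Projecting onto the
# top k right singular vectors gives k new features, each of which is
# a weighted sum of the original 5 variables (no variable is pruned).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T

print(X_reduced.shape)  # (20, 2)
```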

5. Do you need to assess features individually (e.g. to understand their influence on the system, or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method (Section 2 and Section 7.2); else, do it anyway to get baseline results.
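
A minimal variable-ranking sketch for step 5, ranking by the absolute Pearson correlation with the target on synthetic data (where only x1 and x2 are truly relevant):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                                 # irrelevant variable
y = 3.0 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, x3])

# Score each variable by |Pearson correlation| with the target,
# then sort so the most relevant variable comes first.
corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                  for j in range(X.shape[1])])
ranking = np.argsort(-corrs)
print(ranking)  # x1 (index 0) should rank first
```

Correlation ranking is cheap but only detects individual linear relevance; the paper's Section 2 discusses its limitations.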

6. Do you need a predictor? If no, stop.

7. Do you suspect your data is "dirty" (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them (note: "them" refers to the examples, not the features).
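
A sketch of step 7 on synthetic data: represent each example by its top-ranked variables only, then flag examples with an extreme z-score. The z > 4 threshold is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X_top = rng.normal(size=(100, 2))  # examples restricted to top-ranked variables
X_top[0] = [8.0, -9.0]             # plant one obviously dirty example

# Flag any example whose z-score exceeds the threshold in some
# top-ranked variable; these are candidates to check and/or discard.
z = np.abs((X_top - X_top.mean(axis=0)) / X_top.std(axis=0))
outliers = np.where((z > 4).any(axis=1))[0]
print(outliers)  # the planted example (index 0) should be flagged
```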

8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method (Section 4.2) with the "probe" method as a stopping criterion (Section 6), or use the L0-norm embedded method (Section 4.3). For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
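
A sketch of greedy forward selection with a linear (least-squares) predictor, as step 8 recommends, on synthetic data where only features 1 and 4 carry signal; for simplicity this stops after a fixed number of features rather than using the probe criterion:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 6
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.1, size=n)

def fit_rss(cols):
    """Least-squares fit on the given columns; return the residual sum of squares."""
    A = np.column_stack([X[:, cols], np.ones(n)])  # include an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

# Forward selection: at each round, add the feature that most
# reduces the training error of the linear predictor.
selected, remaining = [], list(range(d))
for _ in range(2):
    best = min(remaining, key=lambda j: fit_rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # should recover the informative features 1 and 4
```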

9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods (Section 4). Use linear and non-linear predictors. Select the best approach with model selection (Section 6).

10. Do you want a stable solution (to improve performance and/or understanding)? If yes, sub-sample your data and redo your analysis for several "bootstraps" (Section 7.1).
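
A sketch of the bootstrap stability check in step 10 on synthetic data: resample the examples with replacement, redo the ranking of step 5 on each bootstrap, and see how consistently the same variable wins. The 20-bootstrap count is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)  # only feature 0 is relevant

# For each bootstrap: resample examples with replacement, rank the
# variables by |correlation| with the target, and record the winner.
top_counts = np.zeros(d, dtype=int)
for _ in range(20):
    idx = rng.integers(0, n, size=n)
    Xb, yb = X[idx], y[idx]
    corrs = np.array([abs(np.corrcoef(Xb[:, j], yb)[0, 1])
                      for j in range(d)])
    top_counts[np.argmax(corrs)] += 1

print(top_counts)  # a stable selection wins on (nearly) every bootstrap
```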

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.

Date: 2024-10-11 09:22:00
