Feature Engineering versus Feature Extraction: Game On!

"Feature engineering" is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if you have a date field as a predictor and the response differs much more between weekends and weekdays than across other splits, then encoding the date as "weekend versus weekday" makes it easier to achieve good results.

However, this depends on a lot of things.

First, it is model-dependent. For example, trees might have trouble with a classification data set if the class boundary is a diagonal line since their class boundaries are made using orthogonal slices of the data (oblique trees excepted).

Second, the process of predictor encoding benefits the most from subject-specific knowledge of the problem. In my example above, you need to know the patterns of your data to improve the format of the predictor. Feature engineering is very different in image processing, information retrieval, RNA expressions profiling, etc. You need to know something about the problem and your particular data set to do it well.

Here is some training set data where two predictors are used to model a two-class system (I'll unblind the data at the end):

There is also a corresponding test set that we will use below.

There are some observations that we can make:

  • The data are highly correlated (correlation = 0.85)
  • Each predictor appears to be fairly right-skewed
  • They appear to be informative in the sense that you might be able to draw a diagonal line to differentiate the classes

Depending on which model we choose, the between-predictor correlation might bother us. Also, we should look to see whether the individual predictors are important. To measure this, we'll use the area under the ROC curve on the predictor data directly.
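A minimal sketch of that univariate check, assuming the `pROC` package (the post doesn't show this code) and the `example_train` data frame used throughout:

```r
# Area under the ROC curve for each raw predictor, taken one at a time
library(pROC)

roc_a <- roc(example_train$Class, example_train$PredictorA)
roc_b <- roc(example_train$Class, example_train$PredictorB)

auc(roc_a)  # reported as 0.61 in the text
auc(roc_b)  # reported as 0.59 in the text
```

By default `pROC::roc` picks the direction of comparison automatically, so each predictor is scored as-is, with no model in between.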

Here are univariate box-plots of each predictor (on the log scale):

There is some mild differentiation between the classes but a significant amount of overlap in the boxes. The area under the ROC curves for predictor A and B are 0.61 and 0.59, respectively. Not so fantastic.

What can we do? Principal component analysis (PCA) is a pre-processing method that does a rotation of the predictor data in a manner that creates new synthetic predictors (i.e. the principal components or PCs). This is conducted in a way where the first component accounts for the majority of the (linear) variation or information in the predictor data. The second component does the same for any information in the data that remains after extracting the first component and so on. For these data, there are two possible components (since there are only two predictors). Using PCA in this manner is typically called feature extraction.

Let's compute the components:

> library(caret)
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One
> pca_pp <- preProcess(example_train[, 1:2],
+                      method = c("center", "scale", "pca"))
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("center",
 "scale", "pca"))

Created from 1009 samples and 2 variables
Pre-processing: centered, scaled, principal component signal extraction 

PCA needed 2 components to capture 95 percent of the variance
> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371

Note that we computed all the necessary information from the training set and then applied these calculations to the test set. What do the test set data look like?

These are the test set predictors simply rotated.

PCA is unsupervised, meaning that the outcome classes are not considered when the calculations are done. Here, the area under the ROC curve for the first component is 0.5 and 0.81 for the second component. These results jibe with the plot above; the first component shows a random mixture of the classes while the second seems to separate the classes well. Box plots of the two components reflect the same thing:

There is much more separation in the second component.
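The same ROC comparison can be repeated on the rotated predictors; again a sketch, assuming `pROC` and the `test_pc` scores computed above:

```r
# AUC for each principal component on the held-out test set
library(pROC)

roc_pc1 <- roc(example_test$Class, test_pc$PC1)
roc_pc2 <- roc(example_test$Class, test_pc$PC2)

auc(roc_pc1)  # about 0.5 per the text: no class signal
auc(roc_pc2)  # about 0.81 per the text: good separation
```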

This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Second, it is the last component that is most important for the classes but accounts for the least variation in the predictors. It is often said that PCA doesn't guarantee that any of the components will be predictive, and this is true. Here, we get lucky and it does produce something good.

However, imagine that there are hundreds of predictors. We may only need to use the first X components to capture the majority of the information in the predictors and, in doing so, discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors; a similar strategy would probably discard the most effective predictor.
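That truncation strategy is exactly what caret's `preProcess` does via its `thresh` argument, which keeps only enough components to reach a variance cutoff (0.95 by default). A sketch, reusing `example_train` from above:

```r
# Keep only the components needed to explain 95% of predictor variance;
# with many predictors this silently drops the later components
library(caret)
pca_95 <- preProcess(example_train[, 1:2],
                     method = c("center", "scale", "pca"),
                     thresh = 0.95)
```

With hundreds of predictors, a component like PC2 here, low-variance but highly predictive, would be exactly the kind of thing this cutoff throws away.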

How does the idea of feature engineering come into play here? Given these two predictors and seeing the first scatterplot shown above, one of the first things that occurs to me is "there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes". The second thing that occurs to me is "take the ratio". What do those data look like?

The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good a job as an unbiased empirical algorithm.
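The engineered feature is a one-liner; a sketch of the ratio and its AUC, again assuming `pROC`:

```r
# Hand-engineered feature: the ratio of the two raw predictors
library(pROC)
ratio_train <- example_train$PredictorA / example_train$PredictorB

auc(roc(example_train$Class, ratio_train))  # about 0.8 per the text
```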

These data are from the cell segmentation experiment of Hill et al., and predictor A is the "surface of a sphere created by rotating the equivalent circle about its diameter" (labeled as EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.

However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose features are just as arcane as EqSphereAreaCh1. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.

Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell's nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.

Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of "known stuff".
