11 Clever Methods of Overfitting and how to avoid them

Overfitting is the bane of Data Science in the age of Big Data. John Langford reviews "clever" methods of overfitting, including traditional overfitting, parameter tweaking, brittle measures, bad statistics, and human-loop overfitting, and gives suggestions and directions for avoiding overfitting.

By John Langford (Microsoft, Hunch.net)

(Gregory Piatetsky: I recently came across this classic 2005 post by John Langford, Clever Methods of Overfitting, which addresses one of the most critical issues in Data Science. The problem of overfitting is a major bane of big data, and the issues described below are perhaps even more relevant than before. I have made several of these mistakes myself in the past. John agreed to repost it in KDnuggets, so enjoy, and please comment if you find new methods.)

“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: over-representing performance on particular datasets and (implicitly) over-representing performance of a method on future datasets.

We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid.

1. Traditional overfitting: Train a complex predictor on too-few examples.

Remedy:

  1. Hold out pristine examples for testing.
  2. Use a simpler predictor.
  3. Get more training examples.
  4. Integrate over many predictors.
  5. Reject papers which do this.
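
The first remedy above, holding out pristine examples, in a minimal sketch (scikit-learn is assumed; the dataset and variable names are purely illustrative):

```python
# A minimal sketch, assuming scikit-learn; dataset and names are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out pristine examples up front and never look at them while developing.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# Report test performance exactly once, after all modeling decisions are frozen.
print("held-out accuracy:", model.score(X_test, y_test))
```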

2. Parameter tweak overfitting: Use a learning algorithm with many parameters. Choose the parameters based on the test set performance.

For example, choosing the features so as to optimize test set performance can achieve this.

Remedy: same as above
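
A minimal sketch of the remedy (scikit-learn assumed; the parameter grid is illustrative): choose parameters by cross-validation on the development data only, and let the test set see exactly one model, exactly once.

```python
# A minimal sketch, assuming scikit-learn; the parameter grid is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Tune parameters by cross-validation on the development data only...
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_dev, y_dev)

# ...and let the test set see exactly one model, exactly once.
print("selected parameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```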

3. Brittle measure: Use a measure of performance which is especially brittle to overfitting.

Examples: “entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach.

Remedy: Prefer less brittle measures of performance.
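
One concrete way to see the brittleness, sketched below with made-up numbers: an unbounded loss such as log-loss (the loss underlying entropy and mutual information) lets a single confidently wrong prediction dominate the estimate, while a bounded measure like classification accuracy barely moves.

```python
# A minimal sketch with made-up numbers: one confident, wrong prediction.
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1] * 99 + [0])
p_hat = np.array([0.9] * 99 + [1.0 - 1e-12])  # the last prediction is confidently wrong

# Accuracy (bounded) barely notices; log-loss (unbounded) is dominated by that one example.
print("accuracy:", accuracy_score(y_true, (p_hat > 0.5).astype(int)))
print("log-loss:", log_loss(y_true, p_hat))
```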

4. Bad statistics: Misuse statistics to overstate confidences.

One common example is pretending that cross-validation performance is drawn i.i.d. from a Gaussian, then using standard confidence intervals; cross-validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence.

Remedy: Don’t do this. Reject papers which do this.
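
A small simulation sketch of the first mistake (scikit-learn assumed; the setup is illustrative and the exact numbers will vary): the standard error computed from the K fold scores, as if they were independent draws, tends to understate how much the cross-validation estimate really varies across independent datasets of the same size.

```python
# A minimal sketch, assuming scikit-learn; setup and sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One fixed "population"; each run samples an independent dataset of 200 from it.
X_pool, y_pool = make_classification(n_samples=20000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

naive_se, cv_means = [], []
for _ in range(50):
    idx = rng.choice(len(y_pool), size=200, replace=False)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_pool[idx], y_pool[idx], cv=10)
    cv_means.append(scores.mean())
    # The "i.i.d. gaussian" recipe: treat the 10 fold scores as independent draws.
    naive_se.append(scores.std(ddof=1) / np.sqrt(len(scores)))

print("average naive standard error:", np.mean(naive_se))
print("actual spread of the CV estimate across datasets:", np.std(cv_means, ddof=1))
```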

5. Choice of measure: Choose the best of accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc., for your method. For bonus points, use ambiguous graphs.

This is fairly common and tempting.

Remedy: Use canonical performance measures, for example the performance measure directly motivated by the problem.
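
A short sketch of how much room this leaves (scikit-learn assumed; the labels and scores are made up): one and the same set of predictions can be summarized by several measures, each telling a different story, which is exactly why the measure should be fixed by the problem rather than by the result.

```python
# A minimal sketch with made-up labels and scores; one prediction set, several measures.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced labels
p_score = np.array([.1, .2, .2, .3, .4, .6, .6, .7, .55, .9])
y_pred = (p_score >= 0.5).astype(int)

# The same predictions look quite different depending on which number you quote.
print("accuracy:", accuracy_score(y_true, y_pred))
print("error rate:", 1 - accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, p_score))
```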

6. Incomplete Prediction: Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction.

Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed.

Remedy: Reject papers which do this.

7. Human-loop overfitting: Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction.

This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice.

Remedy: Make sure test examples are not available to the human.
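
A sketch of the remedy for one common form of this (scikit-learn assumed; the clustering step stands in for any data-driven choice, human or automated): keep every fitted step inside a pipeline trained only on the training examples, so nothing about the test examples can leak into the choices.

```python
# A minimal sketch, assuming scikit-learn; the clustering step stands in for any
# data-driven choice (human or automated) that must not see the test examples.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrong: fitting the clustering on train + test leaks test-set structure into the model.
# Right: every fitted step lives inside a pipeline, so it only ever sees X_train.
model = make_pipeline(KMeans(n_clusters=5, n_init=10, random_state=0),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```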

8. Data set selection: Choose to report results on some subset of datasets where your algorithm performs well.

The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems. Data set selection subverts this and is very difficult to detect.

Remedy: Use comparisons on standard datasets. Select datasets without using the test set. Good contest performance can't be faked this way.

9. Reprobleming: Alter the problem so that your performance improves.

Example: Take a time series dataset and use cross validation. Or, ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset.

Remedy: Discount papers which do this. Make sure problem specifications are clear.
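
The time-series example from above, sketched with scikit-learn (synthetic indices, purely illustrative): shuffled K-fold cross-validation quietly trains on the future to predict the past, which is a different and easier problem; an ordered split keeps the original problem intact.

```python
# A minimal sketch, assuming scikit-learn; the "data" are just time-ordered indices.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

t = np.arange(100)  # 100 time-ordered examples

# Shuffled folds: count how many folds train on points after the test window starts.
leaks = [(train_idx > test_idx.min()).any()
         for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(t)]
print("shuffled folds that train on the future:", sum(leaks), "of 5")

# TimeSeriesSplit keeps the original problem: always predict strictly later points.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(t):
    assert train_idx.max() < test_idx.min()
```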

10. Old datasets: Create an algorithm for the purpose of improving performance on old datasets.

After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade…

Remedy: Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it.

11. Overfitting by review: 10 people submit a paper to a conference. The one with the best result is accepted.

This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting.

Remedy:

  1. Be more pessimistic about confidence statements by papers at high-rejection-rate conferences.
  2. Some people have advocated allowing the publishing of methods with poor performance. (I have doubts this would work.)
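
A small simulation sketch of the selection effect (the numbers are made up): if ten submissions describe methods of identical quality and each reports a noisy test-set estimate, accepting only the best one systematically inflates the published number.

```python
# A minimal sketch with made-up numbers: ten equally good methods, noisy test estimates.
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.80
n_test = 1000

# Each submission measures the same true accuracy on its own test draw;
# only the best of the ten is accepted and reported.
estimates = rng.binomial(n_test, true_accuracy, size=(10000, 10)) / n_test
print("mean accepted (best-of-10) accuracy:", estimates.max(axis=1).mean())
print("true accuracy:", true_accuracy)
```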

I have personally observed all of these methods in action, and there are doubtless others.

Selected comments on John's post:

Negative results:

  • Aleks Jakulin: How about an index of negative results in machine learning? There's a Journal of Negative Results in other domains (Ecology & Evolutionary Biology, Biomedicine), and there is the Journal of Articles in Support of the Null Hypothesis. A section on negative results in machine learning conferences? This kind of information is very useful in preventing people from taking pathways that lead nowhere: if one wants to classify an algorithm into good/bad, one certainly benefits from unexpectedly bad examples too, not just unexpectedly good examples.
  • John Langford: I visited the workshop on negative results at NIPS 2002. My impression was that it did not work well. The difficulty with negative results in machine learning is that they are too easy. For example, there are a plethora of ways to say that “learning is impossible (in the worst case)”. On the applied side, it’s still common for learning algorithms to not work on simple-seeming problems. In this situation, positive results (this works) are generally more valuable than negative results (this doesn’t work).

Brittle measures

  • What do you mean by “brittle”? Why is mutual information brittle?
  • John Langford: What I mean by brittle: Suppose you have a box which takes some feature values as input and predicts some probability of label 1 as output. You are not allowed to open this box or determine how it works other than by this process of giving it inputs and observing outputs.

    Let x be an input.
    Let y be an output.
    Assume (x,y) are drawn from a fixed but unknown distribution D.
    Let p(x) be a prediction.

    For classification error I(|y – p(x)| ≥ 0.5), you can prove a theorem of the rough form:
    for all D, with high probability over the draw of m examples independently from D, the expected classification error rate of the box with respect to D is bounded by a function of the observations. 

    What I mean by “brittle” is that no statement of this sort can be made for any unbounded loss (including log-loss which is integral to mutual information and entropy). You can of course open up the box and analyze its structure or make extra assumptions about D to get a similar but inherently more limited analysis.

    The situation with leave-one-out cross validation is not so bad, but it’s still pretty bad. In particular, there exists a very simple learning algorithm/problem pair with the property that the leave-one-out estimate has the variance and deviations of a single coin flip. Yoshua Bengio and Yves Grandvalet in fact proved that there is no unbiased estimator of variance. The paper that I pointed to above shows that for K-fold cross validation on m examples, all moments of the deviations might only be as good as on a test set of size m/K.

    I’m not sure what a ‘valid summary’ is, but leave-one-out cross validation cannot provide results I trust, because I know how to break it.

    I have personally observed people using leave-one-out cross validation with feature selection to quickly achieve a severe overfit.
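
A sketch of that last observation (scikit-learn assumed; the sizes are illustrative): the labels below are pure noise, so honest accuracy is about 50%, yet selecting features on the full dataset before running leave-one-out cross-validation reports performance far above chance. Doing the selection inside each fold removes the illusion.

```python
# A minimal sketch, assuming scikit-learn; labels are random, so real accuracy is ~0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))   # many features, few examples
y = rng.integers(0, 2, size=50)   # random labels: there is nothing to learn

clf = LogisticRegression(max_iter=1000)

# Wrong: feature selection sees every label (including each held-out one) before LOO-CV.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
print("leaky LOO accuracy:", cross_val_score(clf, X_leaky, y, cv=LeaveOneOut()).mean())

# Right: selection happens inside each fold, so the held-out example never leaks.
honest = make_pipeline(SelectKBest(f_classif, k=20), clf)
print("honest LOO accuracy:", cross_val_score(honest, X, y, cv=LeaveOneOut()).mean())
```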
