Machine Learning Done Wrong

1. Take default loss function for granted

Many practitioners train and pick the best model using the default loss
function (e.g., squared error). In practice, the off-the-shelf loss function
rarely aligns with the business objective. Take fraud detection as an example.
When trying to detect fraudulent transactions, the business objective is to
minimize the fraud loss. The off-the-shelf loss function of binary classifiers
weighs false positives and false negatives equally. To align with the business
objective, the loss function should not only penalize false negatives more than
false positives, but also penalize each false negative in proportion to the
dollar amount. Also, data sets in fraud detection usually contain highly
imbalanced labels. In these cases, bias the loss function in favor of the rare
class (e.g., through up/down sampling).
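
As a rough sketch of how to encode this in practice, scikit-learn accepts
per-sample weights at fit time. The data below is synthetic, and the weighting
scheme (dollar amount as the weight of each fraud case) is just one plausible
choice:

```python
# Cost-sensitive training sketch: weight each fraud case by its dollar
# amount so a missed $500 fraud costs the model more than a missed $5 fraud.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                     # synthetic features
y = (rng.random(n) < 0.02).astype(int)          # ~2% fraud: imbalanced labels
amounts = rng.uniform(1, 500, size=n)           # dollar value per transaction

sample_weight = np.where(y == 1, amounts, 1.0)  # legit cases keep weight 1

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
```

For plain class imbalance without a dollar amount attached, passing
class_weight="balanced" to the classifier is the simpler built-in knob.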

2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump to
logistic regression because it’s simple. But many also forget that logistic
regression is a linear model, and non-linear interactions among predictors need
to be encoded manually. Returning to fraud detection, high-order interaction
features like "billing address = shipping address and transaction amount <
$50" are required for good model performance. So one should prefer non-linear
models like SVMs with kernels or tree-based classifiers that bake in
higher-order interaction features.
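
A minimal synthetic sketch of the difference, using hypothetical stand-ins for
the fraud features above (the linear model needs the interaction hand-built,
while the tree model can learn it from the raw columns):

```python
# Manual interaction feature for a linear model vs. a tree-based model
# that can discover the interaction on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 2000
same_address = rng.integers(0, 2, size=n)     # billing == shipping address?
amount = rng.uniform(0, 200, size=n)          # transaction amount in dollars
# Toy ground truth that depends on the *interaction*, not either feature alone.
y = ((same_address == 1) & (amount < 50)).astype(int)

X_raw = np.column_stack([same_address, amount])

# Logistic regression: encode the interaction manually.
interaction = same_address * (amount < 50)
X_manual = np.column_stack([X_raw, interaction])
lr = LogisticRegression(max_iter=1000).fit(X_manual, y)

# Gradient-boosted trees: the raw features suffice.
gbt = GradientBoostingClassifier().fit(X_raw, y)
```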

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve
special attention or should be completely ignored. Take the example of revenue
forecasting. If unusual spikes of revenue are observed, it’s probably a good
idea to pay extra attention to them and figure out what caused the spike. But if
the outliers are due to mechanical error, measurement error, or anything else
that’s not generalizable, it’s a good idea to filter out these outliers before
feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance,
AdaBoost might treat those outliers as "hard" cases and put tremendous weight
on them, while a decision tree might simply count each outlier as one
misclassification. If the data set contains a fair amount of outliers, it’s
important to either use a modeling algorithm that is robust to outliers or
filter the outliers out.
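
Here is a short sketch of both options on synthetic data with a few injected
outliers; the 3x-IQR cutoff is illustrative, not a universal rule:

```python
# Two ways to cope with outliers: filter them, or use a robust loss.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(scale=1.0, size=200)
y[:5] += 100                                # inject a few gross outliers
X = x.reshape(-1, 1)

# Option 1: drop points far outside the interquartile range.
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
mask = (y > q1 - 3 * iqr) & (y < q3 + 3 * iqr)
ols_filtered = LinearRegression().fit(X[mask], y[mask])

# Option 2: keep all the data, cap each point's influence with Huber loss.
huber = HuberRegressor().fit(X, y)
```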

4. Use high variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of
its most powerful features is the ability to fit the model with different
kernels. SVM kernels can be thought of as a way to automatically combine
existing features to form a richer feature space. Since this powerful feature
comes almost for free, most practitioners use kernels by default when training
an SVM model. However, when the data has n<<p (number of samples << number
of features) -- common in fields like medicine -- the richer feature
space implies a much higher risk of overfitting the data. In fact, high variance
models should be avoided entirely when n<<p.

5. L1/L2/... regularization without standardization

Applying L1 or L2 penalties to large coefficients is a common way to
regularize linear or logistic regression. However, many practitioners are not
aware of the importance of standardizing features before applying
regularization.

Returning to fraud detection, imagine a linear regression model with a
transaction amount feature. Without regularization, if the unit of transaction
amount is dollars, the fitted coefficient will be around 100 times larger than
the fitted coefficient if the unit were cents. With regularization, since L1/L2
penalize larger coefficients more, the transaction amount will get penalized
more if the unit is dollars. Hence, the regularization is biased and tends to
penalize features measured on smaller scales more heavily. To mitigate the
problem, standardize all the features and put them on an equal footing as a
preprocessing step.
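
A minimal scikit-learn sketch; putting the scaler and the model in one
pipeline also keeps the standardization inside cross-validation folds:

```python
# Standardize before L1/L2 so the penalty treats features on equal footing.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 500
amount_dollars = rng.uniform(1, 1000, size=n)     # large-scale feature
score = rng.normal(size=n)                        # unit-scale feature
y = ((0.002 * amount_dollars + score) > 1).astype(int)
X = np.column_stack([amount_dollars, score])

model = make_pipeline(
    StandardScaler(),                             # zero mean, unit variance
    LogisticRegression(penalty="l2", C=1.0),
)
model.fit(X, y)
```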

6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the
ground truth model is Y=X1+X2. Ideally, if the data is observed with a small
amount of noise, the linear regression solution would recover the ground truth.
However, if X1 and X2 are collinear, then as far as most optimization
algorithms are concerned, Y=2*X1, Y=3*X1-X2, and Y=100*X1-99*X2 are all equally
good. The problem might not be detrimental, as it doesn’t bias the estimation.
However, it does make the problem ill-conditioned and makes the coefficient
weights uninterpretable.
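
A small synthetic demonstration: with two nearly identical predictors, the
fitted coefficients often land far from the ground truth (1, 1), even though
predictions remain accurate:

```python
# Near-collinear predictors: unstable coefficients, fine predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)   # X2 is almost exactly X1
y = x1 + x2 + rng.normal(scale=0.1, size=n)

fit = LinearRegression().fit(np.column_stack([x1, x2]), y)
print(fit.coef_)   # typically large offsetting values, not (1, 1)
```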

7. Interpreting absolute value of coefficients from linear or logistic
regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each
coefficient, many practitioners believe that for linear models, the bigger the
absolute value of the coefficient, the more important the corresponding feature
is. This is rarely true because (a) changing the scale of a variable changes
the absolute value of its coefficient, and (b) if features are multi-collinear,
coefficients can shift from one feature to another. Also, the more features the
data set has, the more likely the features are to be multi-collinear, and the
less reliable it is to interpret feature importance from the coefficients.
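
A quick synthetic check of point (a): expressing the same feature in cents
instead of dollars shrinks its coefficient by a factor of 100 without changing
the model’s predictions at all:

```python
# Coefficient magnitude tracks the feature's unit, not its importance.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 300
amount_dollars = rng.uniform(1, 100, size=n)
y = 2.0 * amount_dollars + rng.normal(size=n)

in_dollars = LinearRegression().fit(amount_dollars.reshape(-1, 1), y)
in_cents = LinearRegression().fit((amount_dollars * 100).reshape(-1, 1), y)
print(in_dollars.coef_)   # roughly [2.0]
print(in_cents.coef_)     # roughly [0.02]: same feature, 100x smaller
```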

So there you go: 7 common mistakes when doing ML in practice. This list is
not meant to be exhaustive but merely to provoke the reader to consider modeling
assumptions that may not be applicable to the data at hand. To achieve the best
model performance, it is important to pick the modeling algorithm that makes the
most fitting assumptions -- not just the one you’re most familiar with.

Original: http://ml.posthaven.com/machine-learning-done-wrong
