ISL - Ch4. Classification

The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.

In this chapter we discuss three of the most widely-used classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors. We discuss more computer-intensive methods in later chapters, such as generalized additive models (Chapter 7), trees, random forests, and boosting (Chapter 8), and support vector machines (Chapter 9).

4.3 Logistic Regression

4.3.1 The Logistic Model

logistic function:

$$p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}} \qquad (4.2)$$

where $p(X) = Pr(Y=1|X)$.

log-odds, or logit:

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X$$

4.3.2 Estimating the Regression Coefficients

maximum likelihood method: we try to find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for p(X), given in (4.2), yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not.

likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i)\prod_{i':y_{i'}=0}\bigl(1-p(x_{i'})\bigr)$$
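
As a concrete (if simplified) illustration, this likelihood can be maximized numerically. The sketch below minimizes the negative log-likelihood for a one-predictor logistic model; the data and variable names are hypothetical stand-ins, not ISL's Default data.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic one-predictor data (hypothetical stand-in for balance/default)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
b0_true, b1_true = -1.0, 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-(b0_true + b1_true * x))))

def neg_log_likelihood(beta):
    """Negative log of the likelihood l(beta0, beta1) above."""
    b0, b1 = beta
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print("beta0_hat, beta1_hat:", res.x)   # should land near (-1, 2)
```

In practice, statistical software carries out this maximum likelihood fit automatically.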

4.3.3 Making Predictions

Once the coefficients have been estimated, it is a simple matter to compute the probability of default for any given credit card balance by using the formula

$$\hat{p}(X) = \frac{e^{\hat{\beta}_0+\hat{\beta}_1X}}{1+e^{\hat{\beta}_0+\hat{\beta}_1X}}$$
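
For example, using coefficient estimates close to those ISL reports for the Default data (roughly $\hat{\beta}_0 \approx -10.65$ and $\hat{\beta}_1 \approx 0.0055$), a short sketch of this calculation looks like:

```python
import numpy as np

def predict_prob(x, b0, b1):
    """p(X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    return np.exp(b0 + b1 * x) / (1 + np.exp(b0 + b1 * x))

# Approximate coefficient estimates reported in ISL for the Default data
b0_hat, b1_hat = -10.6513, 0.0055
print(predict_prob(1000, b0_hat, b1_hat))  # roughly 0.006: low default probability at a $1,000 balance
print(predict_prob(2000, b0_hat, b1_hat))  # roughly 0.59: much higher probability at a $2,000 balance
```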

4.3.4 Multiple Logistic Regression

We can generalize these formulas to p predictors:

$$p(X) = \frac{e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}$$

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X_1+\cdots+\beta_pX_p$$
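
A hedged sketch of fitting such a multiple logistic regression in Python with scikit-learn is shown below; the two predictors are synthetic and only stand in for variables such as balance and income.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with p = 2 predictors (hypothetical)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
logit = -0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# A very large C makes the fit effectively unpenalized, i.e. close to plain maximum likelihood
model = LogisticRegression(C=1e6).fit(X, y)
print("intercept:", model.intercept_)                 # estimate of beta_0
print("coefficients:", model.coef_)                   # estimates of beta_1, ..., beta_p
print("P(Y=1|X):", model.predict_proba(X[:3])[:, 1])  # fitted probabilities for a few observations
```

Note that scikit-learn's LogisticRegression regularizes by default, which is why a large C is used here to approximate the unpenalized maximum likelihood fit described in the text.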

4.4 Linear Discriminant Analysis

Why do we need another method, when we have logistic regression? There are several reasons:

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
  • If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  • Linear discriminant analysis is popular when we have more than two response classes.

4.4.1 Using Bayes’ Theorem for Classification

Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the kth class; this is the probability that a given observation is associated with the kth category of the response variable Y. Let $f_k(x) \equiv Pr(X = x|Y = k)$ denote the density function of X for an observation that comes from the kth class. Then Bayes' theorem states that

$$p_k(x) = Pr(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum^{K}_{l=1}\pi_l f_l(x)}$$

We refer to $p_k(x)$ as the posterior probability that an observation X = x belongs to the kth class. That is, it is the probability that the observation belongs to the kth class, given the predictor value for that observation.
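
To make the formula concrete, here is a small sketch (with made-up priors and Gaussian class densities) that evaluates the posterior $p_k(x)$ at a single point:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical one-predictor example with K = 2 Gaussian classes
priors = np.array([0.7, 0.3])   # pi_k
means = np.array([0.0, 2.0])    # mu_k
sigma = 1.0                     # shared standard deviation

def posterior(x):
    """p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x), with Gaussian class densities f_k."""
    densities = norm.pdf(x, loc=means, scale=sigma)
    return priors * densities / np.sum(priors * densities)

print(posterior(1.5))   # posterior probabilities of the two classes at x = 1.5
```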

4.4.2 Linear Discriminant Analysis for p = 1

The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for $π_k$, $μ_k$, and $σ^2$ into

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

We then assign an observation X = x to the class for which $\hat{\delta}_k(x)$ is largest.

In particular, the following estimates are used:

$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i$$

$$\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2$$

$$\hat{\pi}_k = \frac{n_k}{n}$$

where $n$ is the total number of training observations and $n_k$ is the number of training observations in the kth class.
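
The following sketch implements these plug-in estimates and the resulting classification rule for the p = 1 case; the function names and synthetic data are illustrative only.

```python
import numpy as np

def lda_1d_fit(x, y):
    """Plug-in estimates of pi_k, mu_k and the pooled sigma^2 for the p = 1 case."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])
    mu_hat = np.array([x[y == k].mean() for k in classes])
    sigma2_hat = sum(((x[y == k] - mu_hat[i]) ** 2).sum()
                     for i, k in enumerate(classes)) / (n - K)
    return classes, pi_hat, mu_hat, sigma2_hat

def lda_1d_predict(x0, classes, pi_hat, mu_hat, sigma2_hat):
    """Assign x0 to the class whose discriminant delta_k(x0) is largest."""
    delta = x0 * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)
    return classes[np.argmax(delta)]

# Synthetic one-predictor data: two Gaussian classes sharing the same variance
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-1, 1, 300), rng.normal(1, 1, 200)])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(200, dtype=int)])

params = lda_1d_fit(x, y)
print(lda_1d_predict(0.8, *params))  # x0 = 0.8 lies on class 1's side of the boundary
```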

4.4.3 Linear Discriminant Analysis for p > 1

Once again, we need to estimate the unknown parameters $μ_1,...,μ_K, π_1,...,π_K$, and $Σ$; the formulas are similar to those used in the one-dimensional case. To assign a new observation X = x, LDA plugs these estimates into (4.19) and classifies to the class for which $\hat \delta_k(x)$ is largest.

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k \qquad (4.19)$$

A binary classifier such as this one can make two types of errors:

  • it can incorrectly assign an individual who defaults to the no default category, or
  • it can incorrectly assign an individual who does not default to the default category.

A confusion matrix is a convenient way to display this information.
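
For instance, with hypothetical true labels and predictions, scikit-learn's confusion_matrix produces the 2×2 table of correct and incorrect assignments:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and LDA predictions (1 = default, 0 = no default)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
```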


A credit card company might particularly wish to avoid incorrectly classifying an individual who will default.

In the two-class case, this amounts to assigning an observation to the default class if $Pr(default = Yes|X = x)>0.5$

Instead, we could lower the threshold and assign an observation to the default class if $Pr(default = Yes|X = x)>0.2$.
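
A minimal sketch of how such a threshold change plays out on a vector of (hypothetical) posterior probabilities:

```python
import numpy as np

# Hypothetical posterior probabilities Pr(default = Yes | X = x) for ten individuals
p_default = np.array([0.03, 0.25, 0.08, 0.55, 0.15, 0.40, 0.02, 0.22, 0.10, 0.70])

pred_default_50 = (p_default > 0.5).astype(int)   # default 0.5 threshold
pred_default_20 = (p_default > 0.2).astype(int)   # lowered 0.2 threshold: flags more potential defaulters
print(pred_default_50)
print(pred_default_20)
```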

The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. It is an acronym for receiver operating characteristics.


The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the “no information” classifier; this is what we would expect if student status and credit card balance are not associated with probability of default.
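
A sketch of computing the points on an ROC curve and the area under it (AUC) with scikit-learn, using synthetic labels and scores rather than the Default data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted default probabilities from some classifier
rng = np.random.default_rng(3)
y_true = rng.binomial(1, 0.3, size=200)
scores = np.clip(0.3 * y_true + rng.normal(0.3, 0.2, size=200), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, scores))       # area under the ROC curve
```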

4.4.4 Quadratic Discriminant Analysis (QDA)

Unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the kth class is of the form $X \sim N(\mu_k, \Sigma_k)$, where $\Sigma_k$ is the covariance matrix for the kth class.

Under this assumption, the Bayes classifier assigns an observation X = x to the class for which

$$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log|\Sigma_k| + \log\pi_k$$

is largest.

Bias-variance trade-off: roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial.
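
A brief sketch comparing LDA and QDA in scikit-learn on synthetic data whose two classes deliberately have different covariance matrices (so QDA's extra flexibility should help):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

# Hypothetical two-class data where the classes have different covariance matrices
rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], size=300)
X1 = rng.multivariate_normal([1, 1], [[2.0, -0.6], [-0.6, 0.5]], size=300)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)     # common covariance: linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance: quadratic boundary
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```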

4.5 A Comparison of Classification Methods

  • Both logistic regression and LDA produce linear decision boundaries. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.
  • KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear.
  • QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary.

In sum, when the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
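
To illustrate this comparison, the sketch below fits logistic regression, LDA, QDA, and KNN on synthetic data with a quadratic true boundary; the data-generating process and the choice K = 10 are arbitrary, and relative performance will vary with the setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data with a non-linear (quadratic) true decision boundary
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (K=10)": KNeighborsClassifier(n_neighbors=10),
}
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy = {acc:.3f}")
```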
