ISLR 4.6 Lab: Logistic Regression, LDA, QDA, and KNN

4.6.1 The Stock Market Data

> library(ISLR)
> names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"
[6] "Lag5"      "Volume"    "Today"     "Direction"
> dim(Smarket)
[1] 1250    9

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative. (This is actually pretty interesting.)

> cor(Smarket)
Error in cor(Smarket) : 'x' must be numeric
> cor(Smarket[,-9])
             Year         Lag1         Lag2         Lag3         Lag4
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
               Lag5      Volume        Today
Year    0.029787995  0.53900647  0.030095229
Lag1   -0.005674606  0.04090991 -0.026155045
Lag2   -0.003557949 -0.04338321 -0.010250033
Lag3   -0.018808338 -0.04182369 -0.002447647
Lag4   -0.027083641 -0.04841425 -0.006899527
Lag5    1.000000000 -0.02200231 -0.034860083
Volume -0.022002315  1.00000000  0.014591823
Today  -0.034860083  0.01459182  1.000000000
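
As the book notes, these correlations are essentially zero except for the one between Year and Volume (0.539). A quick plot (mirroring the book, but written with Smarket$Volume so we don't need attach() yet) shows trading volume increasing over time:

> plot(Smarket$Volume)   # daily shares traded (in billions); trends upward across the sample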

4.6.2 Logistic Regression

The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

Analysis:

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

Looking at the individual coefficients:

We can use the coef() function to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fit, such as the p-values for the coefficients.

> coef(glm.fit)
 (Intercept)         Lag1         Lag2         Lag3         Lag4
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938
        Lag5       Volume
 0.010313068  0.135440659
> summary(glm.fit)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004

Predicting outcomes:

The predict() function can be used to predict the probability that the market will go up, given values of the predictors.

The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit.

> attach(Smarket)
> glm.probs=predict(glm.fit,type="response")
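
To get a feel for these fitted values, we can print the first few (as in the book; they all hover close to 0.5 on this data):

> glm.probs[1:10]   # first ten fitted probabilities P(Direction = Up | X)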

In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.

> contrasts(Direction)
     Up
Down  0
Up    1

Next: the contrasts() output shows that R has created a dummy variable with a 1 for Up, so the fitted probabilities correspond to the probability that the market goes up.

The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

> glm.pred=rep("Down",1250)
> glm.pred[glm.probs>.5]="Up"

> table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
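
The diagonal elements of the confusion matrix indicate correct predictions. Following the book, the overall fraction of correct training predictions comes out to about 52%, barely better than chance:

> (507+145)/1250              # correct predictions read off the table above
> mean(glm.pred==Direction)   # same quantity computed directly, ≈0.522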

 

To get a more realistic assessment of the model's accuracy, we create a held-out data set of observations from 2005: we train on the data before 2005 and test on 2005.

> train=(Year<2005)
> Smarket.2005=Smarket[!train,]
> Direction.2005=Direction[!train]

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, via the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set, that is, for the days in 2005.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)

This got a bit muddled, so I'll stop this part here.
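
For completeness, a minimal sketch of how the book finishes this part: predict on the 252 days of 2005, rebuild the class labels, and compare. The book reports a test accuracy of about 48%, worse than random guessing:

> glm.probs=predict(glm.fit,Smarket.2005,type="response")
> glm.pred=rep("Down",252)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction.2005)
> mean(glm.pred==Direction.2005)   # ≈0.48 on the held-out 2005 data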

4.6.3 Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library.

> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293

The LDA output indicates that π̂1 = 0.492 and π̂2 = 0.508; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of μk. They suggest that there is a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.

If −0.642 × Lag1 − 0.514 × Lag2 is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline. The plot() function produces plots of the linear discriminants, obtained by computing −0.642 × Lag1 − 0.514 × Lag2 for each of the training observations.
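
The command itself is simply:

> plot(lda.fit)   # histograms of the discriminant values, one panel per class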

> lda.pred=predict(lda.fit,Smarket.2005)
> names(lda.pred)
[1] "class"     "posterior" "x"        

The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.

> lda.class=lda.pred$class
> table(lda.class,Direction.2005)
         Direction.2005
lda.class Down  Up
     Down   35  35
     Up     76 106
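
From this table, LDA gets (35 + 106)/252, or roughly 56%, of the 2005 test observations right:

> mean(lda.class==Direction.2005)   # ≈0.56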

4.6.4 Quadratic Discriminant Analysis

We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library.

> qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.
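
Mirroring the LDA steps above, here is a short sketch of the QDA evaluation on the 2005 data; the book reports about 60% accuracy, the best result in this lab:

> qda.class=predict(qda.fit,Smarket.2005)$class
> table(qda.class,Direction.2005)
> mean(qda.class==Direction.2005)   # ≈0.60 in the book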

4.6.5 K-Nearest Neighbors

We perform KNN using the knn() function, which is part of the class library.

The function requires four inputs.

1. A matrix containing the predictors associated with the training data, labeled train.X below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
3. A vector containing the class labels for the training observations, labeled train.Direction (train.Y) below.
4. A value for K, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.

Setting a seed:

Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

> library(class)
> train.X=cbind(Lag1,Lag2)[train,]
> test.X=cbind(Lag1,Lag2)[!train,]
> train.Direction=Direction[train]
> set.seed(1)

> knn.pred=knn(train.X,test.X,train.Direction,k=3)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87

> mean(knn.pred==Direction.2005)
[1] 0.5357143

These results are not great: 53.6% accuracy with K = 3, only a little better than chance. Of the methods tried in this lab, QDA performs best on this data.

4.6.6 An Application to Caravan Insurance Data

The Caravan data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.
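
A quick look at the data confirms these figures (counts as reported in the book); attach() here also makes Purchase visible for the code below:

> dim(Caravan)
[1] 5822   86
> attach(Caravan)
> summary(Purchase)
  No  Yes
5474  348
> 348/5822   # ≈0.06, i.e. about 6% purchased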

Limitations of KNN

Because the KNN classifier predicts the class of a given test observation by
identifying the observations that are nearest to it, the scale of the variables
matters. Any variables that are on a large scale will have a much larger
effect on the distance between the observations, and hence on the KNN
classifier, than variables that are on a small scale.

For instance, imagine a data set that contains two variables, salary (measured in dollars) and age (measured in years). As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one; the scale() function does just this. We exclude column 86 because that is the qualitative Purchase variable.

> standardized.X=scale(Caravan[,-86])
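
We can verify the standardization by checking variances before and after; in the book, the first column's variance goes from about 165 to exactly 1:

> var(Caravan[,1])           # ≈165 before scaling
> var(standardized.X[,1])    # 1 after scaling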

We now split the observations into a test set, containing the first 1,000
observations, and a training set, containing the remaining observations.
We fit a KNN model on the training data using K = 1, and evaluate its
performance on the test data.

> test=1:1000
> train.X=standardized.X[-test,]
> test.X=standardized.X[test,]
> train.Y=Purchase[-test]
> test.Y=Purchase[test]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Y,k=1)
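
The book then evaluates this fit. The KNN test error rate is just under 12%, which sounds good until you notice that always predicting No would err only about 6% of the time, since so few customers purchase insurance:

> mean(test.Y!=knn.pred)   # ≈0.118, the overall KNN test error rate
> mean(test.Y!="No")       # ≈0.059, the error rate of always predicting No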
