ISLR 5.3 Lab: Cross-Validation and the Bootstrap

5.3.1 The Validation Set Approach

We begin by using the sample() function to split the set of observations into two halves, by selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set.

> library(ISLR)
> set.seed(1)
> train=sample(392,196)

We then use the subset option in lm() to fit a linear regression using only the observations corresponding to the training set.

> lm.fit=lm(mpg~horsepower,data=Auto,subset=train)

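We now use the predict() function to estimate the response for all 392 observations, and we use the mean() function to calculate the MSE of the 196 observations in the validation set. Note that the -train index below selects only the observations that are not in the training set.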

> attach(Auto)
> mean((mpg-predict(lm.fit,Auto))[-train]^2)
[1] 26.14142
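
The computation above relies on attach() to make mpg visible. A minimal sketch of the same validation-set estimate without attach() follows. It uses the built-in mtcars data (columns mpg and hp) as a stand-in for Auto so that it runs without the ISLR package; the names cars_train, cars_fit, and holdout are illustrative assumptions, not part of the lab.

```r
# Validation set approach on the built-in mtcars data (a stand-in for Auto).
set.seed(1)
n <- nrow(mtcars)                      # 32 observations
cars_train <- sample(n, n / 2)         # indices of the training half

# Fit on the training half only.
cars_fit <- lm(mpg ~ hp, data = mtcars, subset = cars_train)

# Predict only the held-out rows and average their squared errors.
holdout <- mtcars[-cars_train, ]
val_mse <- mean((holdout$mpg - predict(cars_fit, newdata = holdout))^2)

# Same quantity computed the way the lab does it:
# predict every row, then drop the training rows before averaging.
val_mse_lab <- mean((mtcars$mpg - predict(cars_fit, mtcars))[-cars_train]^2)
print(c(val_mse, val_mse_lab))
```

Both expressions give the same number; the first simply avoids predicting rows that are discarded afterwards.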

We can use the poly() function to estimate the test error for the quadratic and cubic regressions.

> lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
> mean((mpg-predict(lm.fit2,Auto))[-train]^2)
[1] 19.82259
> lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
> mean((mpg-predict(lm.fit3,Auto))[-train]^2)
[1] 19.78252
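
These error rates depend on which observations happen to land in the training half. The sketch below repeats the split under several seeds to make that variability visible. It again assumes mtcars as a stand-in for Auto, and validation_mse() is a hypothetical helper written for this illustration.

```r
# Validation MSE for one random half/half split and one polynomial degree.
validation_mse <- function(degree, seed, data = mtcars) {
  set.seed(seed)
  n <- nrow(data)
  train <- sample(n, n / 2)
  fit <- lm(mpg ~ poly(hp, degree), data = data[train, ])
  held_out <- data[-train, ]
  mean((held_out$mpg - predict(fit, newdata = held_out))^2)
}

# Rows: seeds 1 to 4; columns: polynomial degrees 1 to 3.
mse_by_split <- sapply(1:3, function(d) sapply(1:4, function(s) validation_mse(d, s)))
dimnames(mse_by_split) <- list(paste0("seed", 1:4), paste0("degree", 1:3))
print(round(mse_by_split, 2))
```

Each row is a different random split of the same data, so differences down a column reflect only the choice of training set.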

5.3.2 Leave-One-Out Cross-Validation

In this lab, we will perform linear regression using the glm() function rather than the lm() function because the former can be used together with cv.glm(). When no family argument is passed, glm() fits the same linear regression as lm(). The cv.glm() function is part of the boot library.

> library (boot)
> glm.fit=glm(mpg~horsepower,data=Auto)
> cv.err=cv.glm(Auto,glm.fit)
> cv.err$delta
[1] 24.23151 24.23114

Our cross-validation estimate for the test error is approximately 24.23.
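
The first component of delta is the raw cross-validation estimate, and the second is a bias-adjusted version. To show what the first number measures, here is a rough sketch that computes the LOOCV error three ways: with an explicit leave-one-out loop, with the leverage shortcut that holds for least squares fits, and with cv.glm(). It assumes the built-in mtcars data in place of Auto, and the variable names are illustrative.

```r
library(boot)  # provides cv.glm()

n <- nrow(mtcars)

# 1. Explicit leave-one-out loop: refit without row i, then predict row i.
loo_sq_err <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ hp, data = mtcars[-i, ])
  loo_sq_err[i] <- (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}
loocv_loop <- mean(loo_sq_err)

# 2. Least squares shortcut: divide each residual by (1 - leverage); one fit only.
full_fit <- lm(mpg ~ hp, data = mtcars)
loocv_shortcut <- mean((residuals(full_fit) / (1 - hatvalues(full_fit)))^2)

# 3. cv.glm() with its default K = n; delta[1] is the raw CV estimate.
loocv_cvglm <- cv.glm(mtcars, glm(mpg ~ hp, data = mtcars))$delta[1]

print(c(loop = loocv_loop, shortcut = loocv_shortcut, cv.glm = loocv_cvglm))
```

The three values agree up to floating-point error, which is why LOOCV gives the same answer on every run, unlike the validation set approach.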

To automate the process, we use the for() function to initiate a for loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 5, computes the associated cross-validation error, and stores it in the ith element of the vector cv.error. We begin by initializing the vector.

> cv.error=rep(0,5)
> for (i in 1:5){
+ glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
+ cv.error[i]=cv.glm(Auto,glm.fit)$delta[1]
+ }
> cv.error
[1] 24.23151 19.24821 19.33498 19.42443 19.03321

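The estimated test MSE drops sharply between the linear and quadratic fits (from 24.23 to 19.25) and shows no clear improvement for higher-order polynomials. This illustrates how cross-validation can be used to select a tuning parameter, here the polynomial degree.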
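
As a hedged illustration of using the CV errors for selection, the sketch below computes them for degrees 1 to 5 and picks the degree with the smallest value. It assumes mtcars as a stand-in for Auto. Taking the plain minimum is only one simple rule, chosen here for brevity; when several degrees have nearly equal error, the lowest-order one is often preferred.

```r
library(boot)  # provides cv.glm()

# LOOCV error for polynomial fits of degree 1 to 5 on mtcars (stand-in for Auto).
degrees <- 1:5
cv_error <- sapply(degrees, function(d) {
  fit <- glm(mpg ~ poly(hp, d), data = mtcars)
  cv.glm(mtcars, fit)$delta[1]
})

# One simple selection rule: the degree with the smallest estimated test MSE.
best_degree <- degrees[which.min(cv_error)]
print(round(cv_error, 2))
print(best_degree)
```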

5.3.3 k-Fold Cross-Validation

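The cv.glm() function can also be used to implement k-fold CV. Below we use K = 10, a common choice for k, and compute the CV errors for polynomial fits of orders one to ten. Because the folds are drawn at random, the results vary between runs unless a seed is set with set.seed() first.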

> cv.error.10=rep(0,10)
> for (i in 1:10){
+ glm.fit=glm(mpg~poly(horsepower,i),data=Auto)
+ cv.error.10[i]=cv.glm(Auto,glm.fit,K=10)$delta[1]
+ }
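
To make the K=10 argument less of a black box, here is a minimal hand-written k-fold sketch. It is not how cv.glm() is implemented internally, only an illustration of the idea: assign each observation to one of k folds at random, fit on the other folds, and average the squared errors on the held-out fold. The helper kfold_mse() and the use of mtcars in place of Auto are assumptions made for this example.

```r
# Hand-written k-fold cross-validation for a polynomial regression of mpg on hp.
kfold_mse <- function(data, degree, k) {
  n <- nrow(data)
  # Random fold labels 1..k, as evenly sized as n allows.
  fold <- sample(rep(seq_len(k), length.out = n))
  sq_err <- numeric(n)
  for (j in seq_len(k)) {
    held_out <- fold == j
    fit <- lm(mpg ~ poly(hp, degree), data = data[!held_out, ])
    sq_err[held_out] <- (data$mpg[held_out] - predict(fit, newdata = data[held_out, ]))^2
  }
  # Every observation is held out exactly once, so this is the k-fold MSE.
  mean(sq_err)
}

set.seed(17)
cv_error_10 <- sapply(1:3, function(d) kfold_mse(mtcars, d, k = 10))
print(round(cv_error_10, 2))
```

Only k models are fitted per degree instead of n, which is the main computational advantage over LOOCV for models that lack a leave-one-out shortcut.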

5.3.4 The Bootstrap

Estimating the Accuracy of a Statistic of Interest

We first create a function, alpha.fn(), which takes as input the (X, Y) data as well as a vector indicating which observations should be used to estimate α. The function then outputs the estimate for α based on the selected observations.

> alpha.fn=function(data,index){
+ X=data$X[index]
+ Y=data$Y[index]
+ return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
+ }

The following command tells R to estimate α using all 100 observations of the Portfolio data set.

> alpha.fn(Portfolio,1:100)

The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing α̂ based on the new data set.

> alpha.fn(Portfolio,sample(100,100,replace=T))

We can implement a bootstrap analysis by performing this command many times, recording all of the corresponding estimates for α, and computing the resulting standard deviation. However, the boot() function automates this approach. Below we produce R = 1,000 bootstrap estimates for α.

> boot(Portfolio,alpha.fn,R=1000)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)

Bootstrap Statistics :
     original        bias    std. error
t1* 0.5758321 -7.315422e-05  0.08861826

The final output shows that using the original data, α̂ = 0.5758, and that the bootstrap estimate for SE(α̂) is 0.0886.
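
The manual procedure that boot() automates (resample, re-estimate, take the standard deviation) can be sketched in a few lines. The block below is an illustration under stated assumptions: it restates the estimator as alpha_fn() so that it is self-contained, and it uses simulated (X, Y) pairs in a data frame called returns rather than the Portfolio data, so its numbers will not match the output above.

```r
# Same estimator as alpha.fn() in the lab, restated so the block is self-contained.
alpha_fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

# Simulated stand-in for the Portfolio data: 100 correlated (X, Y) pairs.
set.seed(1)
X <- rnorm(100)
returns <- data.frame(X = X, Y = 0.5 * X + rnorm(100))

# Resample the row indices with replacement 1,000 times, re-estimating alpha each time.
n_boot <- 1000
alpha_star <- replicate(n_boot, alpha_fn(returns, sample(100, 100, replace = TRUE)))

# The bootstrap standard error is the standard deviation of the resampled estimates.
alpha_hat <- alpha_fn(returns, 1:100)
se_boot <- sd(alpha_star)
print(c(alpha_hat = alpha_hat, se_boot = se_boot))
```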

Estimating the Accuracy of a Linear Regression Model

We first create a simple function, boot.fn(), which takes in the Auto data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model. We then apply this function to the full set of 392 observations in order to compute the estimates of β0 and β1 on the entire data set using the usual linear regression coefficient estimate formulas from Chapter 3.

> boot.fn=function(data,index)
+ return(coef(lm(mpg~horsepower,data=data,subset=index)))
> boot.fn(Auto,1:392)
(Intercept)  horsepower
 39.9358610  -0.1578447 
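
A function of this (data, index) form can also be called by hand with a resampled index vector, which is roughly what boot() does on each replicate. The sketch below assumes mtcars as a stand-in for Auto, and coef_fn() is an illustrative name for the same kind of function as boot.fn().

```r
# Statistic in the (data, index) form that boot() expects; mtcars stands in for Auto.
coef_fn <- function(data, index) {
  coef(lm(mpg ~ hp, data = data, subset = index))
}

n <- nrow(mtcars)

# Estimates on the original sample: every row used exactly once.
full_coef <- coef_fn(mtcars, 1:n)

# One bootstrap replicate: n row indices drawn with replacement.
set.seed(1)
resampled_coef <- coef_fn(mtcars, sample(n, n, replace = TRUE))

print(rbind(full = full_coef, resampled = resampled_coef))
```

The spread of the resampled row across many such draws is what the bootstrap standard error summarizes.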

Next, we use the boot() function to compute the standard errors of 1,000 bootstrap estimates for the intercept and slope terms.

> boot(Auto,boot.fn,1000)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Auto, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0269563085 0.859851825
t2* -0.1578447 -0.0002906457 0.007402954

Below we compute the bootstrap standard error estimates that result from fitting the quadratic model to the data. These can be compared with the formula-based standard errors that summary() reports for the same lm() fit.

> boot.fn=function(data,index)
+ coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
> set.seed(1)
> boot(Auto,boot.fn,1000)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Auto, statistic = boot.fn, R = 1000)

Bootstrap Statistics :
        original        bias     std. error
t1* 56.900099702  6.098115e-03 2.0944855842
t2* -0.466189630 -1.777108e-04 0.0334123802
t3*  0.001230536  1.324315e-06 0.0001208339
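
For a side-by-side view of bootstrap and formula-based standard errors, a sketch along the following lines may help. It assumes mtcars in place of Auto, so the numbers differ from the output above, and quad_fn() is an illustrative name. The std. error column printed by boot() corresponds to the column-wise standard deviation of the replicates stored in the t component of the returned object.

```r
library(boot)  # provides boot()

# Quadratic fit in the (data, index) form expected by boot(); mtcars stands in for Auto.
quad_fn <- function(data, index) {
  coef(lm(mpg ~ hp + I(hp^2), data = data, subset = index))
}

set.seed(1)
quad_boot <- boot(mtcars, quad_fn, R = 1000)

# Bootstrap SE: standard deviation of each column of replicated coefficients.
se_bootstrap <- apply(quad_boot$t, 2, sd)

# Formula-based SE from the usual least squares theory.
se_formula <- summary(lm(mpg ~ hp + I(hp^2), data = mtcars))$coefficients[, "Std. Error"]

print(cbind(bootstrap = se_bootstrap, formula = se_formula))
```

The two columns need not agree: the formula-based values rely on the linear model's assumptions, while the bootstrap values do not.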
Time: 2024-07-28 21:38:51
