Statistical Learning Notes (3): Introduction to Supervised Learning (3)

Some further remarks on KNN:

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

Here N is the size of the training set. If k = 1, every training point is its own neighborhood mean, so we effectively store N fitted values. If k > 1, each query point has a neighborhood of k training points, and if the neighborhoods of different query points did not overlap there would be roughly N/k of them, each contributing one fitted mean, hence about N/k effective parameters.
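As a hypothetical worked example (the numbers are mine, not from the text): with N = 200 training points, a 1-nearest-neighbor fit has roughly N/k = 200 effective parameters, a 10-nearest-neighbor fit about 20, and a 50-nearest-neighbor fit about 4, whereas a least-squares fit on, say, p = 3 inputs estimates only 3 or 4 coefficients.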

On how the simulated data used in the classification figure were generated:

We need a method to generate the simulated data. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated a point from N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class.
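A minimal NumPy sketch of this generating process (the function name, seed, and variable names are my own; the original note contains no code):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class(center, n_means=10, n_obs=100):
    """Draw n_means cluster means around `center`, then n_obs observations,
    each from N(m_k, I/5) for a randomly chosen mean m_k."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    picks = rng.integers(0, n_means, size=n_obs)   # each mean chosen with prob. 1/10
    noise = rng.multivariate_normal([0, 0], np.eye(2) / 5, size=n_obs)
    return means[picks] + noise

blue = make_class([1, 0])      # class BLUE
orange = make_class([0, 1])    # class ORANGE
X = np.vstack([blue, orange])
y = np.r_[np.zeros(len(blue)), np.ones(len(orange))]
```

The same mechanism can be reused to draw a test set from the identical distribution.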

Some extensions of KNN and linear regression:

Linear regression and KNN can be improved in the following ways:

1. Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.

2. In high-dimensional spaces the distance kernels are modified to emphasize some variables more than others.

3. Local regression fits linear models by locally weighted least squares, rather than fitting constants locally (a sketch of such a fit follows after this list).

4. Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.

The meaning of a basis expansion can be explained as follows: instead of fitting a linear model in the raw inputs, we fit a linear model in a collection of transformations (basis functions) of the inputs, so the model is linear in the expanded feature space even though it is nonlinear in the original inputs.

A kernel can then be introduced on top of the basis expansion: the kernel value for two points is the inner product of their basis-function representations. Minimizing a regularized loss function over models of this form, the minimizer turns out to be a weighted sum of kernel functions centered at the training points; expanding that solution back in the basis shows that it has the same form as the SVM solution.

The use of a kernel is, first, to guarantee that the features can be (implicitly) mapped to a high-dimensional space and, second, to simplify the calculation, since inner products in that space can be evaluated without ever constructing the mapping explicitly.
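A minimal sketch of the standard equations behind this argument, written in the usual RKHS notation (my reconstruction; the note's own formulas are not reproduced, so symbols such as $h_m$, $K$, and $\mathcal{H}_K$ are assumptions):

$$ f(x) = \sum_{m=1}^{M} \beta_m h_m(x), \qquad K(x, x') = \sum_{m} h_m(x)\, h_m(x'), $$

$$ \min_{f \in \mathcal{H}_K} \; \sum_{i=1}^{N} L\bigl(y_i, f(x_i)\bigr) + \lambda \lVert f \rVert_{\mathcal{H}_K}^{2} \quad\Longrightarrow\quad \hat f(x) = \sum_{i=1}^{N} \hat\alpha_i K(x, x_i), $$

which has the same form as the SVM decision function $f(x) = \sum_i \alpha_i y_i K(x, x_i) + b$.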

5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
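To make points 1–3 concrete, here is a minimal NumPy sketch of kernel-weighted local fitting: a Gaussian kernel whose weights decay smoothly with distance, combined with locally weighted least squares (the function names, bandwidth, and interface are my own choices, not from the text):

```python
import numpy as np

def gaussian_kernel(x0, X, bandwidth=0.5):
    """Smooth weights that decay with distance from the target point x0."""
    d2 = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def local_linear_fit(x0, X, y, bandwidth=0.5):
    """Locally weighted least squares: fit a linear model around x0 and predict there."""
    w = gaussian_kernel(x0, X, bandwidth)
    A = np.hstack([np.ones((len(X), 1)), X])         # intercept column + inputs
    beta = np.linalg.solve(A.T @ (A * w[:, None]),    # weighted normal equations
                           A.T @ (w * y))
    return np.r_[1.0, x0] @ beta
```

Compared with the 0/1 weights of KNN, the Gaussian weights die off smoothly, and fitting a local line rather than a local constant reduces the bias of the fit near the boundary of the data.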

Statistical Decision Theory:

We seek a function f(X) for predicting Y given values of the input vector X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y - f(X))^2.

Our aim is to choose f so as to minimize the expected squared prediction error.

Conditioning on X, it is enough to work pointwise: given a particular value of X, we should choose the constant c that is as close as possible, in mean squared error, to the corresponding label Y.

This pointwise criterion determines c exactly, and the solution is the conditional expectation f(x) = E(Y | X = x), also known as the regression function.

In practice the x above is evaluated at values in the training set.
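The standard derivation behind these statements, reconstructed in the usual notation (the note's own formula images are not shown, so EPE and the conditioning notation are my additions):

$$ \mathrm{EPE}(f) = E\,[Y - f(X)]^2 = E_X\, E_{Y|X}\!\bigl([Y - f(X)]^2 \mid X\bigr), $$

so it suffices to minimize pointwise,

$$ f(x) = \arg\min_{c}\; E_{Y|X}\!\bigl([Y - c]^2 \mid X = x\bigr), \qquad \text{with solution} \qquad f(x) = E(Y \mid X = x). $$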

To put the above theory into practice we can use KNN: for any input x, we estimate the conditional expectation by averaging the responses of its k closest neighbors in the training set. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since the neighborhood average approximates the conditional mean.
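A minimal sketch of this KNN estimate of the conditional mean (plain NumPy; the function name and interface are mine):

```python
import numpy as np

def knn_regress(x0, X, y, k=10):
    """Approximate E(Y | X = x0) by averaging y over the k nearest training points."""
    dist = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(dist)[:k]
    return y[nearest].mean()
```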

Local methods in high dimensions:

KNN breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality.

Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations. Since this corresponds to a fraction r of the unit volume (so r is a proportion, less than 1), the expected edge length will be e_p(r) = r^(1/p). In ten dimensions e_10(0.01) = 0.63 and e_10(0.1) = 0.80, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local.” Reducing r dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
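A quick numerical check of these edge-length numbers (a small illustration of my own, not from the text):

```python
def edge_length(r, p):
    """Expected edge length of a sub-cube capturing a fraction r of a unit p-cube."""
    return r ** (1.0 / p)

print(edge_length(0.01, 10))  # ~0.63
print(edge_length(0.10, 10))  # ~0.79, i.e. about 80% of each input's range
```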

Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points (training samples) uniformly distributed in a p-dimensional unit ball centered at the origin, and suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression d(p, N) = (1 - (1/2)^(1/N))^(1/p).

A more complicated expression exists for the mean distance to the closest point. For N = 500 and p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. This presents a problem because prediction is much more difficult near the edges of the training sample: for query points near the center of the training sample it is easy to find enough neighbors, but for points near the boundary it is not, and the estimate must extrapolate rather than interpolate.
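A small numerical check of this median-distance formula, together with a Monte Carlo confirmation (entirely my own illustration; it uses the fact that the radius of a uniform point in the unit p-ball is distributed as U^(1/p) for U ~ Uniform(0, 1)):

```python
import numpy as np

def median_nearest(p, N):
    """Median distance from the origin to the closest of N points
    drawn uniformly from the p-dimensional unit ball."""
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

print(median_nearest(10, 500))          # ~0.52

# Monte Carlo confirmation: simulate the nearest radius among 500 uniform points.
rng = np.random.default_rng(0)
sims = [np.min(rng.random(500) ** (1 / 10)) for _ in range(2000)]
print(np.median(sims))                  # also ~0.52
```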
