Some further statements on KNN:
It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k, which is generally bigger than p and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.
Here N is the size of the training set. For example, if k = 1, each training point is its own neighborhood mean, so we effectively store N values. If k > 1, each input point has a neighborhood containing k training points, and if the neighborhoods belonging to different input points do not overlap, we store only N/k mean values.
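The N/k count can be checked numerically. The sketch below assumes that "effective number of parameters" means the trace of the linear smoother matrix S that maps training responses to fitted values (a common definition, not stated in these notes). The function name `knn_smoother_matrix` and the random inputs are illustrative only.

```python
import numpy as np

def knn_smoother_matrix(X, k):
    """Return the N x N matrix S such that y_hat = S @ y for k-NN averaging
    evaluated at the training points themselves."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between training points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points in each row; with distinct
    # points, each point is its own nearest neighbor (distance 0).
    nn = np.argsort(d2, axis=1)[:, :k]
    S = np.zeros((n, n))
    S[np.arange(n)[:, None], nn] = 1.0 / k
    return S

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # hypothetical training inputs
for k in (1, 5, 20):
    S = knn_smoother_matrix(X, k)
    # The diagonal holds 1/k in every row, so the trace is N/k.
    print(k, np.trace(S), X.shape[0] / k)
```

Under this definition the trace equals N/k even when neighborhoods overlap, because every training point always belongs to its own neighborhood.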
To generate the graph for this example, we need a method to generate the data set. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE. Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated an observation from N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class.
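The generating procedure described above can be sketched with NumPy as follows. The function name `simulate_class`, the seed, and the 0/1 coding of the two classes are assumptions made for illustration.

```python
import numpy as np

def simulate_class(center, n_obs, rng, n_means=10):
    # Draw the cluster means m_k from N(center, I).
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    # For every observation pick one m_k uniformly (probability 1/10 each).
    picks = rng.integers(0, n_means, size=n_obs)
    # Add noise drawn from N(0, I/5), i.e. sample from N(m_k, I/5).
    noise = rng.multivariate_normal(np.zeros(2), np.eye(2) / 5, size=n_obs)
    return means[picks] + noise

rng = np.random.default_rng(0)
blue = simulate_class([1.0, 0.0], 100, rng)
orange = simulate_class([0.0, 1.0], 100, rng)

X = np.vstack([blue, orange])
y = np.r_[np.zeros(100), np.ones(100)]  # assumed coding: 0 = BLUE, 1 = ORANGE
print(X.shape, y.shape)
```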
Some expansion on KNN:
Linear regression and KNN can be improved in the following ways:
1. Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.
2. In high-dimensional spaces the distance kernels are modified to emphasize some variable more than others.
3. Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
4. Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
The meaning of basis expansion can be explained as follows:
Then, to introduce a kernel based on the basis expansion:
To minimize the function
We get
To expand the basis
This has a form similar to that of an SVM.
A kernel serves two purposes: first, it guarantees that the features can be mapped to a high-dimensional space; second, it simplifies the calculation.
5. Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
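Items 1 and 3 of the list can be illustrated in one dimension. The sketch below compares the 0/1 weights of k-nearest neighbors with smoothly decaying kernel weights, and a locally constant fit with a locally weighted linear fit. The Gaussian kernel, the bandwidth values, and all function names are assumptions chosen for illustration, not something the notes prescribe.

```python
import numpy as np

def knn_average(x0, x, y, k):
    # 0/1 weights: the k closest training points get weight 1/k, the rest 0.
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

def kernel_average(x0, x, y, bandwidth):
    # Gaussian kernel weights decay smoothly with distance from x0.
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, bandwidth):
    # Locally weighted least squares: fit a line around x0, evaluate it at x0.
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    B = np.column_stack([np.ones_like(x), x - x0])
    WB = B * w[:, None]
    beta = np.linalg.solve(B.T @ WB, WB.T @ y)
    return beta[0]  # intercept equals the fitted value at x0

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(4 * x) + rng.normal(0, 0.2, 200)  # hypothetical toy data

x0 = 0.5
print(knn_average(x0, x, y, k=15))
print(kernel_average(x0, x, y, bandwidth=0.05))
print(local_linear(x0, x, y, bandwidth=0.05))
```

The local linear fit reproduces an exactly linear trend without bias, which a locally constant average cannot do near the ends of the data.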
Statistical Decision Theory:
We seek a function f(X) for predicting Y given values of the input vector X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: L(Y, f(X)) = (Y - f(X))^2.
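Our aim is to choose f to minimize the expected prediction error:
EPE(f) = E(Y - f(X))^2 = E_X E_{Y|X}([Y - f(X)]^2 | X)
It suffices to minimize this pointwise. For a given X = x, we should choose the constant c that is closest, in expected squared error, to Y:
f(x) = argmin_c E_{Y|X}([Y - c]^2 | X = x)
The above equation gives us the exact c, and the solution is the conditional expectation
f(x) = E(Y | X = x)
Here x is a given value of the input X; in practice the expectation is estimated from the training set.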
To apply the above theory in practice, we can use KNN: for any input x, we estimate the conditional mean by averaging the responses of its closest k neighbors in the training set. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since the sample average approximates the expectation.
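A small simulation can make this plausible in one dimension. The sketch below assumes a toy model Y = sin(2*pi*X) + noise with X uniform on [0, 1], so that E(Y | X = x) is known exactly. The model, sample size, k, and the function name `knn_regress` are all illustrative assumptions.

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k):
    # Average the responses of the k training points closest to x0.
    idx = np.argpartition(np.abs(x_train - x0), k - 1)[:k]
    return y_train[idx].mean()

rng = np.random.default_rng(0)
n, k, x0 = 20000, 200, 0.25

x_train = rng.uniform(0, 1, n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, n)

estimate = knn_regress(x0, x_train, y_train, k)
truth = np.sin(2 * np.pi * x0)  # E(Y | X = x0) under the assumed model
print(estimate, truth)
```

With many training points and k/N small, the neighborhood is narrow and the average over it lands close to the true conditional mean.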
Local methods in high dimensions:
KNN breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality.
Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations. Since this corresponds to a fraction r of the unit volume (r is a proportion and is less than 1), the expected edge length will be e_p(r) = r^(1/p). In ten dimensions e_10(0.01) = 0.63 and e_10(0.1) = 0.80, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local.” Reducing r dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
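The edge-length formula e_p(r) = r^(1/p) is easy to tabulate. The sketch below uses a hypothetical helper `edge_length` and prints values for a few dimensions; for p = 10 it evaluates to roughly 0.631 and 0.794 for r = 0.01 and r = 0.1.

```python
def edge_length(r, p):
    # Edge of the sub-cube that holds a fraction r of the unit hypercube's volume.
    return r ** (1.0 / p)

for p in (1, 2, 3, 10, 100):
    print(p, round(edge_length(0.01, p), 3), round(edge_length(0.1, p), 3))
```

As p grows, the required edge length approaches 1 for any fixed r, which is why the neighborhood stops being local.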
Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points (training samples) uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression
d(p, N) = (1 - (1/2)^(1/N))^(1/p)
A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. For input points near the center of the training sample it is easy to find enough neighbors, but for those near the boundary it is not.
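The value 0.52 can be reproduced from the standard closed form for this setup, d(p, N) = (1 - (1/2)^(1/N))^(1/p), and checked by simulation. The sketch below relies on the fact that the distance from the origin of a point uniform in the unit p-ball is distributed as U^(1/p) with U uniform on [0, 1]. The function names and the number of replications are illustrative assumptions.

```python
import numpy as np

def median_nn_distance(p, n):
    # Closed-form median distance from the origin to the nearest of n points.
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

def simulate_min_distance(p, n, n_rep, rng):
    # Only radii matter for distance to the origin: R = U^(1/p) for a
    # point drawn uniformly from the unit p-ball.
    radii = rng.uniform(size=(n_rep, n)) ** (1.0 / p)
    return radii.min(axis=1)

rng = np.random.default_rng(0)
p, n = 10, 500
simulated = simulate_min_distance(p, n, n_rep=2000, rng=rng)
print(median_nn_distance(p, n), np.median(simulated))
```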