Curse of Dimensionality

Curse of Dimensionality refers to the non-intuitive properties of data observed when working in high-dimensional space*, specifically related to the usability and interpretation of distances and volumes. This is one of my favourite topics in Machine Learning and Statistics: it has broad applications (it is not specific to any machine learning method), it is counter-intuitive and hence awe-inspiring, it has profound implications for any analytics technique, and it has a 'cool' scary name, like some Egyptian curse!

For a quick grasp, consider this example: say you dropped a coin somewhere along a 100-metre line. How do you find it? Simple: just walk along the line and search. But what if it is a 100 x 100 sq. m field? It is already getting tough, searching (roughly) a football ground for a single coin. And what if it is a 100 x 100 x 100 cu. m space? Your football ground now has a thirty-storey height. Good luck finding a coin there! That, in essence, is the "curse of dimensionality".

Many ML methods use a Distance Measure

Most segmentation and clustering methods rely on computing distances between observations. The well-known k-Means algorithm assigns each point to its nearest centre. DBSCAN and hierarchical clustering also require a distance metric. Distribution- and density-based outlier detection algorithms likewise use distances, relative to other distances, to flag outliers.

Supervised classification solutions like the k-Nearest Neighbours method also use the distance between observations to assign a class to an unknown observation. The Support Vector Machine method involves transforming observations around selected kernels based on the distance between the observation and the kernel.

Common forms of recommendation systems involve distance-based similarity between user and item attribute vectors. Even when other forms of distance are used, the number of dimensions plays a role in analytic design.

One of the most common distance metrics is the Euclidean distance, which is simply the straight-line distance between two points in multi-dimensional hyper-space. The Euclidean distance between point i and point j in n-dimensional space can be computed as:

$d(i,j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$
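As a quick illustration (my own sketch, not from the original post; the function name euclidean_distance is just a label for this example), the formula translates directly into a few lines of Python:

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Euclidean (straight-line) distance between two n-dimensional points."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sqrt(((x_i - x_j) ** 2).sum())

# Example: two points in 3-dimensional space
print(euclidean_distance([0, 0, 0], [1, 2, 2]))   # 3.0
```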

Distance plays havoc in high dimensions

Consider the simple process of data sampling. Suppose the black outer box in Fig. 1 is the data universe, with a uniform distribution of data points across the whole volume, and that we want to sample 1% of the observations, as enclosed by the red inner box. The black box is a hyper-cube in multi-dimensional space, with each side representing the range of values in that dimension. For the simple 3-dimensional example in Fig. 1, the ranges might be x∈(0,100) and y∈(0,500), plus a range along the third dimension.

Figure 1 : Sampling

What proportion of each range should we sample to obtain that 1% sample? For 2 dimensions, sampling 10% of each range achieves the overall 1% sample, so we may select x∈(0,10) and y∈(0,50) and expect to capture 1% of all observations, because 10% × 10% = 1%. Do you expect this proportion to be higher or lower for 3 dimensions?

Even though our search now extends in an additional direction, the proportion actually increases to 21.5%. Not only does it increase, for just one additional dimension it roughly doubles! We now have to cover almost one-fifth of each dimension just to capture one-hundredth of the whole. In 10 dimensions this proportion is 63%, and in 100 dimensions, which is not an uncommon number of dimensions in real-life machine learning, one has to sample 95% of the range along each dimension to capture 1% of the observations! This mind-bending result follows because the required fraction per dimension is the d-th root of the target fraction, i.e. $0.01^{1/d}$, which approaches 1 as the number of dimensions d grows, even though the points remain uniformly spread.
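A short Python sketch (my own illustration, assuming points uniformly distributed over a hyper-cube) reproduces these numbers by solving per_dim ** d = 0.01 for the per-dimension fraction:

```python
# Per-dimension range fraction needed so that the sampled hyper-cube
# contains 1% of uniformly distributed observations: per_dim ** d == 0.01
target = 0.01

for d in (1, 2, 3, 10, 100):
    per_dim = target ** (1 / d)
    print(f"{d:>3} dimensions: sample {per_dim:.1%} of each dimension's range")

# 1 -> 1.0%, 2 -> 10.0%, 3 -> 21.5%, 10 -> 63.1%, 100 -> 95.5%
```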

This has consequences for the design of experiments and for sampling. The process becomes very computationally expensive, even to the extent that the sampled range along each dimension asymptotically approaches the full population range despite the sample size remaining much smaller than the population.

Consider another huge consequence of high dimensionality. Many algorithms measure the distance between two data points to define some notion of nearness (DBSCAN, kernels, k-Nearest Neighbours) with reference to a pre-defined distance threshold. In 2 dimensions, we can imagine that two points are near if one falls within a certain radius of the other. Consider the left image in Fig. 2: what share of uniformly spaced points within the black square falls inside the red circle? That is about π/4 ≈ 78.5%.

Figure 2 : Near-ness

So if you fit the biggest circle possible inside the square, you cover about 78.5% of the square. Yet the biggest sphere possible inside a cube covers only π/6 ≈ 52.4% of the volume, and this fraction shrinks exponentially, to roughly 0.25% in just 10 dimensions! What this essentially means is that in a high-dimensional world almost every data point sits near a corner and nothing really lies at the centre of the volume; in other words, the central volume shrinks to (almost) nothing because there is (almost) no centre! This has huge consequences for distance-based clustering algorithms: all the distances start looking the same, and any distance being slightly more or less than another is more a random fluctuation in the data than any real measure of dissimilarity!
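These fractions can be checked with a small Python sketch (an illustration added here, not the author's code) using the standard formula for the volume of a d-dimensional ball of radius 0.5 inscribed in a unit cube:

```python
from math import gamma, pi

def inscribed_ball_fraction(d):
    """Fraction of a d-dimensional unit cube filled by its inscribed ball (radius 0.5)."""
    return pi ** (d / 2) / gamma(d / 2 + 1) * 0.5 ** d

for d in (2, 3, 10):
    print(f"{d:>2}-D: {inscribed_ball_fraction(d):.2%}")

# 2-D: 78.54%   3-D: 52.36%   10-D: 0.25%
```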

Fig. 3 shows randomly generated 2-D data and the corresponding all-to-all distances. The coefficient of variation of the distances, computed as the standard deviation divided by the mean, is 45.9%. The corresponding number for similarly generated 5-D data is 26.5%, and for 10-D data it is 19.1%. Admittedly this is a single sample, but the trend supports the conclusion that in high dimensions every distance is about the same, and nothing is particularly near or far! A quick simulation of this effect is sketched after Fig. 3.

Figure 3 : Distance Clustering
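The sketch below (my own illustration, assuming uniform random data; the exact numbers will vary from sample to sample) computes the coefficient of variation of all pairwise distances for a few dimensionalities and shows the same downward trend:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def distance_cv(n_points, d):
    """Coefficient of variation (std / mean) of all pairwise Euclidean
    distances among n_points uniform random points in d dimensions."""
    X = rng.random((n_points, d))
    dists = pdist(X)                 # condensed vector of all unique pairwise distances
    return dists.std() / dists.mean()

for d in (2, 5, 10, 100):
    print(f"{d:>3}-D: CV of pairwise distances = {distance_cv(1000, d):.1%}")
```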

High dimensions affect other things too

Apart from distances and volumes, the number of dimensions creates other practical problems. Solution run-time and system-memory requirements often escalate non-linearly as the number of dimensions increases. Due to the exponential growth in feasible solutions, many optimization methods cannot reach the global optimum and have to make do with local optima. Further, instead of a closed-form solution, optimization must rely on search-based algorithms such as gradient descent, genetic algorithms and simulated annealing. More dimensions also introduce the possibility of correlation among variables, and parameter estimation can become difficult in regression approaches.

Dealing with high dimensions

This deserves a separate blog post in itself, but correlation analysis, clustering, information value, variance inflation factor and principal component analysis are some of the ways in which the number of dimensions can be reduced. A minimal sketch of the last of these follows.
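The sketch below uses scikit-learn's PCA on purely hypothetical synthetic data (100 correlated features driven by 5 latent factors, a setup invented for this example) and keeps just enough components to explain 95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1,000 observations of 100 features driven by 5 latent factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 5))
X = latent @ rng.normal(size=(5, 100)) + 0.1 * rng.normal(size=(1000, 100))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # e.g. (1000, 100) -> (1000, 5)
```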

* The number of variables, or features, that a data point is made up of is called the dimension of the data. For instance, any point in space can be represented using the 3 co-ordinates of length, breadth, and height, and so has 3 dimensions.

