Machine Learning Notes (Washington University) - Clustering Specialization - Week One & Week Two

1. One nearest neighbor

Input:

  Query article: Xq

  Corpus of documents (N docs): (X1, X2, X3,... ,XN)

Output:

  XNN = argmin over Xi of distance(Xq, Xi), i.e. the document in the corpus that is closest to the query
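The search above can be sketched as a single linear scan. This is a minimal illustration, assuming documents are already represented as equal-length numeric vectors and that Euclidean distance is the chosen metric; the names `euclidean` and `nearest_neighbor` are made up for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(query, corpus, distance=euclidean):
    """Return (index, distance) of the corpus vector closest to the query."""
    best_index, best_dist = None, math.inf
    for i, doc in enumerate(corpus):
        d = distance(query, doc)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Toy corpus of three 2-dimensional "documents" (illustrative values only).
corpus = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(nearest_neighbor([0.0, 1.5], corpus))  # (1, 0.5)
```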

2. K-NN Algorithm

# sorted list of the k smallest distances seen so far
smallest_distances = sort(σ1, σ2, ..., σk)
# the corresponding articles, kept in the same order
closest_articles = sort(book1, book2, ..., bookk)

For i = k+1, ..., N:

  compute σ = distance(booki, bookq)

  if σ < smallest_distances[k]:

    find j such that σ > smallest_distances[j-1] but σ < smallest_distances[j]

    remove the furthest article and shift the queue:

      closest_articles[j+1:k] = closest_articles[j:k-1]

      smallest_distances[j+1:k] = smallest_distances[j:k-1]

    set smallest_distances[j] = σ and closest_articles[j] = booki

return the k most similar articles
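A rough Python version of the procedure above, as a sketch rather than a reference implementation. It assumes vector inputs and Euclidean distance, uses 0-based indices, and replaces the explicit shifting with `list.insert` followed by `pop`; `knn_sorted_list` is a name chosen for this example.

```python
import bisect
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_sorted_list(query, corpus, k, distance=euclidean):
    """Return the indices of the k closest corpus vectors, nearest first."""
    if k <= 0:
        return []
    # Seed both lists with the first k documents, sorted by distance.
    seeded = sorted((distance(query, doc), i) for i, doc in enumerate(corpus[:k]))
    smallest_distances = [d for d, _ in seeded]
    closest_articles = [i for _, i in seeded]
    for i in range(k, len(corpus)):
        d = distance(query, corpus[i])
        if d < smallest_distances[-1]:
            # Find the insertion position j, insert, then drop the furthest entry.
            j = bisect.bisect_left(smallest_distances, d)
            smallest_distances.insert(j, d)
            closest_articles.insert(j, i)
            smallest_distances.pop()
            closest_articles.pop()
    return closest_articles

corpus = [[5.0], [1.0], [4.0], [2.0], [3.0]]
print(knn_sorted_list([0.0], corpus, k=3))  # [1, 3, 4]
```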

3. Document representation

i. Bag of words model

  • Ignore order of words
  • Count number of instances of each word in vocabulary

Issue:

  • Common words in the doc like "the" dominate rare words like "football"

ii. TF-IDF document representation

  • emphasize important words, which appear frequently in the document but rarely in the corpus

TF-IDF is defined as tf * idf:

(tf) Term frequency = word counts

(idf) Inverse doc frequency = log(number of docs/(1+number of docs which are using this word))
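The two definitions above can be combined in a few lines. The sketch below assumes each document is already tokenized into a list of words, and it uses the natural logarithm since the notes do not state a base; `tf_idf` and the toy sentences are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf * idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)  # tf = raw word counts
        vectors.append({
            word: count * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return vectors

docs = [
    "the game of football".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the weather is nice".split(),
]
vectors = tf_idf(docs)
# "the" appears in every document, so its weight falls below that of "football".
print(vectors[0]["the"], vectors[0]["football"])
```

Note that with the 1 + smoothing in the denominator, a word that appears in every document gets a slightly negative idf.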

4. Distance metric

i. Weight different features:

  • some features are more relevant than others
  • some features vary more than others

Specify weights as a function of feature spread

For feature j:

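  aj = 1 / (max_i(xi[j]) - min_i(xi[j]))

Scaled Euclidean distance:

  distance(xi, xq) = sqrt(a1 (xi[1] - xq[1])^2 + ... + ad (xi[d] - xq[d])^2) = sqrt((xi - xq)^T A (xi - xq))

where A is the diagonal matrix with a1, a2, ..., ad on the diagonal.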
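As a small illustration of weighting features by their spread, the sketch below computes one weight per feature and plugs the weights into the distance. The function names are hypothetical, and giving a constant feature weight 0 is an assumption made here to avoid dividing by zero.

```python
import math

def spread_weights(data):
    """Weight for feature j is 1 / (max over i of x_i[j] - min over i of x_i[j])."""
    weights = []
    for column in zip(*data):
        spread = max(column) - min(column)
        # Assumption: a constant feature gets weight 0 instead of a division by zero.
        weights.append(1.0 / spread if spread > 0 else 0.0)
    return weights

def scaled_euclidean(x, y, weights):
    """sqrt(sum_j a_j * (x[j] - y[j])^2), i.e. sqrt((x - y)^T A (x - y)) with diagonal A."""
    return math.sqrt(sum(a * (p - q) ** 2 for a, p, q in zip(weights, x, y)))

# The second feature varies far more than the first, so it gets a smaller weight.
data = [[0.0, 100.0], [1.0, 300.0], [2.0, 500.0]]
a = spread_weights(data)
print(a)  # [0.5, 0.0025]
print(scaled_euclidean(data[0], data[2], a))
```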

ii. Similarity

Similarity is the inner product between two article vectors: similarity = xi^T xq

  • it is not a proper distance metric; a distance can be defined as distance = 1 - similarity
  • efficient to compute for sparse vectors

Normalized (cosine) similarity is defined as:

  similarity = xi^T xq / (||xi|| ||xq||) = cos(theta)

where theta is the angle between the two vectors.

Normalizing the similarity rules out the effect of the length of an article,

but it can make dissimilar objects appear more similar, for example a tweet and a long article. Capping the maximum word counts mitigates this.
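To make the sparse-vector point concrete, here is a minimal sketch that stores each article as a `{word: count}` dict and only touches words present in both. Returning 0 for an empty vector is an assumption of this example, not something the notes specify.

```python
import math

def dot(u, v):
    """Inner product of two sparse vectors stored as {word: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[key] for key, w in u.items() if key in v)

def cosine_similarity(u, v):
    """Inner product divided by the product of the vector norms."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot(u, v) / (norm_u * norm_v)

short_doc = {"football": 1, "goal": 1}
long_doc = {"football": 10, "goal": 10}  # same proportions, ten times longer

print(dot(short_doc, long_doc))                # grows with document length
print(cosine_similarity(short_doc, long_doc))  # close to 1 regardless of length
```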

5. Complexity of brute-force search of KNN

O(N) distance computations per 1-NN query

O(N log k) per k-NN query (maintain a priority queue and insert into it efficiently)
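One way to get the O(N log k) behavior is a bounded heap of size k. The sketch below uses Python's `heapq` with negated distances to simulate a max-heap, and `math.dist` (Python 3.8+) for Euclidean distance; `knn_heap` is a name invented for this example.

```python
import heapq
import math

def knn_heap(query, corpus, k):
    """Return the indices of the k nearest corpus vectors, nearest first."""
    if k <= 0:
        return []
    heap = []  # entries are (-distance, index); heap[0] is the furthest kept neighbor
    for i, doc in enumerate(corpus):
        d = math.dist(query, doc)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))       # O(log k)
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))    # pop the furthest, push the new one
    return [i for _, i in sorted(heap, reverse=True)]

corpus = [[5.0], [1.0], [4.0], [2.0], [3.0]]
print(knn_heap([0.0], corpus, k=2))  # [1, 3]
```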

6. KD-trees

  • Structured organization of documents: recursively partitions points into axis-aligned boxes
  • enables more efficient pruning of the search space
  • works well in low to medium dimensions

Which dimension do we split along?

  • widest (or alternate)

which value do we split at?

  • median, or center point of the box (ignoring the data in the box)

when do we stop?

  • fewer than m points left or box hits minimum width
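The three choices above can be put together in a short recursive builder. This sketch picks the widest dimension, splits at the median, and stops when a node holds at most `leaf_size` points or its widest side is no larger than `min_width`. The dict-based node layout and all names are assumptions made for this example.

```python
def build_kdtree(points, leaf_size=1, min_width=0.0):
    """Recursively partition points into axis-aligned boxes; returns nested dicts."""
    dims = range(len(points[0]))
    widths = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    # Stopping rules: few points left, or the box has hit the minimum width.
    if len(points) <= leaf_size or max(widths) <= min_width:
        return {"points": points}
    dim = widths.index(max(widths))                 # split along the widest dimension
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2                         # split at the median
    return {
        "dim": dim,
        "value": ordered[mid][dim],
        "left": build_kdtree(ordered[:mid], leaf_size, min_width),
        "right": build_kdtree(ordered[mid:], leaf_size, min_width),
    }

def count_nodes(node):
    """Total number of nodes (internal nodes plus leaves)."""
    if "points" in node:
        return 1
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (1, 9), (6, 8)]
tree = build_kdtree(points)
print(count_nodes(tree))  # 15 nodes for 8 points with one point per leaf
```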

7. KNN with KD-trees

  • start by exploring the leaf node containing the query point
  • compute the distance to every point at that leaf node
  • backtrack and try the other branch at each node visited
  • use the distance bound and the bounding box of each node to prune parts of the tree that cannot include the nearest neighbor (distance to bounding box > distance to nearest neighbor so far)
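A compact sketch of this search, under a few assumptions: each node stores the tight bounding box of its own points, the closer child is visited first (which approximates descending to the query's leaf), and distances are Euclidean via `math.dist`. The node layout and function names are invented for illustration.

```python
import math

def build(points, leaf_size=2):
    """Build a KD-tree whose nodes carry a tight bounding box (lo, hi)."""
    lo = [min(c) for c in zip(*points)]
    hi = [max(c) for c in zip(*points)]
    node = {"lo": lo, "hi": hi}
    widths = [h - l for l, h in zip(lo, hi)]
    if len(points) <= leaf_size or max(widths) == 0:
        node["points"] = points
        return node
    dim = widths.index(max(widths))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    node["children"] = [build(ordered[:mid], leaf_size), build(ordered[mid:], leaf_size)]
    return node

def box_distance(q, node):
    """Distance from q to the node's bounding box (0 if q lies inside it)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, node["lo"], node["hi"])))

def nearest(node, q, best=(math.inf, None)):
    """Return (distance, point) of the nearest neighbor of q."""
    if box_distance(q, node) > best[0]:
        return best  # prune: nothing in this box can beat the current best
    if "points" in node:
        for p in node["points"]:
            d = math.dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Visit the closer child first, then backtrack into the other one.
    for child in sorted(node["children"], key=lambda c: box_distance(q, c)):
        best = nearest(child, q, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))
```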

Complexity of this algorithm

Construction:

  size = 2N - 1 nodes if there is 1 data point at each leaf, i.e. O(N)

  depth = O(log(N))

  median + send points left right = O(N) at every level of the tree

  Construction time = depth * cost at each level = O(Nlog(N))

1-NN Query:

  Traverse down tree to starting point = O(log(N))

  Maximum backtrack and traverse = O(N) in worst case

  Complexity range: O(log(N)) to O(N) per query, but exponential in d (the number of dimensions)

For N queries (finding the nearest neighbor of every document), brute-force search costs O(N^2), while KD-trees cost between O(N log(N)) and O(N^2)

In high dimensions:

  • most dimensions are just noise and everything is far away, so the search radius is likely to intersect many hypercubes in at least one dimension
  • not many nodes can be pruned
  • need technique to learn which features are important to given task

8. Approximate K-NN with KD-trees

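Before: prune when distance to bounding box > r, where r is the distance to the nearest neighbor found so far

Now: prune when distance to bounding box > r/α, with α > 1

This prunes more than exact search allows, so the search is faster. In exchange, the returned neighbor at distance r is only guaranteed to have no other point closer than r/α.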
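The change is a single comparison. The helper below is a hypothetical illustration of that test only, with `alpha = 1` reproducing the exact rule.

```python
def should_prune(box_dist, r, alpha=1.0):
    """True if a subtree can be skipped.

    box_dist: distance from the query to the subtree's bounding box
    r:        distance to the nearest neighbor found so far
    alpha:    1.0 gives exact search; values above 1 prune more aggressively
    """
    return box_dist > r / alpha

# A box at distance 0.8 must be explored by exact search when r = 1.0,
# but is skipped once alpha = 2 shrinks the threshold to 0.5.
print(should_prune(0.8, 1.0))             # False
print(should_prune(0.8, 1.0, alpha=2.0))  # True
```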

9. Locality sensitive hashing

Simple "binning" of data into multiple bins: we only search points within the query's bin, so each query compares against fewer points.

  • Draw h random lines
  • Compute score for each point under each line and translate to binary index
  • use h-bit binary vector per data point as bin index
  • create hash table
  • for each query point x, search bin(x), then search neighboring bins

A line can separate a point from its true nearest neighbor, so we are sacrificing accuracy for speed,

but we can also search neighboring bins (flip 1 bit at a time, until the computational budget is reached).

In high dimensions, we draw random planes instead of lines; for each data point, determining the bin index takes d multiplications per plane.
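The steps above might look like the following in code. This is a sketch under stated assumptions: planes pass through the origin with Gaussian-distributed normals, a bit is 1 when the score is non-negative, and neighboring bins are limited to those that differ in one bit. All function names are made up for this example.

```python
import random
from collections import defaultdict

def make_planes(h, d, seed=0):
    """Draw h random plane normals in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h)]

def bin_index(x, planes):
    """h-bit tuple: one bit per plane, 1 if the score (dot product) is >= 0."""
    return tuple(int(sum(w * v for w, v in zip(plane, x)) >= 0) for plane in planes)

def build_table(data, planes):
    """Hash table mapping bin index -> list of data point indices."""
    table = defaultdict(list)
    for i, x in enumerate(data):
        table[bin_index(x, planes)].append(i)
    return table

def candidates(q, table, planes, max_flips=1):
    """Indices in the query's bin, then in bins that differ by one bit."""
    key = bin_index(q, planes)
    found = list(table.get(key, []))
    if max_flips >= 1:
        for b in range(len(key)):
            neighbor = key[:b] + (1 - key[b],) + key[b + 1:]
            found.extend(table.get(neighbor, []))
    return found

data = [[1.0, 0.2, -0.5], [0.9, 0.1, -0.4], [-1.0, 0.3, 0.8]]
planes = make_planes(h=4, d=3)
table = build_table(data, planes)
print(candidates([1.0, 0.2, -0.5], table, planes))
```

Only the candidates returned here need an exact distance computation, which is where the speedup comes from.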

    

Time: 2024-10-19 08:20:35
