- random scheme, i.e. naive k-means
input: k, a set of n points
place k centroids at random locations (chosen at random)
- repeat the following operations until convergence:
--for each point i:
- find the nearest of the k centroids, j (by the distance formula)
- put point i into cluster j
--for each cluster j:
- compute the mean of every attribute over all points in cluster j; the means become the new centroid
(attributes must be numeric, not categorical or ordinal)
- stop when none of the cluster assignments change, i.e. no point changes its cluster membership
- O(iterations * k * n * dimensions); per iteration: O(kn) distance computations; memory: O(k + n)
- no precaching possible: the centroids change on every iteration
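A quick vectorized sketch of where that per-iteration cost comes from: one assignment + update step on made-up data (all names and numbers here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 2, 5                       # n points, d dimensions, k clusters
points = rng.random((n, d))
centroids = points[rng.choice(n, size=k, replace=False)]

# Assignment: n*k distances, each O(d) -> the O(k*n*dimensions) term per iteration.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)  # shape (n, k)
assignment = dists.argmin(axis=1)          # nearest centroid for every point

# Update: each centroid becomes the mean of its cluster (memory stays O(k + n)).
new_centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])
```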
- optimizations
1. k-means++ (an adaptive sampling scheme): slow, but small error; random selection: extremely fast, but large error
2. AFK-MC2: uses a Markov chain to improve on k-means++
- AFK-MC2 changes the seeding step
paper :https://las.inf.ethz.ch/files/bachem16fast.pdf
data points are the states of the Markov chain
a further data point is sampled to act as the candidate for the next state
a randomized decision determines whether the chain transitions to the candidate or remains in the old state
repeat; the last state is returned as the initial cluster center
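The chain described above can be sketched as plain K-MC² seeding (AFK-MC² in the paper additionally replaces the uniform candidate proposal with a data-dependent one). The function and parameter names here are my own, not from the paper:

```python
import numpy as np

def kmc2_seeding(data, k, chain_len=100, seed=None):
    """Pick k initial centers by running a short Markov chain per center."""
    rng = np.random.default_rng(seed)
    n = len(data)
    centers = [data[rng.integers(n)]]                 # first center: uniform at random
    for _ in range(k - 1):
        x = data[rng.integers(n)]                     # initial state of the chain
        dx = min(np.linalg.norm(x - c) ** 2 for c in centers)
        for _ in range(chain_len):
            y = data[rng.integers(n)]                 # candidate for the next state
            dy = min(np.linalg.norm(y - c) ** 2 for c in centers)
            # randomized decision: transition with probability min(1, dy/dx)
            if dx == 0 or rng.random() < dy / dx:
                x, dx = y, dy
        centers.append(x)                             # last state becomes the next center
    return np.array(centers)
```

Longer chains approach the k-means++ (D²) sampling distribution; the point of the MCMC trick is that each step touches one data point instead of all n.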
- code
- Euclidean distance: np.linalg.norm(a - b)
- load data: np.loadtxt(name)
- variables:
parameters: epsilon = 0  // threshold, the minimum change used in the stop condition
history_centroids = []
recording the configuration: num_instances, num_features = dataset.shape
initialization: prototypes = dataset[np.random.randint(0, num_instances, size=k)]
an np.ndarray of k rows, each with num_features elements, holding the previous centroids: prototypes_old = np.zeros(prototypes.shape)
holding the cluster assignments: belongs_to = np.zeros((num_instances, 1))
- iteration:
while norm > epsilon:
    iteration += 1
    norm = dist_method(prototypes, prototypes_old)  // stop test: how far the centroids moved between iterations
    prototypes_old = prototypes
    for index_in, instance in enumerate(dataset):
        dist_vec = np.zeros((k, 1))
        for index_prototype, prototype in enumerate(prototypes):
            dist_vec[index_prototype] = dist_method(prototype, instance)
        belongs_to[index_in, 0] = np.argmin(dist_vec)
    tmp_prototypes = np.zeros((k, num_features))
    for ..... (each cluster j: row j of tmp_prototypes becomes the mean of the points assigned to j; then prototypes = tmp_prototypes)
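Filling in the elided cluster-update loop, a runnable version of the whole routine. Variable names follow the notes; the empty-cluster guard and computing norm after the update are my choices:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def kmeans(dataset, k, epsilon=0.0, dist_method=euclidean):
    num_instances, num_features = dataset.shape
    prototypes = dataset[np.random.randint(0, num_instances, size=k)]  # initial centroids
    history_centroids = [prototypes]
    belongs_to = np.zeros((num_instances, 1))
    norm = np.inf                                    # centroid movement between iterations
    while norm > epsilon:
        # assignment step: each instance joins the cluster of its nearest prototype
        for index_in, instance in enumerate(dataset):
            dist_vec = np.zeros((k, 1))
            for index_prototype, prototype in enumerate(prototypes):
                dist_vec[index_prototype] = dist_method(prototype, instance)
            belongs_to[index_in, 0] = np.argmin(dist_vec)
        # update step: each prototype moves to the mean of its members
        tmp_prototypes = np.zeros((k, num_features))
        for index in range(k):
            members = np.where(belongs_to[:, 0] == index)[0]
            if len(members) > 0:
                tmp_prototypes[index] = dataset[members].mean(axis=0)
            else:
                tmp_prototypes[index] = prototypes[index]   # keep empty clusters in place
        norm = dist_method(prototypes, tmp_prototypes)
        prototypes = tmp_prototypes
        history_centroids.append(prototypes)
    return prototypes, history_centroids, belongs_to
```

With epsilon = 0 the loop stops exactly when the centroids (and hence the assignments) no longer change.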
- scaling with n and k
sample-and-approximation approaches: work poorly; as k grows the clustering gets worse
smarter initial centroid selection (seeding): e.g. the 'blacklist', Elkan's, and Hamerly's algorithms
- blacklist algorithm
build a tree over the data; while iterating over all centroids, rule some out (blacklist them) for whole subtrees
setup cost: O(n lg n) to build the tree; worst-case computation: O(kn lg n); memory: O(k + n lg n)
- Elkan's algorithm
computes the distances between centroids and keeps bounds on point-to-centroid distances (via the triangle inequality) to avoid distance computations
no setup cost; worst case O(k^2 + kn); memory O(k^2 + kn)
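The idea can be shown with the single triangle-inequality test at Elkan's core. This is a simplified sketch of one assignment step, not the full algorithm (which also carries per-point upper/lower bounds across iterations); the function name is mine:

```python
import numpy as np

def assign_with_pruning(points, centroids):
    """Label each point with its nearest centroid, skipping provably useless
    distance computations: if d(x, c) <= d(c, c')/2 then, by the triangle
    inequality, c' cannot be closer to x than c is."""
    k = len(centroids)
    # O(k^2) centroid-to-centroid distances, computed once
    cc = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    labels = np.empty(len(points), dtype=int)
    skipped = 0
    for i, x in enumerate(points):
        best, best_d = 0, np.linalg.norm(x - centroids[0])
        for j in range(1, k):
            if best_d <= cc[best, j] / 2:       # prune: c_j provably not closer
                skipped += 1
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

`skipped` counts point-to-centroid distances that were never evaluated; the tighter the clusters, the more the O(k²) centroid table pays off.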
- Dual-Tree k-means with bounded single-iteration runtime
paper: http://www.ratml.org/pub/pdf/2016dual.pdf
- build two trees: a query tree T and a reference tree Q. T stores the points whose nearest neighbors we want to find; Q stores the set the nearest neighbors come from
- traverse both simultaneously; when visiting a pair (T.node, Q.node), check whether it can be pruned, and if so prune the whole subtree (the same framework works for nearest-neighbor search, kernel density estimation, kernel conditional density estimation, etc.)
- space tree: not a space-partitioning tree; nodes are allowed to overlap. It is an undirected acyclic rooted simple graph
- each node holds some number of points (possibly 0), is connected to one parent, and has some number of children (possibly 0)
- there is a single root node
- every point is contained in at least one tree node
- every node has a convex subset of the multidimensional space containing all points in that node and the convex subsets represented by its children, i.e. every node has a bounding shape that contains all of its descendant points
- traversal
visits each pair (a combination of a T node and a Q node) no more than once and computes a score for the combination
if the score is greater than the bound, or infinite, the combination is pruned; otherwise the score is computed between each point of the T node and each point of the Q node, rather than between every pair of descendant points
once only leaves are left, call the base case
!!: dual-tree algorithm = space tree + pruning dual-tree traversal + BaseCase() and Score()
see the linked paper above for a deeper treatment
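A toy end-to-end instance of that recipe (space tree + pruning dual-tree traversal + Score/BaseCase). To keep it short it solves the simplest dual-tree problem, the closest pair of points between a query set and a reference set with one global bound; the k-means algorithm in the paper keeps per-query bounds instead. All class and function names here are mine:

```python
import numpy as np

class Node:
    """Minimal space-tree node: some points, a bounding ball, 0 or 2 children.
    (Here every point is pushed down to the leaves; real space trees may also
    keep points in internal nodes.)"""
    def __init__(self, pts):
        self.pts = pts
        self.center = pts.mean(axis=0)
        self.radius = np.linalg.norm(pts - self.center, axis=1).max()
        self.children = []
        if len(pts) > 8:                               # split until leaves are small
            axis = int(pts.var(axis=0).argmax())
            order = pts[:, axis].argsort()
            mid = len(pts) // 2
            self.children = [Node(pts[order[:mid]]), Node(pts[order[mid:]])]

def score(t, q):
    """Lower bound on the distance between any point in t and any point in q."""
    return max(0.0, np.linalg.norm(t.center - q.center) - t.radius - q.radius)

def base_case(t, q, best):
    """Leaves only: compare every point of t with every point of q."""
    return min(best, np.linalg.norm(t.pts[:, None] - q.pts[None, :], axis=2).min())

def traverse(t, q, best=np.inf):
    """Visit each (t, q) combination at most once; if its score cannot beat
    the current best, the pair and all descendant pairs are pruned."""
    if score(t, q) >= best:
        return best                                    # pruned
    if not t.children and not q.children:
        return base_case(t, q, best)
    for tc in (t.children or [t]):
        for qc in (q.children or [q]):
            best = traverse(tc, qc, best)
    return best
```

The prune in `traverse` is what makes the whole thing sub-quadratic in practice: one bounding-ball comparison can rule out every descendant point pair underneath it.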