- random scheme, i.e. naive k-means
input: k, a set of n points
place k centroids at random locations (chosen at random)
- repeat the following operations until convergence:
--for each point i:
- find the nearest of the k centroids, j (by the distance formula)
- put point i into cluster j
--for each cluster j:
- compute the mean of every attribute over all points in cluster j; the means become the new centroid
(attributes must be numeric, not categorical or ordinal)
- stop when none of the cluster assignments change, i.e. no point changes its cluster membership
- O(iterations * k * n * dimensions); per iteration: O(kn) distance computations; memory: O(k + n)
- no precaching possible: the centroids change on every iteration
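A quick vectorized sketch of where that per-iteration cost comes from: one assignment + update step on made-up data (all names and numbers here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 2, 5                       # n points, d dimensions, k clusters
points = rng.random((n, d))
centroids = points[rng.choice(n, size=k, replace=False)]

# Assignment: n*k distances, each O(d) -> the O(k*n*dimensions) term per iteration.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)  # shape (n, k)
assignment = dists.argmin(axis=1)          # nearest centroid for every point

# Update: each centroid becomes the mean of its cluster (memory stays O(k + n)).
new_centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])
```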
- optimizations
1. k-means++ (an adaptive sampling scheme): slow, but small error; random selection: extremely fast, but large error
2. AFK-MC2: uses a Markov chain to improve on k-means++
- AFK-MC2 changes the seeding step
paper :https://las.inf.ethz.ch/files/bachem16fast.pdf
data points are the states of the Markov chain
a further data point is sampled to act as the candidate for the next state
a randomized decision determines whether the chain transitions to the candidate or remains in the old state
repeat; the last state is returned as the initial cluster center
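The chain described above can be sketched as plain K-MC² seeding (AFK-MC² in the paper additionally replaces the uniform candidate proposal with a data-dependent one). The function and parameter names here are my own, not from the paper:

```python
import numpy as np

def kmc2_seeding(data, k, chain_len=100, seed=None):
    """Pick k initial centers by running a short Markov chain per center."""
    rng = np.random.default_rng(seed)
    n = len(data)
    centers = [data[rng.integers(n)]]                 # first center: uniform at random
    for _ in range(k - 1):
        x = data[rng.integers(n)]                     # initial state of the chain
        dx = min(np.linalg.norm(x - c) ** 2 for c in centers)
        for _ in range(chain_len):
            y = data[rng.integers(n)]                 # candidate for the next state
            dy = min(np.linalg.norm(y - c) ** 2 for c in centers)
            # randomized decision: transition with probability min(1, dy/dx)
            if dx == 0 or rng.random() < dy / dx:
                x, dx = y, dy
        centers.append(x)                             # last state becomes the next center
    return np.array(centers)
```

Longer chains approach the k-means++ (D²) sampling distribution; the point of the MCMC trick is that each step touches one data point instead of all n.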
- code
- Euclidean distance: np.linalg.norm(a - b)
- load data: np.loadtxt(name)
- variables:
parameters: epsilon = 0  // threshold, the minimum change used in the stop condition
history_centroids = []
recording the configuration: num_instances, num_features = dataset.shape
initialization: prototypes = dataset[np.random.randint(0, num_instances, size=k)]
an np.ndarray of k rows, each with num_features elements, holding the previous centroids: prototypes_old = np.zeros(prototypes.shape)
holding the cluster assignments: belongs_to = np.zeros((num_instances, 1))
- iteration:
while norm > epsilon:
    iteration += 1
    norm = dist_method(prototypes, prototypes_old)  // stop test: how far the centroids moved between iterations
    prototypes_old = prototypes
    for index_in, instance in enumerate(dataset):
        dist_vec = np.zeros((k, 1))
        for index_prototype, prototype in enumerate(prototypes):
            dist_vec[index_prototype] = dist_method(prototype, instance)
        belongs_to[index_in, 0] = np.argmin(dist_vec)
    tmp_prototypes = np.zeros((k, num_features))
    for ..... (each cluster j: row j of tmp_prototypes becomes the mean of the points assigned to j; then prototypes = tmp_prototypes)
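Filling in the elided cluster-update loop, a runnable version of the whole routine. Variable names follow the notes; the empty-cluster guard and computing norm after the update are my choices:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def kmeans(dataset, k, epsilon=0.0, dist_method=euclidean):
    num_instances, num_features = dataset.shape
    prototypes = dataset[np.random.randint(0, num_instances, size=k)]  # initial centroids
    history_centroids = [prototypes]
    belongs_to = np.zeros((num_instances, 1))
    norm = np.inf                                    # centroid movement between iterations
    while norm > epsilon:
        # assignment step: each instance joins the cluster of its nearest prototype
        for index_in, instance in enumerate(dataset):
            dist_vec = np.zeros((k, 1))
            for index_prototype, prototype in enumerate(prototypes):
                dist_vec[index_prototype] = dist_method(prototype, instance)
            belongs_to[index_in, 0] = np.argmin(dist_vec)
        # update step: each prototype moves to the mean of its members
        tmp_prototypes = np.zeros((k, num_features))
        for index in range(k):
            members = np.where(belongs_to[:, 0] == index)[0]
            if len(members) > 0:
                tmp_prototypes[index] = dataset[members].mean(axis=0)
            else:
                tmp_prototypes[index] = prototypes[index]   # keep empty clusters in place
        norm = dist_method(prototypes, tmp_prototypes)
        prototypes = tmp_prototypes
        history_centroids.append(prototypes)
    return prototypes, history_centroids, belongs_to
```

With epsilon = 0 the loop stops exactly when the centroids (and hence the assignments) no longer change.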
- scaling with n and k
sample-and-approximation approaches: work poorly; as k grows the clustering gets worse
smarter initial centroid selection (seeding): e.g. the 'blacklist', Elkan's, and Hamerly's algorithms
- blacklist algorithm
build a tree over the data; while iterating over all centroids, rule some out (blacklist them) for whole subtrees
setup cost: O(n lg n) to build the tree; worst-case computation: O(kn lg n); memory: O(k + n lg n)
- Elkan's algorithm
computes the distances between centroids and keeps bounds on point-to-centroid distances (via the triangle inequality) to avoid distance computations
no setup cost; worst case O(k^2 + kn); memory O(k^2 + kn)
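The idea can be shown with the single triangle-inequality test at Elkan's core. This is a simplified sketch of one assignment step, not the full algorithm (which also carries per-point upper/lower bounds across iterations); the function name is mine:

```python
import numpy as np

def assign_with_pruning(points, centroids):
    """Label each point with its nearest centroid, skipping provably useless
    distance computations: if d(x, c) <= d(c, c')/2 then, by the triangle
    inequality, c' cannot be closer to x than c is."""
    k = len(centroids)
    # O(k^2) centroid-to-centroid distances, computed once
    cc = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    labels = np.empty(len(points), dtype=int)
    skipped = 0
    for i, x in enumerate(points):
        best, best_d = 0, np.linalg.norm(x - centroids[0])
        for j in range(1, k):
            if best_d <= cc[best, j] / 2:       # prune: c_j provably not closer
                skipped += 1
                continue
            d = np.linalg.norm(x - centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

`skipped` counts point-to-centroid distances that were never evaluated; the tighter the clusters, the more the O(k²) centroid table pays off.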
- Dual-Tree k-means with bounded single-iteration runtime
paper: http://www.ratml.org/pub/pdf/2016dual.pdf
- build two trees: a query tree T and a reference tree Q. T stores the points whose nearest neighbors we want to find; Q stores the set the nearest neighbors come from
- traverse both simultaneously; when visiting a pair (T.node, Q.node), check whether it can be pruned, and if so prune the whole subtree (the same framework works for nearest-neighbor search, kernel density estimation, kernel conditional density estimation, etc.)
- space tree: not a space-partitioning tree; nodes are allowed to overlap. It is an undirected acyclic rooted simple graph
- each node holds some number of points (possibly 0), is connected to one parent, and has some number of children (possibly 0)
- there is a single root node
- every point is contained in at least one tree node
- every node has a convex subset of the multidimensional space containing all points in that node and the convex subsets represented by its children, i.e. every node has a bounding shape that contains all of its descendant points
- traversal
visits each pair (a combination of a T node and a Q node) no more than once and computes a score for the combination
if the score is greater than the bound, or infinite, the combination is pruned; otherwise the score is computed between each point of the T node and each point of the Q node, rather than between every pair of descendant points
once only leaves are left, call the base case
!!: dual-tree algorithm = space tree + pruning dual-tree traversal + BaseCase() and Score()
see the linked paper above for a deeper treatment
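A toy end-to-end instance of that recipe (space tree + pruning dual-tree traversal + Score/BaseCase). To keep it short it solves the simplest dual-tree problem, the closest pair of points between a query set and a reference set with one global bound; the k-means algorithm in the paper keeps per-query bounds instead. All class and function names here are mine:

```python
import numpy as np

class Node:
    """Minimal space-tree node: some points, a bounding ball, 0 or 2 children.
    (Here every point is pushed down to the leaves; real space trees may also
    keep points in internal nodes.)"""
    def __init__(self, pts):
        self.pts = pts
        self.center = pts.mean(axis=0)
        self.radius = np.linalg.norm(pts - self.center, axis=1).max()
        self.children = []
        if len(pts) > 8:                               # split until leaves are small
            axis = int(pts.var(axis=0).argmax())
            order = pts[:, axis].argsort()
            mid = len(pts) // 2
            self.children = [Node(pts[order[:mid]]), Node(pts[order[mid:]])]

def score(t, q):
    """Lower bound on the distance between any point in t and any point in q."""
    return max(0.0, np.linalg.norm(t.center - q.center) - t.radius - q.radius)

def base_case(t, q, best):
    """Leaves only: compare every point of t with every point of q."""
    return min(best, np.linalg.norm(t.pts[:, None] - q.pts[None, :], axis=2).min())

def traverse(t, q, best=np.inf):
    """Visit each (t, q) combination at most once; if its score cannot beat
    the current best, the pair and all descendant pairs are pruned."""
    if score(t, q) >= best:
        return best                                    # pruned
    if not t.children and not q.children:
        return base_case(t, q, best)
    for tc in (t.children or [t]):
        for qc in (q.children or [q]):
            best = traverse(tc, qc, best)
    return best
```

The prune in `traverse` is what makes the whole thing sub-quadratic in practice: one bounding-ball comparison can rule out every descendant point pair underneath it.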