K-Means Clustering Algorithm

K-Means definition:

K-Means is a distance-based, exclusive, partitioning method for clustering.

This short description packs in several concepts:

  • Clustering: K-Means is a method of cluster analysis. Clustering groups data objects into classes, or clusters, so that objects within the same cluster are highly similar to one another while objects in different clusters differ markedly.
  • Partitioning: clustering can be based on partitioning or on hierarchy. Partitioning divides the objects into separate clusters, whereas a hierarchical method arranges them into levels.
  • Exclusive: each data object can be assigned to only one cluster. If an object may be placed in more than one cluster, the clustering is called overlapping.
  • Distance: distance-based clustering groups objects that are close to, and therefore similar to, one another. Clustering based on a probability-distribution model instead looks, within a set of objects, for the subset that fits a particular distribution; its members are not necessarily the closest or most similar, but together they best reproduce the model the distribution describes. (A concrete distance formula follows this list.)
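
To make the notion of distance concrete: for two m-dimensional points x = (x1, ..., xm) and y = (y1, ..., ym), the Euclidean distance commonly used by K-Means is d(x, y) = sqrt((x1 − y1)² + ... + (xm − ym)²). This is exactly what the Distance() method computes in the C# implementation later in this article.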

K-Means problem statement:

Given a data set of n objects, K-Means constructs k partitions of the data, with each partition being one cluster and k ≤ n. The partitions must also satisfy:

  1. Each cluster contains at least one object.
  2. Each object belongs to one and only one cluster.

Simply put, K-Means clustering is an algorithm for classifying or grouping objects into K groups based on their attributes or features, where K is a positive integer. The grouping is performed by minimizing the sum of squared distances between each data point and the centroid of its cluster; in other words, K-Means classifies the data by proximity to the cluster centers.
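
Formally, if the clusters are C1, ..., CK with centroids (means) μ1, ..., μK, the quantity being minimized is the within-cluster sum of squares J = Σk Σx∈Ck ||x − μk||². Each pass of the algorithm described below can only decrease J or leave it unchanged, which is why the iteration terminates.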

K-Means algorithm implementation:

The most common implementation of K-Means is Lloyd's algorithm, an iterative-refinement heuristic:

  • Given the number of partitions k, create an initial partition: randomly select k objects from the data set, each initially representing one cluster centroid (Cluster Centroid). For every remaining object, compute its distance to each centroid and assign it to the nearest cluster.
  • Then apply an iterative relocation technique that tries to improve the partition by moving objects between clusters: whenever an object joins or leaves a cluster, recompute that cluster's mean and reassign objects accordingly. The process repeats until no object changes cluster (see the pseudocode below).

K-Means strengths and weaknesses:

K-Means performs well when the resulting clusters are dense and clearly separated from one another. For large data sets it is relatively scalable and efficient, with complexity O(nkt), where n is the number of objects, k the number of clusters, and t the number of iterations; usually k << n and t << n. Note, however, that the algorithm typically terminates at a local optimum rather than the global one.

The biggest problem with K-Means is that the number of clusters k must be supplied in advance. The choice of k is usually based on experience and repeated experiments, and a value that works for one data set transfers poorly to another. K-Means is also sensitive to outliers: a small amount of noisy data can pull a cluster mean far from where it belongs. One common heuristic for choosing k, the elbow method, is sketched below.
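
A minimal sketch of the elbow heuristic, written against the Cluster() method from the C# implementation later in this article; ComputeWcss is a hypothetical helper added here for illustration. The idea is to run the clustering for several candidate values of k and pick the k where the within-cluster sum of squares (WCSS) stops dropping sharply.

// Hypothetical helper: total within-cluster sum of squared distances (WCSS)
// for a clustering produced by Cluster(). data holds the tuples, one row each.
private static double ComputeWcss(double[][] data, int[] clustering, int k)
{
  int dim = data[0].Length;
  double[][] means = new double[k][];
  int[] counts = new int[k];
  for (int c = 0; c < k; ++c)
    means[c] = new double[dim];

  // accumulate per-cluster sums, then divide to get each cluster mean
  for (int i = 0; i < data.Length; ++i)
  {
    int c = clustering[i];
    ++counts[c];
    for (int j = 0; j < dim; ++j)
      means[c][j] += data[i][j];
  }
  for (int c = 0; c < k; ++c)
    for (int j = 0; j < dim; ++j)
      means[c][j] /= counts[c]; // Cluster() guarantees no empty cluster

  // sum the squared distances from each tuple to its own cluster mean
  double wcss = 0.0;
  for (int i = 0; i < data.Length; ++i)
  {
    int c = clustering[i];
    for (int j = 0; j < dim; ++j)
    {
      double d = data[i][j] - means[c][j];
      wcss += d * d;
    }
  }
  return wcss;
}

// usage: look for the k where WCSS levels off ("the elbow")
for (int k = 2; k <= 6; ++k)
{
  int[] clustering = Cluster(data, k);
  Console.WriteLine("k = " + k + "  WCSS = " + ComputeWcss(data, clustering, k).ToString("F4"));
}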

Lloyd's algorithm in pseudocode:

var m = initialCentroids(x, K);   // pick K starting centroids from the data
var N = x.length;

while (!stoppingCriteria)
{
    var w = [][];   // w[v] collects the points currently assigned to cluster v

    // assignment step: place each point in the cluster with the nearest centroid
    for (var n = 1; n <= N; n++)
    {
        v = argmin over v0 of dist(m[v0], x[n]);
        w[v].push(n);
    }

    // update step: recompute each centroid as the mean of its members
    for (var k = 1; k <= K; k++)
    {
        m[k] = avg(x[n] for n in w[k]);
    }
}

return m;

Lloyd's algorithm in C#:

The code below is adapted from Machine Learning Using C# Succinctly by James McCaffrey and from the article "K-Means Data Clustering Using C#".

using System;

// K-means clustering demo (Lloyd's algorithm).
// Coded using static methods. Normal error-checking removed for clarity.
// This code can be used in at least two ways.
// You can do a copy-paste and then insert the code into some system.
// Or you can wrap the code up in a Class Library.
// The single public method is Cluster().

namespace ClusteringKMeans
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin k-means clustering demo\n");

      // real data likely to come from a text file or SQL
      double[][] raw = new double[20][];
      raw[0] = new double[] { 65.0, 220.0 };
      raw[1] = new double[] { 73.0, 160.0 };
      raw[2] = new double[] { 59.0, 110.0 };
      raw[3] = new double[] { 61.0, 120.0 };
      raw[4] = new double[] { 75.0, 150.0 };
      raw[5] = new double[] { 67.0, 240.0 };
      raw[6] = new double[] { 68.0, 230.0 };
      raw[7] = new double[] { 70.0, 220.0 };
      raw[8] = new double[] { 62.0, 130.0 };
      raw[9] = new double[] { 66.0, 210.0 };
      raw[10] = new double[] { 77.0, 190.0 };
      raw[11] = new double[] { 75.0, 180.0 };
      raw[12] = new double[] { 74.0, 170.0 };
      raw[13] = new double[] { 70.0, 210.0 };
      raw[14] = new double[] { 61.0, 110.0 };
      raw[15] = new double[] { 58.0, 100.0 };
      raw[16] = new double[] { 66.0, 230.0 };
      raw[17] = new double[] { 59.0, 120.0 };
      raw[18] = new double[] { 68.0, 210.0 };
      raw[19] = new double[] { 61.0, 130.0 };

      Console.WriteLine("Raw un-clustered data:\n");
      Console.WriteLine("    Height Weight");
      Console.WriteLine("-------------------");
      ShowData(raw, 1, true, true);

      int k = 3;
      Console.WriteLine("\nSetting k to " + k);

      int[] clustering = Cluster(raw, k); // this is it

      Console.WriteLine("\nK-means clustering complete\n");

      Console.WriteLine("Final clustering in internal form:\n");
      ShowVector(clustering, true);

      Console.WriteLine("Raw data by cluster:\n");
      ShowClustered(raw, clustering, k, 1);

      Console.WriteLine("\nEnd k-means clustering demo\n");
      Console.ReadLine();
    }

    public static int[] Cluster(double[][] rawData, int k)
    {
      // k-means clustering
      // index of return is tuple ID, cell is cluster ID
      // ex: [2 1 0 0 2 2] means tuple 0 is cluster 2,
      // tuple 1 is cluster 1, tuple 2 is cluster 0, tuple 3 is cluster 0, etc.
      // an alternative clustering DS to save space is to use the .NET BitArray class
      double[][] data = Normalized(rawData); // so large values don't dominate

      bool changed = true; // was there a change in at least one cluster assignment?
      bool success = true; // were all means able to be computed? (no zero-count clusters)

      // init clustering[] to get things started
      // an alternative is to initialize means to randomly selected tuples
      // then the processing loop is
      // loop
      //    update clustering
      //    update means
      // end loop
      int[] clustering = InitClustering(data.Length, k, 0); // semi-random initialization
      double[][] means = Allocate(k, data[0].Length); // small convenience

      int maxCount = data.Length * 10; // sanity check
      int ct = 0;
      while (changed == true && success == true && ct < maxCount)
      {
        ++ct; // k-means typically converges very quickly
        success = UpdateMeans(data, clustering, means); // compute new cluster means if possible. no effect if fail
        changed = UpdateClustering(data, clustering, means); // (re)assign tuples to clusters. no effect if fail
      }
      // consider adding means[][] as an out parameter - the final means could be computed
      // the final means are useful in some scenarios (e.g., discretization and RBF centroids)
      // and even though you can compute final means from final clustering, in some cases it
      // makes sense to return the means (at the expense of some method signature ugliness)
      //
      // another alternative is to return, as an out parameter, some measure of cluster goodness
      // such as the average distance between cluster means, or the average distance between tuples in
      // a cluster, or a weighted combination of both
      return clustering;
    }

    private static double[][] Normalized(double[][] rawData)
    {
      // normalize raw data by computing (x - mean) / stddev
      // primary alternative is min-max:
      // v' = (v - min) / (max - min)

      // make a copy of input data
      double[][] result = new double[rawData.Length][];
      for (int i = 0; i < rawData.Length; ++i)
      {
        result[i] = new double[rawData[i].Length];
        Array.Copy(rawData[i], result[i], rawData[i].Length);
      }

      for (int j = 0; j < result[0].Length; ++j) // each col
      {
        double colSum = 0.0;
        for (int i = 0; i < result.Length; ++i)
          colSum += result[i][j];
        double mean = colSum / result.Length;
        double sum = 0.0;
        for (int i = 0; i < result.Length; ++i)
          sum += (result[i][j] - mean) * (result[i][j] - mean);
        double sd = Math.Sqrt(sum / result.Length); // standard deviation, not variance
        for (int i = 0; i < result.Length; ++i)
          result[i][j] = (result[i][j] - mean) / sd;
      }
      return result;
    }

    private static int[] InitClustering(int numTuples, int k, int randomSeed)
    {
      // init clustering semi-randomly (at least one tuple in each cluster)
      // consider alternatives, especially k-means++ initialization,
      // or instead of randomly assigning each tuple to a cluster, pick
      // numClusters of the tuples as initial centroids/means then use
      // those means to assign each tuple to an initial cluster.
      Random random = new Random(randomSeed);
      int[] clustering = new int[numTuples];
      for (int i = 0; i < k; ++i) // make sure each cluster has at least one tuple
        clustering[i] = i;
      for (int i = k; i < clustering.Length; ++i)
        clustering[i] = random.Next(0, k); // other assignments random
      return clustering;
    }

    private static double[][] Allocate(int k, int numColumns)
    {
      // convenience matrix allocator for Cluster()
      double[][] result = new double[k][];
      for (int i = 0; i < k; ++i)
        result[i] = new double[numColumns];
      return result;
    }

    private static bool UpdateMeans(double[][] data, int[] clustering, double[][] means)
    {
      // returns false if there is a cluster that has no tuples assigned to it
      // parameter means[][] is really a ref parameter

      // check existing cluster counts
      // can omit this check if InitClustering and UpdateClustering
      // both guarantee at least one tuple in each cluster (usually true)
      int numClusters = means.Length;
      int[] clusterCounts = new int[numClusters];
      for (int i = 0; i < data.Length; ++i)
      {
        int cluster = clustering[i];
        ++clusterCounts[cluster];
      }

      for (int k = 0; k < numClusters; ++k)
        if (clusterCounts[k] == 0)
          return false; // bad clustering. no change to means[][]

      // update, zero-out means so it can be used as scratch matrix
      for (int k = 0; k < means.Length; ++k)
        for (int j = 0; j < means[k].Length; ++j)
          means[k][j] = 0.0;

      for (int i = 0; i < data.Length; ++i)
      {
        int cluster = clustering[i];
        for (int j = 0; j < data[i].Length; ++j)
          means[cluster][j] += data[i][j]; // accumulate sum
      }

      for (int k = 0; k < means.Length; ++k)
        for (int j = 0; j < means[k].Length; ++j)
          means[k][j] /= clusterCounts[k]; // safe: zero counts were ruled out above
      return true;
    }

    private static bool UpdateClustering(double[][] data, int[] clustering, double[][] means)
    {
      // (re)assign each tuple to a cluster (closest mean)
      // returns false if no tuple assignments change OR
      // if the reassignment would result in a clustering where
      // one or more clusters have no tuples.

      int numClusters = means.Length;
      bool changed = false;

      int[] newClustering = new int[clustering.Length]; // proposed result
      Array.Copy(clustering, newClustering, clustering.Length);

      double[] distances = new double[numClusters]; // distances from curr tuple to each mean

      for (int i = 0; i < data.Length; ++i) // walk thru each tuple
      {
        for (int k = 0; k < numClusters; ++k)
          distances[k] = Distance(data[i], means[k]); // compute distances from curr tuple to all k means

        int newClusterID = MinIndex(distances); // find closest mean ID
        if (newClusterID != newClustering[i])
        {
          changed = true;
          newClustering[i] = newClusterID; // update
        }
      }

      if (changed == false)
        return false; // no change so bail and don't update clustering[]

      // check proposed clustering[] cluster counts
      int[] clusterCounts = new int[numClusters];
      for (int i = 0; i < data.Length; ++i)
      {
        int cluster = newClustering[i];
        ++clusterCounts[cluster];
      }

      for (int k = 0; k < numClusters; ++k)
        if (clusterCounts[k] == 0)
          return false; // bad clustering. no change to clustering[]

      Array.Copy(newClustering, clustering, newClustering.Length); // update
      return true; // good clustering and at least one change
    }

    private static double Distance(double[] tuple, double[] mean)
    {
      // Euclidean distance between two vectors for UpdateClustering()
      // consider alternatives such as Manhattan distance
      double sumSquaredDiffs = 0.0;
      for (int j = 0; j < tuple.Length; ++j)
        sumSquaredDiffs += Math.Pow((tuple[j] - mean[j]), 2);
      return Math.Sqrt(sumSquaredDiffs);
    }

    private static int MinIndex(double[] distances)
    {
      // index of smallest value in array
      // helper for UpdateClustering()
      int indexOfMin = 0;
      double smallDist = distances[0];
      for (int k = 0; k < distances.Length; ++k)
      {
        if (distances[k] < smallDist)
        {
          smallDist = distances[k];
          indexOfMin = k;
        }
      }
      return indexOfMin;
    }

    // misc display helpers for demo

    static void ShowData(double[][] data, int decimals, bool indices, bool newLine)
    {
      for (int i = 0; i < data.Length; ++i)
      {
        if (indices) Console.Write(i.ToString().PadLeft(3) + " ");
        for (int j = 0; j < data[i].Length; ++j)
        {
          if (data[i][j] >= 0.0) Console.Write(" ");
          Console.Write(data[i][j].ToString("F" + decimals) + " ");
        }
        Console.WriteLine("");
      }
      if (newLine) Console.WriteLine("");
    }

    static void ShowVector(int[] vector, bool newLine)
    {
      for (int i = 0; i < vector.Length; ++i)
        Console.Write(vector[i] + " ");
      if (newLine) Console.WriteLine("\n");
    }

    static void ShowClustered(double[][] data, int[] clustering, int k, int decimals)
    {
      for (int w = 0; w < k; ++w)
      {
        Console.WriteLine("===================");
        for (int i = 0; i < data.Length; ++i)
        {
          int clusterID = clustering[i];
          if (clusterID != w) continue;
          Console.Write(i.ToString().PadLeft(3) + " ");
          for (int j = 0; j < data[i].Length; ++j)
          {
            if (data[i][j] >= 0.0) Console.Write(" ");
            Console.Write(data[i][j].ToString("F" + decimals) + " ");
          }
          Console.WriteLine("");
        }
        Console.WriteLine("===================");
      }
    }
  }
}
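
Since Cluster() is the only public method, reusing the implementation from another part of the same program reduces to a single call. A hypothetical caller, where LoadData() stands in for whatever data source you have:

double[][] myData = LoadData();                // hypothetical: one row per tuple, e.g. height/weight
int[] assignment = Program.Cluster(myData, 3); // assignment[i] is the cluster ID of tuple i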

References

  • James McCaffrey, Machine Learning Using C# Succinctly, Syncfusion.
  • "K-Means Data Clustering Using C#" (article).

This article, "K-Means Clustering Algorithm", was published by Dennis Gao on his cnblogs (博客园) blog and may not be reproduced in any form, by human or by crawler, without the author's consent.
