'''Initialization methods'''

Commonly used initialization methods are Forgy and Random Partition.<ref name="hamerly4">{{Cite conference |last1=Hamerly |first1=Greg |last2=Elkan |first2=Charles |year=2002 |title=Alternatives to the ''k''-means algorithm that find better clusterings |url=http://people.csail.mit.edu/tieu/notebook/kmeans/15_p600-hamerly.pdf |booktitle=Proceedings of the eleventh international conference on Information and knowledge management (CIKM) }}</ref> The Forgy method randomly chooses ''k'' observations from the dataset and uses them as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, computing the initial means as the centroids of each cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the dataset. According to Hamerly et al.,<ref name="hamerly4" /> the Random Partition method is generally preferable for algorithms such as [[K-harmonic means]] and [[fuzzy k-means]]; for expectation maximization and the standard k-means algorithm, the Forgy method of initialization is preferable. However, a comprehensive study by Celebi et al.<ref>{{cite journal |last1=Celebi |first1=M. E. |last2=Kingravi |first2=H. A. |last3=Vela |first3=P. A. |year=2013 |title=A comparative study of efficient initialization methods for the ''k''-means clustering algorithm |journal=[[Expert Systems with Applications]] |volume=40 |issue=1 |pages=200&ndash;210 |arxiv=1209.1960 |doi=10.1016/j.eswa.2012.07.021 }}</ref> found that popular initialization methods such as Forgy, Random Partition, and Maximin often perform poorly, whereas the approach of Bradley and Fayyad<ref>{{Cite conference |last1=Bradley |first1=Paul S. |last2=Fayyad |first2=Usama M. |year=1998 |title=Refining Initial Points for ''k''-Means Clustering |book-title=Proceedings of the Fifteenth International Conference on Machine Learning }}</ref> performs best and k-means++ performs generally well.
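As an illustration, the two initialization schemes can be sketched in a few lines of NumPy. The function names here are ours, not from any library, and the Random Partition sketch assumes every cluster receives at least one point:

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: pick k distinct observations at random as the initial means."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

def random_partition_init(X, k, rng):
    """Random Partition: randomly assign each point to a cluster, then use
    each cluster's centroid as the initial mean.  Assumes no cluster ends
    up empty, which holds with high probability when len(X) >> k."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
means_forgy = forgy_init(X, 3, rng)          # tend to be spread out
means_part = random_partition_init(X, 3, rng)  # tend to sit near the data center
```

Because each Random Partition mean averages roughly len(X)/k random points, all of them land close to the overall data centroid, matching the behavior described above.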
    
<gallery class="center" widths="300px" heights="300px" caption="Demonstration of the standard algorithm">
The algorithm does not guarantee convergence to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: in particular, certain point sets, even in two dimensions, converge in exponential time, that is {{math|2<sup>Ω(<var>n</var>)</sup>}}.<ref>{{cite journal |last=Vattani |first=A. |year=2011 |title=k-means requires exponentially many iterations even in the plane |url=http://cseweb.ucsd.edu/users/avattani/papers/kmeans-journal.pdf |journal=[[Discrete and Computational Geometry]] |volume=45 |issue=4 |pages=596&ndash;616 |doi=10.1007/s00454-011-9340-1 }}</ref> These point sets do not seem to arise in practice; this is corroborated by the fact that the smoothed running time of k-means is polynomial.<ref name="Arthur, David; Manthey, B.; Roeglin, H. 20092">{{cite conference |last1=Arthur |first1=David |last2=Manthey |first2=B. |last3=Roeglin |first3=H. |year=2009 |title=k-means has polynomial smoothed complexity |booktitle=Proceedings of the 50th Symposium on Foundations of Computer Science (FOCS) |arxiv=0904.1113 }}</ref>
    
The "assignment" step is referred to as the "expectation step", while the "update" step is the maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
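A minimal NumPy sketch of the standard (Lloyd's) algorithm makes the correspondence concrete: the nearest-mean assignment is the expectation step and the centroid recomputation is the maximization step. The function names are illustrative, and the restart helper follows the common practice, noted above, of keeping the run with the lowest within-cluster sum of squares:

```python
import numpy as np

def kmeans(X, init_means, n_iter=100):
    """Standard (Lloyd's) k-means, alternating the two steps until the
    means stop moving."""
    means = init_means.astype(float).copy()
    for _ in range(n_iter):
        # Assignment ("expectation") step: each point joins its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update ("maximization") step: each mean becomes its cluster's centroid.
        new_means = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(means))])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

def best_of(X, k, n_restarts, rng):
    """Run k-means from several Forgy-style initializations and keep the
    result with the lowest within-cluster sum of squares."""
    best = None
    for _ in range(n_restarts):
        init = X[rng.choice(len(X), size=k, replace=False)]
        means, labels = kmeans(X, init)
        sse = ((X - means[labels]) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, means, labels)
    return best

rng = np.random.default_rng(1)
# Two well-separated blobs, so the intended clustering is unambiguous.
X = np.vstack([rng.normal(-3, 0.5, size=(50, 2)),
               rng.normal(3, 0.5, size=(50, 2))])
# Deterministic initialization with one point drawn from each blob.
means, labels = kmeans(X, np.vstack([X[0], X[50]]))
sse_best, means_best, labels_best = best_of(X, 2, 5, rng)
```

On this data a single run already recovers the two blob centroids; the restart wrapper matters precisely because, as stated above, the result can depend on the initial clusters.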