第1行: |
第1行: |
− | 此词条暂由彩云小译翻译,未经人工整理和审校,带来阅读不便,请见谅。
| + | 此词条暂由水流心不竞翻译,未经审校,带来阅读不便,请见谅。 |
| | | |
| {{Redirect|SLINK|the online magazine|Slink}} | | {{Redirect|SLINK|the online magazine|Slink}} |
第13行: |
第13行: |
| In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: | | In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: |
| | | |
− | 在数据挖掘和统计学中,层次聚类(也称为层次聚类分析或 HCA)是一种数据聚类方法,它寻求建立一个簇的层次结构。层次聚类的策略通常分为两类:
| + | 在数据挖掘和统计学中,'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''(也称为层次聚类分析或 HCA)是一种数据聚类方法,它旨在建立一个簇的层次结构。'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''的策略通常分为两类: |
| | | |
| * '''Agglomerative''': This is a "[[Top-down and bottom-up design|bottom-up]]" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. | | * '''Agglomerative''': This is a "[[Top-down and bottom-up design|bottom-up]]" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. |
第25行: |
第25行: |
| In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer | | | In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer | |
| | | |
− | 一般来说,合并和分裂是以贪婪的方式决定的。层次聚类的结果<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer |
| + | 一般来说,合并和分裂是以贪婪的方式决定的。'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''的结果<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer | |
| | | |
| chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a [[dendrogram]]. | | chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a [[dendrogram]]. |
第31行: |
第31行: |
| chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram. | | chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram. |
| | | |
− | 第八章: 层次聚类 | url=https://www.springer.com/gp/book/9783319219028 | chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> 通常在树状图中呈现。 | + | 第八章: '''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>''' | url=https://www.springer.com/gp/book/9783319219028 | chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> 通常在树状图中呈现。 |
| | | |
| | | |
第39行: |
第39行: |
| The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. | | The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. |
| | | |
− | 层次凝聚聚类(HAC)的标准算法的时间复杂度为 < math > mathical { o }(n ^ 3) </math > ,并且需要 < math > mathcal { o }(n ^ 2) </math > 内存,这使得它对于中等数据集来说太慢了。然而,对于某些特殊情况,已知的最佳有效凝聚方法(复杂度 < math > mathcal { o }(n ^ 2) </math >)是: 单连锁的 SLINK < ! ——粗体 wp: r # pla-- > 和完全连锁的 CLINK。对于堆,一般情况下的运行时可以缩减为 < math > mathcal { o }(n ^ 2 log n) </math > ,代价是进一步增加内存需求。在许多情况下,这种方法的内存开销太大,无法实际使用。
| + | '''<font color="#ff8000"> 层次凝聚聚类Hierarchical agglomerative clustering</font>'''(HAC)的标准算法的时间复杂度为 < math > mathical { o }(n ^ 3) </math > ,并且需要 < math > mathcal { o }(n ^ 2) </math > 内存,这使得它对于中等数据集来说太慢了。然而,对于某些特殊情况,已知的最佳有效凝聚方法(复杂度 < math > mathcal { o }(n ^ 2) </math >)是: 单连锁的 SLINK < ! ——粗体 wp: r # pla-- > 和完全连锁的 CLINK。对于堆,一般情况下的运行时可以缩减为 < math > mathcal { o }(n ^ 2 log n) </math > ,代价是进一步增加内存需求。在许多情况下,这种方法的内存开销太大,无法实际使用。 |
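下面给出一段示意性的 Python 草稿(非原文内容,假设可用 numpy):按 Lance–Williams 规则维护簇间距离矩阵的朴素单连锁凝聚聚类,共 n−1 轮合并、每轮扫描 O(n^2) 个簇对,对应上文所说的 <math>\mathcal{O}(n^3)</math> 时间与 <math>\mathcal{O}(n^2)</math> 内存;实际应用中一般直接使用 scipy 等成熟实现。

<syntaxhighlight lang="python">
import numpy as np

def naive_single_linkage(points):
    """朴素单连锁凝聚聚类:返回 [(簇A成员, 簇B成员, 合并距离), ...]。"""
    n = len(points)
    # 点对欧氏距离矩阵,O(n^2) 内存
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    active = list(range(n))                 # 仍存活的簇,用代表索引标识
    members = {i: [i] for i in active}      # 每个簇所含的观测
    merges = []
    for _ in range(n - 1):                  # 共 n-1 次合并
        # 在簇距离矩阵中找最近的簇对:每轮 O(n^2)
        a, b = min(((x, y) for x in active for y in active if x < y),
                   key=lambda p: d[p[0], p[1]])
        merges.append((members[a], members[b], float(d[a, b])))
        # Lance–Williams 更新(单连锁):d(a∪b, k) = min(d(a,k), d(b,k))
        for k in active:
            if k != a and k != b:
                d[a, k] = d[k, a] = min(d[a, k], d[b, k])
        members[a] = members[a] + members[b]
        active.remove(b)
    return merges

pts = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.2, 1.0]])
for ca, cb, dist in naive_single_linkage(pts):
    print(ca, cb, round(dist, 3))
</syntaxhighlight>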
| | | |
| | | |
第55行: |
第55行: |
| Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means. | | Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means. |
| | | |
− | 穷举搜索的分裂式聚类的复杂度为 <math>\mathcal{O}(2^n)</math>,但通常使用更快的启发式方法来选择分裂,比如 k-means。
| + | 穷举搜索的分裂式聚类的复杂度为 <math>\mathcal{O}(2^n)</math>,但通常使用更快的启发式方法来选择分裂,比如 k-means。 |
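作为上文"用 k-means 选择分裂"的一个示意草稿(非原文内容,假设可用 scikit-learn;"每次二分成员最多的簇"只是一种可能的启发式选择):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def divisive_bisecting(points, n_clusters):
    """分裂式(二分 k-means)聚类草稿:返回若干个观测索引数组。"""
    clusters = [np.arange(len(points))]      # 从包含全部观测的单一簇开始
    while len(clusters) < n_clusters:
        # 启发式:选择成员最多的簇进行分裂
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        # 用 k=2 的 k-means 把该簇一分为二,避免穷举所有划分
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(points[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

pts = np.random.RandomState(0).rand(20, 2)
for c in divisive_bisecting(pts, 4):
    print(sorted(c.tolist()))
</syntaxhighlight>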
| | | |
| | | |
| | | |
− | == Cluster dissimilarity == | + | == Cluster dissimilarity 簇间相异度== |
| | | |
| In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate [[metric (mathematics)|metric]] (a measure of [[distance]] between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. | | In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate [[metric (mathematics)|metric]] (a measure of [[distance]] between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. |
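下面是一个示意草稿(非原文内容,假设可用 scipy):用 pdist 按几种常见度量(欧氏、曼哈顿、切比雪夫)计算观测两两之间的距离矩阵,即上文所说的"度量"部分。

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
for metric in ("euclidean", "cityblock", "chebyshev"):
    # pdist 返回压缩形式的距离向量,squareform 转为 n x n 对称方阵
    D = squareform(pdist(X, metric=metric))
    print(metric, np.round(D, 2), sep="\n")
</syntaxhighlight>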
第69行: |
第69行: |
| | | |
| | | |
− | === Metric === | + | === Metric 度量标准=== |
| | | |
| {{Further information|Metric (mathematics)}} | | {{Further information|Metric (mathematics)}} |
第87行: |
第87行: |
| Some commonly used metrics for hierarchical clustering are: | | Some commonly used metrics for hierarchical clustering are: |
| | | |
− | 一些常用的层次聚类度量如下:
| + | 一些常用的'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''度量如下: |
| | | |
| {|class="wikitable" | | {|class="wikitable" |
第225行: |
第225行: |
| | | |
| | | |
− | === Linkage criteria === | + | === Linkage criteria 连接准则=== |
| | | |
| The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. | | The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. |
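下面用一个示意草稿(非原文内容,假设可用 numpy)说明这句话:单连锁、全连锁和平均连锁都是把"簇间距离"定义为跨簇点对距离的某个函数(最小值、最大值、平均值)。

<syntaxhighlight lang="python">
import numpy as np

def pairwise(A, B):
    # A, B: 两个簇的观测数组;返回所有跨簇点对的欧氏距离矩阵
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])
d = pairwise(A, B)
print("single  :", d.min())    # 单连锁:最近点对距离
print("complete:", d.max())    # 全连锁:最远点对距离
print("average :", d.mean())   # 平均连锁(UPGMA):平均距离
</syntaxhighlight>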
第441行: |
第441行: |
| | | |
| | | |
− | == Discussion == | + | == Discussion 讨论== |
| | | |
| Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. | | Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. |
第447行: |
第447行: |
| Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. | | Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. |
| | | |
− | 层次聚类的一个显著优势是可以使用任何有效的距离度量。事实上,甚至不需要观测本身:所用的只是一个距离矩阵。
| + | '''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''的一个显著优势是可以使用任何有效的距离度量。事实上,甚至不需要观测本身:所用的只是一个距离矩阵。 |
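下面是一个示意草稿(非原文内容,假设可用 scipy;其中的距离矩阵为虚构示例数据):仅凭一个距离矩阵即可运行层次聚类,无需原始观测。

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# 一个虚构的 4 x 4 对称距离矩阵(对角线为 0)
D = np.array([[0.0, 1.0, 4.0, 5.0],
              [1.0, 0.0, 3.0, 4.5],
              [4.0, 3.0, 0.0, 1.5],
              [5.0, 4.5, 1.5, 0.0]])
# squareform 把方阵压缩成 pdist 形式的向量,linkage 直接据此聚类
Z = linkage(squareform(D), method="average")
print(Z)               # 每行: [簇i, 簇j, 合并距离, 新簇大小]
# dendrogram(Z)        # 可进一步绘制树状图
</syntaxhighlight>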
| | | |
| | | |
| | | |
− | == Agglomerative clustering example == | + | == Agglomerative clustering example 凝聚聚类实例== |
| | | |
| | | |