第1行: |
第1行: |
− | 此词条暂由彩云小译翻译,未经人工整理和审校,带来阅读不便,请见谅。
| + | 此词条暂由水流心不竞翻译,未经审校,带来阅读不便,请见谅。 |
| | | |
| {{Redirect|SLINK|the online magazine|Slink}} | | {{Redirect|SLINK|the online magazine|Slink}} |
第13行: |
第13行: |
| In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: | | In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: |
| | | |
− | 在数据挖掘和统计学中,层次聚类(也称为层次聚类分析或 HCA)是一种数据聚类方法,它寻求建立一个簇的层次结构。层次聚类的策略通常分为两类:
| + | 在数据挖掘和统计学中,'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''(也称为层次聚类分析或 HCA)是一种数据聚类方法,它旨在建立一个簇的层次结构。'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''的策略通常分为两类: |
| | | |
| * '''Agglomerative''': This is a "[[Top-down and bottom-up design|bottom-up]]" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. | | * '''Agglomerative''': This is a "[[Top-down and bottom-up design|bottom-up]]" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. |
第25行: |
第25行: |
| In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer | | | In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer | |
| | | |
− | 一般来说,合并和分裂是以贪婪的方式决定的。层次聚类的结果<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer |
| + | 一般来说,合并和分裂是以贪婪的方式决定的。'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''的结果<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer | |
| | | |
| chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a [[dendrogram]]. | | chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a [[dendrogram]]. |
第31行: |
第31行: |
| chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram. | | chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram. |
| | | |
− | 第八章: 层次聚类 | url=https://www.springer.com/gp/book/9783319219028 | chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> 通常在树状图中呈现。 | + | 第八章: '''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>''' | url=https://www.springer.com/gp/book/9783319219028 | chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> 通常在树状图中呈现。 |
| | | |
| | | |
第39行: |
第39行: |
| The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. | | The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable. |
| | | |
− | 层次凝聚聚类(HAC)的标准算法的时间复杂度为 < math > mathical { o }(n ^ 3) </math > ,并且需要 < math > mathcal { o }(n ^ 2) </math > 内存,这使得它对于中等数据集来说太慢了。然而,对于某些特殊情况,已知的最佳有效凝聚方法(复杂度 < math > mathcal { o }(n ^ 2) </math >)是: 单连锁的 SLINK < ! ——粗体 wp: r # pla-- > 和完全连锁的 CLINK。对于堆,一般情况下的运行时可以缩减为 < math > mathcal { o }(n ^ 2 log n) </math > ,代价是进一步增加内存需求。在许多情况下,这种方法的内存开销太大,无法实际使用。
| + | '''<font color="#ff8000"> 层次凝聚聚类Hierarchical agglomerative clustering</font>'''(HAC)的标准算法的时间复杂度为 < math > mathical { o }(n ^ 3) </math > ,并且需要 < math > mathcal { o }(n ^ 2) </math > 内存,这使得它对于中等数据集来说太慢了。然而,对于某些特殊情况,已知的最佳有效凝聚方法(复杂度 < math > mathcal { o }(n ^ 2) </math >)是: 单连锁的 SLINK < ! ——粗体 wp: r # pla-- > 和完全连锁的 CLINK。对于堆,一般情况下的运行时可以缩减为 < math > mathcal { o }(n ^ 2 log n) </math > ,代价是进一步增加内存需求。在许多情况下,这种方法的内存开销太大,无法实际使用。 |
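下面给出一段示意性的 Python 草稿(非原文内容,假设可用 numpy):按 Lance–Williams 规则维护簇间距离矩阵的朴素单连锁凝聚聚类,共 n−1 轮合并、每轮扫描 O(n^2) 个簇对,对应上文所说的 <math>\mathcal{O}(n^3)</math> 时间与 <math>\mathcal{O}(n^2)</math> 内存;实际应用中一般直接使用 scipy 等成熟实现。

<syntaxhighlight lang="python">
import numpy as np

def naive_single_linkage(points):
    """朴素单连锁凝聚聚类:返回 [(簇A成员, 簇B成员, 合并距离), ...]。"""
    n = len(points)
    # 点对欧氏距离矩阵,O(n^2) 内存
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    active = list(range(n))                 # 仍存活的簇,用代表索引标识
    members = {i: [i] for i in active}      # 每个簇所含的观测
    merges = []
    for _ in range(n - 1):                  # 共 n-1 次合并
        # 在簇距离矩阵中找最近的簇对:每轮 O(n^2)
        a, b = min(((x, y) for x in active for y in active if x < y),
                   key=lambda p: d[p[0], p[1]])
        merges.append((members[a], members[b], float(d[a, b])))
        # Lance–Williams 更新(单连锁):d(a∪b, k) = min(d(a,k), d(b,k))
        for k in active:
            if k != a and k != b:
                d[a, k] = d[k, a] = min(d[a, k], d[b, k])
        members[a] = members[a] + members[b]
        active.remove(b)
    return merges

pts = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.2, 1.0]])
for ca, cb, dist in naive_single_linkage(pts):
    print(ca, cb, round(dist, 3))
</syntaxhighlight>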
| | | |
| | | |
第55行: |
第55行: |
| Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means. | | Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means. |
| | | |
− | 穷举搜索的分裂式聚类的复杂度为 <math>\mathcal{O}(2^n)</math>,但通常使用更快的启发式方法来选择分裂,比如 k-means。
| + | 穷举搜索的分裂式聚类的复杂度为 <math>\mathcal{O}(2^n)</math>,但通常使用更快的启发式方法来选择分裂,比如 k-means。 |
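作为上文"用 k-means 选择分裂"的一个示意草稿(非原文内容,假设可用 scikit-learn;"每次二分成员最多的簇"只是一种可能的启发式选择):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def divisive_bisecting(points, n_clusters):
    """分裂式(二分 k-means)聚类草稿:返回若干个观测索引数组。"""
    clusters = [np.arange(len(points))]      # 从包含全部观测的单一簇开始
    while len(clusters) < n_clusters:
        # 启发式:选择成员最多的簇进行分裂
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        # 用 k=2 的 k-means 把该簇一分为二,避免穷举所有划分
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(points[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

pts = np.random.RandomState(0).rand(20, 2)
for c in divisive_bisecting(pts, 4):
    print(sorted(c.tolist()))
</syntaxhighlight>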
| | | |
| | | |
| | | |
− | == Cluster dissimilarity == | + | == Cluster dissimilarity 簇间相异度== |
| | | |
| In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate [[metric (mathematics)|metric]] (a measure of [[distance]] between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. | | In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate [[metric (mathematics)|metric]] (a measure of [[distance]] between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. |
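下面是一个示意草稿(非原文内容,假设可用 scipy):用 pdist 按几种常见度量(欧氏、曼哈顿、切比雪夫)计算观测两两之间的距离矩阵,即上文所说的"度量"部分。

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
for metric in ("euclidean", "cityblock", "chebyshev"):
    # pdist 返回压缩形式的距离向量,squareform 转为 n x n 对称方阵
    D = squareform(pdist(X, metric=metric))
    print(metric, np.round(D, 2), sep="\n")
</syntaxhighlight>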
第69行: |
第69行: |
| | | |
| | | |
− | === Metric === | + | === Metric 度量标准=== |
| | | |
| {{Further information|Metric (mathematics)}} | | {{Further information|Metric (mathematics)}} |
第87行: |
第87行: |
| Some commonly used metrics for hierarchical clustering are: | | Some commonly used metrics for hierarchical clustering are: |
| | | |
− | 一些常用的层次聚类度量如下:
| + | 一些常用的'''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''度量如下: |
| | | |
| {|class="wikitable" | | {|class="wikitable" |
第225行: |
第225行: |
| | | |
| | | |
− | === Linkage criteria === | + | === Linkage criteria 连接准则=== |
| | | |
| The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. | | The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. |
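下面用一个示意草稿(非原文内容,假设可用 numpy)说明这句话:单连锁、全连锁和平均连锁都是把"簇间距离"定义为跨簇点对距离的某个函数(最小值、最大值、平均值)。

<syntaxhighlight lang="python">
import numpy as np

def pairwise(A, B):
    # A, B: 两个簇的观测数组;返回所有跨簇点对的欧氏距离矩阵
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])
d = pairwise(A, B)
print("single  :", d.min())    # 单连锁:最近点对距离
print("complete:", d.max())    # 全连锁:最远点对距离
print("average :", d.mean())   # 平均连锁(UPGMA):平均距离
</syntaxhighlight>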
第441行: |
第441行: |
| | | |
| | | |
− | == Discussion == | + | == Discussion 讨论== |
| | | |
| Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. | | Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. |
第447行: |
第447行: |
| Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. | | Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. |
| | | |
− | 层次聚类的一个显著优势是可以使用任何有效的距离度量。事实上,甚至不需要观测本身:所用的只是一个距离矩阵。
| + | '''<font color="#ff8000"> 层次聚类Hierarchical clustering</font>'''的一个显著优势是可以使用任何有效的距离度量。事实上,甚至不需要观测本身:所用的只是一个距离矩阵。 |
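下面是一个示意草稿(非原文内容,假设可用 scipy;其中的距离矩阵为虚构示例数据):仅凭一个距离矩阵即可运行层次聚类,无需原始观测。

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# 一个虚构的 4 x 4 对称距离矩阵(对角线为 0)
D = np.array([[0.0, 1.0, 4.0, 5.0],
              [1.0, 0.0, 3.0, 4.5],
              [4.0, 3.0, 0.0, 1.5],
              [5.0, 4.5, 1.5, 0.0]])
# squareform 把方阵压缩成 pdist 形式的向量,linkage 直接据此聚类
Z = linkage(squareform(D), method="average")
print(Z)               # 每行: [簇i, 簇j, 合并距离, 新簇大小]
# dendrogram(Z)        # 可进一步绘制树状图
</syntaxhighlight>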
| | | |
| | | |
| | | |
− | == Agglomerative clustering example == | + | == Agglomerative clustering example 凝聚聚类实例== |
| | | |
| | | |