Changes

466 bytes added · 19:00, 1 September 2020 (Tue)
No edit summary
Line 1: Line 1: −
This entry was machine-translated by 彩云小译 and has not yet been manually edited or proofread; we apologize for any inconvenience in reading.
+
This entry was translated by 水流心不竞 and has not yet been proofread; we apologize for any inconvenience in reading.
    
{{Redirect|SLINK|the online magazine|Slink}}
Line 13: Line 13:
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
   −
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
+
In data mining and statistics, '''<font color="#ff8000">层次聚类 Hierarchical clustering</font>''' (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for '''<font color="#ff8000">层次聚类 Hierarchical clustering</font>''' generally fall into two types:
    
* '''Agglomerative''': This is a "[[Top-down and bottom-up design|bottom-up]]" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Line 25: Line 25:
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science |  year=2016 | publisher=Springer |
   −
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer |
+
In general, the merges and splits are determined in a greedy manner. The results of '''<font color="#ff8000">层次聚类 Hierarchical clustering</font>'''<ref>{{cite book | author=Frank Nielsen | title=Introduction to HPC with MPI for Data Science | year=2016 | publisher=Springer |
    
chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a [[dendrogram]].
Line 31: Line 31:
chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 |chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram.
   −
chapter=Chapter 8: Hierarchical Clustering | url=https://www.springer.com/gp/book/9783319219028 | chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram.
+
chapter=Chapter 8: '''<font color="#ff8000">层次聚类 Hierarchical Clustering</font>''' | url=https://www.springer.com/gp/book/9783319219028 | chapter-url=https://www.researchgate.net/publication/314700681 }}</ref> are usually presented in a dendrogram.
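Since the results are typically displayed as a dendrogram, here is a minimal illustrative sketch of producing one in Python; the random toy data and the choice of 'ward' linkage are assumptions added here, not part of the source text.

<syntaxhighlight lang="python">
# Illustrative sketch: visualizing a hierarchical clustering as a dendrogram.
# Toy data and 'ward' linkage are assumptions, not from the source text.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))   # 12 toy 2-D observations

Z = linkage(X, method='ward')  # linkage matrix encoding the merge hierarchy
dendrogram(Z)                  # draw the merges as a tree
plt.show()
</syntaxhighlight>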
Line 39: Line 39:
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable.
   −
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable.
+
The standard algorithm for '''<font color="#ff8000">层次凝聚聚类 Hierarchical agglomerative clustering</font>''' (HAC) has a time complexity of <math>\mathcal{O}(n^3)</math> and requires <math>\mathcal{O}(n^2)</math> memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<!--boldface per WP:R#PLA--> for single-linkage and CLINK for complete-linkage clustering. With a heap the runtime of the general case can be reduced to <math>\mathcal{O}(n^2 \log n)</math> at the cost of further increasing the memory requirements. In many cases, the memory overheads of this approach are too large to make it practically usable.
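As a concrete illustration of the agglomerative algorithms discussed above, here is a minimal sketch using SciPy, whose <code>linkage</code> routine provides efficient single- and complete-linkage implementations (in the spirit of the SLINK and CLINK settings); the toy data and the choice of 3 flat clusters are assumptions for illustration.

<syntaxhighlight lang="python">
# Minimal sketch of hierarchical agglomerative clustering (HAC) with SciPy.
# Toy data and the choice of 3 flat clusters are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                # 20 toy observations

Z_single = linkage(X, method='single')      # single-linkage (the SLINK setting)
Z_complete = linkage(X, method='complete')  # complete-linkage (the CLINK setting)

# Cut the single-linkage dendrogram into 3 flat clusters.
labels = fcluster(Z_single, t=3, criterion='maxclust')
print(labels)
</syntaxhighlight>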
Line 55: Line 55:
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means.
   −
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means.
+
Divisive clustering with an exhaustive search is <math>\mathcal{O}(2^n)</math>, but it is common to use faster heuristics to choose splits, such as k-means.
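A sketch of the k-means split heuristic mentioned above (commonly known as bisecting k-means) might look as follows; the function name, toy data, and parameters are illustrative assumptions, and scikit-learn's <code>KMeans</code> is used only as one convenient 2-way splitter.

<syntaxhighlight lang="python">
# Sketch of divisive clustering via repeated 2-means splits (bisecting
# k-means) -- a fast heuristic, not the exhaustive O(2^n) search.
# Function name, toy data, and parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]      # start: one cluster with every point
    while len(clusters) < n_clusters:
        # Heuristic: always split the largest remaining cluster.
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        split = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters.append(idx[split == 0])
        clusters.append(idx[split == 1])
    return clusters

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
for c in bisecting_kmeans(X, 3):
    print(len(c), "points")
</syntaxhighlight>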
         −
== Cluster dissimilarity ==
+
== Cluster dissimilarity 簇相异性 ==
    
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate [[metric (mathematics)|metric]] (a measure of [[distance]] between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
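To make the two ingredients concrete, the following assumed toy example computes pairwise Euclidean distances between two small clusters and then reduces them to a single set dissimilarity under three common linkage rules.

<syntaxhighlight lang="python">
# Sketch: a metric (Euclidean) plus a linkage rule yields a dissimilarity
# between two *sets* of observations. The toy clusters are assumptions.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
B = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster B

D = cdist(A, B, metric='euclidean')      # all pairwise distances A x B

print(D.min())    # single linkage: distance of the closest pair
print(D.max())    # complete linkage: distance of the farthest pair
print(D.mean())   # average linkage (UPGMA): mean pairwise distance
</syntaxhighlight>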
Line 69: Line 69: −
=== Metric ===
+
=== Metric 度量标准 ===
    
{{Further information|Metric (mathematics)}}
Line 87: Line 87:
Some commonly used metrics for hierarchical clustering are:
   −
Some commonly used metrics for hierarchical clustering are:
+
Some commonly used metrics for '''<font color="#ff8000">层次聚类 Hierarchical clustering</font>''' are:
    
{|class="wikitable"
Line 225: Line 225: −
=== Linkage criteria ===
+
=== Linkage criteria 连接准则 ===
    
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
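For reference, three standard linkage criteria (textbook definitions, added here for clarity) express the distance between sets <math>A</math> and <math>B</math> in terms of the pairwise distances <math>d(a,b)</math>:

* Single linkage: <math>d(A,B) = \min_{a \in A,\, b \in B} d(a,b)</math>
* Complete linkage: <math>d(A,B) = \max_{a \in A,\, b \in B} d(a,b)</math>
* Average linkage (UPGMA): <math>d(A,B) = \frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} d(a,b)</math>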
Line 441: Line 441: −
== Discussion ==
+
== Discussion 讨论 ==
    
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
Line 447: Line 447:
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
   −
Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
+
'''<font color="#ff8000">层次聚类 Hierarchical clustering</font>''' has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
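A small assumed example: because only a distance matrix is needed, SciPy's <code>linkage</code> can be fed a precomputed matrix of pairwise dissimilarities directly (in condensed form), with no underlying observations.

<syntaxhighlight lang="python">
# Sketch: clustering from a precomputed distance matrix alone.
# The matrix values are made up for illustration.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[0.0, 2.0, 9.0],
              [2.0, 0.0, 7.0],
              [9.0, 7.0, 0.0]])               # any symmetric dissimilarity matrix

Z = linkage(squareform(D), method='average')  # condensed form is expected
print(Z)
</syntaxhighlight>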
         −
== Agglomerative clustering example ==
+
== Agglomerative clustering example 凝聚聚类实例 ==
     