更改

社团结构 (查看源代码)

2020年4月9日 (四) 22:42的版本

添加9,249字节、 2020年4月9日 (四) 22:42

→‎社团检测算法

第93行：第93行： −

在任意网络中检测社团可能是一项困难的计算任务。网络中的社团数目(如果存在的话)通常是未知的，且社团规模和密度往往不同。尽管存在这些困难，一些检测社团的方法已经被发展和应用，并取得了不同程度的成功。

+

在任意网络中检测社团可能是一项困难的计算任务。网络中的社团数目(如果存在的话)通常是未知的，且社团规模和密度往往不同。尽管存在这些困难，一些检测社团的方法已经被发展和应用，并取得了不同程度的成功。<ref name = Notices/>

===最小割法===

第101行：第101行： −

在最小割法中，网络被分割成预定数量且大小相似的部分，选择这些部分使得节点组之间的边数达到最小。这种算法在许多应用程序中运行良好，但并不适用于检测一般网络中的社团结构，因为不论社团是否隐含在结构中，它只能找到数目固定的社团，而社团的数目通常是未知的。

+

在最小割法中，网络被分割成预定数量且大小相似的部分，选择这些部分使得节点组之间的边数达到最小。这种算法在许多应用程序中运行良好，但并不适用于检测一般网络中的社团结构，因为不论社团是否隐含在结构中，它只能找到数目固定的社团，而社团的数目通常是未知的。{{cite journal

+

| author = M. E. J. Newman

+

| year = 2004

+

| title = Detecting community structure in networks

+

| journal = Eur. Phys. J. B

+

| volume = 38

+

| issue = 2

+

| pages = 321–330

+

| doi = 10.1140/epjb/e2004-00124-y

+

| bibcode = 2004EPJB...38..321N

+

| hdl = 2027.42/43867

+

}}

+

</ref>

第110行：第122行： −

有几种常见的分组方案，其中最简单的两种是单链接聚类和'''完全链接聚类 Complete linkage clustering '''。前者在不同群组中的所有节点对的相似度小于给定的阈值的情况下，将两个群组视为独立的社团; 后者则是在每个群组中的所有节点对的相似度大于给定的阈值的前提之下进行分组。一个有趣的方法是使用各种相似或不同的测度，通过'''凸和 convex sums '''来改进层次聚类的性能。

+

有几种常见的分组方案，其中最简单的两种是单链接聚类和'''完全链接聚类 Complete linkage clustering '''。前者在不同群组中的所有节点对的相似度小于给定的阈值的情况下，将两个群组视为独立的社团; 后者则是在每个群组中的所有节点对的相似度大于给定的阈值的前提之下进行分组。一个有趣的方法是使用各种相似或不同的测度，通过'''凸和 convex sums '''来改进层次聚类的性能。<ref>{{Cite journal|title = Weighting dissimilarities to detect communities in networks|journal = Phil. Trans. R. Soc. A|date = 2015-12-13|issn = 1364-503X|pmid = 26527808|pages = 20150108|volume = 373|issue = 2056|doi = 10.1098/rsta.2015.0108|first = Alejandro J.|last = Alvarez|first2 = Carlos E.|last2 = Sanz-Rodríguez|first3 = Juan Luis|last3 = Cabrera|bibcode = 2015RSPTA.37350108A}}</ref>

第117行：第129行： −

另一个常见的社团检测算法是 '''Girvan-Newman 算法 Girvan–Newman algorithm '''：首先识别社群之间连接的边，然后去除这些边，留下的就是社团。该方法利用图论中的中心性度量来进行识别，当边位于多对节点之间时，给每条边赋予一个较大的数字。

+

另一个常见的社团检测算法是 '''Girvan-Newman 算法 Girvan–Newman algorithm '''<ref name=ComSocBio/> ：首先识别社群之间连接的边，然后去除这些边，留下的就是社团。该方法利用图论中的中心性度量来进行识别，当边位于多对节点之间时，给每条边赋予一个较大的数字。

−

'''Girvan-Newman算法'''返回的结果具有较好的质量，且由于它与许多标准软件包兼容，因此得到广泛应用。但其运行缓慢，在''n''个顶点和''m''条边的网络上耗费时长为''O(m2n)''，导致其并不适用于超过几千个节点的网络。

+

'''Girvan-Newman算法'''返回的结果具有较好的质量，且由于它与许多标准软件包兼容，因此得到广泛应用。但其运行缓慢，在''n''个顶点和''m''条边的网络上耗费时长为''O(m2n)''，导致其并不适用于超过几千个节点的网络。<ref name=fast>

+

{{cite journal

+

| author = M. E. J. Newman

+

| year = 2004

+

| title = Fast algorithm for detecting community structure in networks

+

| journal = Phys. Rev. E

+

| volume = 69

+

| issue = 6

+

| pages = 066133

+

| doi = 10.1103/PhysRevE.69.066133

+

| pmid = 15244693

+

| arxiv = cond-mat/0309508

+

| bibcode = 2004PhRvE..69f6133N

+

}}

+

</ref>

+

===模板度最大值===

−

尽管模板度最大值法有着它众所周知的缺点，但它依然是用于使用最为广泛的社团检测算法之一。 [[模块度 Modularity]]是一种衡量社团划分质量的标准，也是一个利益函数。模块度最大值法通过搜索一个或多个具有特别高的模块化的网络的可能划分来检测社团。由于对所有可能的划分进行穷举搜索十分困难，所以实际的算法都是基于近似优化方法，如贪婪算法，模拟退火或光谱优化。不同的算法使得检测速度和准确性之间达到不同类型的平衡。一种流行的模块度最大值法是 '''Louvain 算法 Louvain method '''，该算法迭代地优化本地社团，直到社团状态不再变化、全局模块化不能够再得到改善。一个利用RenEEL的算法是当前最优良的模块度最大值算法，该算法是一个极值集成学习'''EEL范式 Extremal Ensemble Learning paradigm '''的例子。

+

尽管模板度最大值法有着它众所周知的缺点，但它依然是用于使用最为广泛的社团检测算法之一。<ref name=fast/> [[模块度 Modularity]]是一种衡量社团划分质量的标准，也是一个利益函数。模块度最大值法通过搜索一个或多个具有特别高的模块化的网络的可能划分来检测社团。由于对所有可能的划分进行穷举搜索十分困难，所以实际的算法都是基于近似优化方法，如贪婪算法，模拟退火或光谱优化。不同的算法使得检测速度和准确性之间达到不同类型的平衡。<ref>

+

{{cite journal

+

+

| title = Comparing community structure identification

+

| journal = J. Stat. Mech.

+

| volume = 2005

+

| issue = 9

+

| pages = P09008

+

| doi = 10.1088/1742-5468/2005/09/P09008

+

|arxiv=cond-mat/0505245|bibcode=2005JSMTE..09..008D}}

+

</ref><ref>

+

{{cite journal

+

|author1=R. Guimera |author2=L. A. N. Amaral | year = 2005

+

| title = Functional cartography of complex metabolic networks

+

| journal = Nature

+

| volume = 433

+

| pages = 895–900

+

| doi = 10.1038/nature03288

+

| issue=7028

+

| pmid=15729348

+

| pmc=2175124

+

| arxiv=q-bio/0502035| bibcode=2005Natur.433..895G}}

+

</ref> 一种流行的模块度最大值法是 '''Louvain 算法 Louvain method '''，该算法迭代地优化本地社团，直到社团状态不再变化、全局模块化不能够再得到改善。<ref>

+

{{cite journal

+

+

| title = Fast unfolding of community hierarchies in large networks

+

| journal = J. Stat. Mech.

+

| volume = 2008

+

| issue = 10

+

| pages = P10008

+

| doi = 10.1088/1742-5468/2008/10/P10008

+

|arxiv=0803.0476|bibcode=2008JSMTE..10..008B}}

+

</ref><ref>{{cite web|title=Lightning-fast Community Detection in Social Media: A Scalable Implementation of the Louvain Algorithm|url=https://pdfs.semanticscholar.org/9fa0/3cde48aee448bef4de225cc1f4943ab72095.pdf|date=2013}}</ref>一个利用RenEEL的算法是当前最优良的模块度最大值算法，该算法是一个极值集成学习'''EEL范式 Extremal Ensemble Learning paradigm '''的例子。<ref>

+

{{cite journal

+

|author1=J. Guo |author2=P. Singh | author3=K.E. Bassler | year = 2019

+

| title = Reduced network extremal ensemble learning (RenEEL) scheme for community detection in complex networks

+

| journal = Scientific Reports

+

| volume = 9

+

| doi = 10.1038/s41598-019-50739-3

+

| issue = 14234

+

}}

+

</ref>

+

<ref name="RenEEL-Modularity">{{cite web|title=RenEEL-Modularity|url=https://github.com/kbassler/RenEEL-Modularity}}

+

</ref>

−

模块度优化的有效性是值得怀疑的，因为已经证明模块度优化受分辨率的大小限制，这导致我们往往不能检测到比某个规模更小的集群；另一方面，由于模块度值表征着具有高模块化的大量分区的简并度，接近绝对最大值，这可能与其他算法非常的不同。

+

模块度优化的有效性是值得怀疑的，因为已经证明模块度优化受分辨率的大小限制，这导致我们往往不能检测到比某个规模更小的集群；<ref>

+

{{cite journal

+

|author1=S. Fortunato |author2=M. Barthelemy | year = 2007

+

| title = Resolution limit in community detection

+

| journal = Proceedings of the National Academy of Sciences of the United States of America

+

| volume = 104

+

| pages = 36–41

+

| pmid = 17190818

+

| doi = 10.1073/pnas.0605965104

+

| issue = 1

+

| pmc = 1765466

+

| arxiv = physics/0607100| bibcode = 2007PNAS..104...36F}}

+

</ref>) 另一方面，由于模块度值表征着具有高模块化的大量分区的简并度，接近绝对最大值，这可能与其他算法非常的不同。<ref>

+

{{cite journal

+

|author1=B. H. Good |author2=Y.-A. de Montjoye |author3=A. Clauset | year = 2010

+

| title = The performance of modularity maximization in practical contexts

+

| journal = Phys. Rev. E

+

| volume = 81

+

| issue = 4

+

| pages = 046106

+

| doi = 10.1103/PhysRevE.81.046106

+

|pmid=20481785 |arxiv=0910.0165|bibcode=2010PhRvE..81d6106G}}

+

</ref>

===基于推论统计学的社团检测算法===

−

基于'''推论统计学 statistical inference ''' 的算法试图将'''生成模型数据 generative model '''和能对社团结构进行编码的网络数据相匹配。与其他办法相比，这种办法的总体优势在于其原则性更强，而且有能力从根本上解决具有统计意义的问题。文献中的算法大多是基于'''随机块模型 The stochastic block model '''~~以及混合隶属度、程度校正、层次结构等变量。~~ 模型选择可以使用一些有原则的算法，例如最小描述长度(或贝叶斯模型选择)和似然比检定。目前有许多算法可以对随机块模型进行有效的推理，包括置信传播和凝聚蒙特卡罗算法。与尝试给定目标函数对网络进行聚类的算法不同，这类算法是基于生成模型的。生成模型不仅可以描述网络的大规模结构，而且可以用来归纳数据和发现网络中的缺失或虚假链路。

+

基于'''推论统计学 statistical inference ''' 的算法试图将'''生成模型数据 generative model '''和能对社团结构进行编码的网络数据相匹配。与其他办法相比，这种办法的总体优势在于其原则性更强，而且有能力从根本上解决具有统计意义的问题。文献中的算法大多是基于'''随机块模型 The stochastic block model '''<ref>

+

{{Cite journal

+

| doi = 10.1016/0378-8733(83)90021-7

+

| issn = 0378-8733

+

| volume = 5

+

| issue = 2

+

| pages = 109–137

+

| last = Holland

+

| first = Paul W.

+

| author2 = Kathryn Blackmond Laskey| author3 = Samuel Leinhardt

+

| title = Stochastic blockmodels: First steps

+

| journal = Social Networks

+

| date = June 1983

+

}}

+

</ref>以及混合隶属度<ref>{{Cite journal

+

| issn = 1532-4435

+

| volume = 9

+

| pages = 1981–2014

+

| last = Airoldi

+

| first = Edoardo M. |authorlink1=Edoardo Airoldi

+

|author2=David M. Blei |author3=Stephen E. Fienberg |author4=Eric P. Xing

+

| title = Mixed Membership Stochastic Blockmodels

+

| journal = J. Mach. Learn. Res.

+

| accessdate = 2013-10-09

+

| date = June 2008

+

| url = http://dl.acm.org/citation.cfm?id=1390681.1442798

+

| pmid = 21701698

+

| pmc = 3119541

+

}}

+

</ref><ref>

+

{{Cite journal

+

| doi = 10.1103/PhysRevE.84.036103

+

| pmid = 22060452

+

| volume = 84

+

| issue = 3

+

| pages = 036103

+

| last = Ball

+

| first = Brian

+

|author2=Brian Karrer |author3=M. E. J. Newman

+

| title = Efficient and principled method for detecting communities in networks

+

| journal = Physical Review E

+

| date = 2011

+

| arxiv = 1104.3590

+

| bibcode = 2011PhRvE..84c6103B

+

}}

+

</ref>、程度校正<ref>

+

{{Cite journal

+

| doi = 10.1103/PhysRevE.83.016107

+

| pmid = 21405744

+

| volume = 83

+

| issue = 1

+

| pages = 016107

+

| last = Karrer

+

| first = Brian

+

|author2=M. E. J. Newman

+

| title = Stochastic blockmodels and community structure in networks

+

| journal = Physical Review E

+

| date = 2011-01-21

+

| arxiv = 1008.3926

+

| bibcode = 2011PhRvE..83a6107K

+

}}

+

</ref>、层次结构等变量<ref>

+

{{Cite journal

+

| doi = 10.1103/PhysRevX.4.011047

+

| volume = 4

+

| issue = 1

+

| pages = 011047

+

| last = Peixoto

+

| first = Tiago P.

+

| title = Hierarchical Block Structures and High-Resolution Model Selection in Large Networks

+

| journal = Physical Review X

+

| date = 2014-03-24

+

| arxiv = 1310.4377

+

| bibcode = 2014PhRvX...4a1047P

+

}}</ref>。模型选择可以使用一些有原则的算法，例如最小描述长度(或贝叶斯模型选择<ref>{{cite arxiv|last=P. Peixoto|first=T.|date=2017|title=Bayesian stochastic blockmodeling|eprint=1705.10225|class=stat.ML}}</ref>)和似然比检定<ref>

+

{{cite journal

+

| last = Yan

+

| first = Xiaoran

+

+

| title = Model Selection for Degree-corrected Block Models

+

|arxiv= 1207.3994

+

| date = 2012-07-17 | doi=10.1088/1742-5468/2014/05/P05007 | pmid = 26167197

+

| pmc = 4498413

+

| volume=2014 | issue = 5

+

| journal=Journal of Statistical Mechanics: Theory and Experiment | page=P05007

+

|bibcode=2014JSMTE..05..007Y}}</ref>。目前有许多算法可以对随机块模型进行有效的推理，包括置信传播<ref>

+

{{Cite journal

+

| doi = 10.1073/pnas.1221839110

+

| issn = 0027-8424

+

| volume = 110

+

| issue = 36

+

| pages = 14534–14539

+

| last = Gopalan

+

| first = Prem K.

+

|author2=David M. Blei

+

| title = Efficient discovery of overlapping communities in massive networks

+

| journal = Proceedings of the National Academy of Sciences

+

| date = 2013-09-03

+

| pmid = 23950224

+

| pmc=3767539

+

| bibcode = 2013PNAS..11014534G

+

}}</ref><ref>

+

{{Cite journal

+

| doi = 10.1103/PhysRevE.84.066106

+

| pmid = 22304154

+

| volume = 84

+

| issue = 6

+

| pages = 066106

+

| last = Decelle

+

| first = Aurelien

+

|author2=Florent Krzakala |author3=Cristopher Moore |author4=Lenka Zdeborová|author4-link= Lenka Zdeborová

+

| title = Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications

+

| journal = Physical Review E

+

| date = 2011-12-12

+

| arxiv = 1109.3041

+

| bibcode = 2011PhRvE..84f6106D

+

}}

+

</ref>和凝聚蒙特卡罗算法<ref>{{Cite journal

+

| doi = 10.1103/PhysRevE.89.012804

+

| pmid = 24580278

+

| volume = 89

+

| issue = 1

+

| pages = 012804

+

| last = Peixoto

+

| first = Tiago P.

+

| title = Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models

+

| journal = Physical Review E

+

| date = 2014-01-13

+

| arxiv = 1310.4378

+

| bibcode = 2014PhRvE..89a2804P

+

}}

+

</ref>。与尝试给定目标函数对网络进行聚类的算法不同，这类算法是基于生成模型的。生成模型不仅可以描述网络的大规模结构，而且可以用来归纳数据和发现网络中的缺失或虚假链路。<ref>{{Cite journal

+

| doi = 10.1073/pnas.0908366106

+

| volume = 106

+

| issue = 52

+

| pages = 22073–22078

+

| last = Guimerà

+

| first = Roger

+

|author2=Marta Sales-Pardo

+

| title = Missing and spurious interactions and the reconstruction of complex networks

+

| journal = Proceedings of the National Academy of Sciences

+

| date = 2009-12-29

+

| pmid=20018705

+

| pmc=2799723

+

| arxiv = 1004.4791

+

| bibcode = 2009PNAS..10622073G

+

}}</ref><ref>

+

{{Cite journal

+

| issn = 0028-0836

+

| volume = 453

+

| issue = 7191

+

| pages = 98–101

+

| last = Clauset

+

| first = Aaron

+

| author2=Cristopher Moore |author3=M. E. J. Newman

+

| title = Hierarchical structure and the prediction of missing links in networks

+

| journal = Nature

+

| date = 2008-05-01

+

| doi = 10.1038/nature06830

+

| pmid=18451861

+

| arxiv = 0811.0484

+

| bibcode = 2008Natur.453...98C

+

}}

+

</ref>

+

===基于团结构的社团检测算法===

第141行：第397行： −

如果一个团不被其他任一团所包含，即它不是其他任一团的真子集，则称该团为图的'''极大团 maximal clique''' 。基于团的社团检测算法其中的一种方法就是找到“极大团” 。(其简单思想是：生成原始图的所有子图(可能有2n-1个子图，n表示节点个数）→判断这些子图是不是团→删除非极大团的其他团）其中， '''Bron-Kerbosch 算法 The Bron–Kerbosch algorithm ''' 是找到极大团的经典算法之一。最简单的方法是只考虑大于最小尺寸(节点数)的极大团。这些团体的联合被定义为一个子图，其组件(断开的部分)被定义为社团。这些算法通常在诸如 UCInet 这样的'''社交网络分析软件 social network analysis software '''中实现。

+

如果一个团不被其他任一团所包含，即它不是其他任一团的真子集，则称该团为图的'''极大团 maximal clique''' 。基于团的社团检测算法其中的一种方法就是找到“极大团” 。(其简单思想是：生成原始图的所有子图(可能有2n-1个子图，n表示节点个数）→判断这些子图是不是团→删除非极大团的其他团）其中， '''Bron-Kerbosch 算法 The Bron–Kerbosch algorithm ''' 是找到极大团的经典算法之一。最简单的方法是只考虑大于最小尺寸(节点数)的极大团。这些团体的联合被定义为一个子图，其组件(断开的部分)被定义为社团。<ref name="Everett1998">

+

{{cite journal

+

|author1=M.G. Everett |author2=S.P. Borgatti | year = 1998

+

| title = Analyzing Clique Overlap Connections

+

| journal = Connections

+

| volume = 21

+

| pages = 49

+

}}

+

</ref> 这些算法通常在诸如 UCInet 这样的'''社交网络分析软件 social network analysis software '''中实现。

−

另一种方法是使用固定大小的团体，设大小为''k''。这些图的重叠可以用来定义一类 ''k''- 正则超图或一种团图（线图的一种推广，此时''k''=~~2）。团图的节点表示原始图中的团，而团图的边则记录原始图中团的重叠。将以前的社团检测算法~~(将每个节点分配给社团)应用到团图，然后将每个团图分配给社团，进而可以使用它来确定派系中节点间的关系。同样，由于节点可能处于多个小团体中，所以它可以是多个社团的成员。例如，'''团渗透算法 clique percolation '''将社团定义为 ''k''-clique的渗透集群。为了做到这一点，需找到一个网络中的所有 ''k'' 派系，即所有完整子图的 ''k'' 个节点。然后定义两个共享 ''k-1''个节点的 '''''k''-cliques ''k'' -团 '''为相邻的 ''k''-cliques，即用于定义团图中的边。然后将社团定义为''k''-clique的最大并集，其中任意 ''k''-clique 可以通过一系列''k''-clique邻接，从任意''k''-clique 到达其他''k''-clique ~~。也就是说，社团只是团图中的连通组件。~~ 由于一个节点可以同时属于几个不同的''k''-clique渗透簇，故社团之间可以相互重叠。

+

另一种方法是使用固定大小的团体，设大小为<math>k</math>。这些图的重叠可以用来定义一类 <math>k</math>- 正则超图或一种团图（线图的一种推广，此时<math>k=2</math>）。<ref name="Evans2010">{{cite journal

+

| author = T.S. Evans

+

| year = 2010

+

| title = Clique Graphs and Overlapping Communities

+

| journal = J. Stat. Mech.

+

| pages = P12037

+

| doi = 10.1088/1742-5468/2010/12/P12037

+

| arxiv = 1009.0638

+

| volume=2010

+

| issue=12

+

| bibcode = 2010JSMTE..12..037E

+

}}

+

</ref> 团图的节点表示原始图中的团，而团图的边则记录原始图中团的重叠。将以前的社团检测算法(将每个节点分配给社团)应用到团图，然后将每个团图分配给社团，进而可以使用它来确定派系中节点间的关系。同样，由于节点可能处于多个小团体中，所以它可以是多个社团的成员。例如，'''团渗透算法 clique percolation '''<ref>

+

{{cite journal

+

+

| title = Uncovering the overlapping community structure of complex networks in nature and society

+

| journal = Nature

+

| pmid = 15944704

+

| volume = 435

+

| issue = 7043

+

| pages = 814–818

+

| doi = 10.1038/nature03607

+

|arxiv=physics/0506133|bibcode=2005Natur.435..814P}}

+

</ref>将社团定义为<math>k</math>-clique的渗透集群。为了做到这一点，需找到一个网络中的所有 <math>k</math>派系，即所有完整子图的 <math>k</math> 个节点。然后定义两个共享 <math>k-1</math>个节点的 '''<math>k</math>-cliques <math>k</math> -团 '''为相邻的<math>k</math>-cliques，即用于定义团图中的边。然后将社团定义为<math>k</math>-clique的最大并集，其中任意<math>k</math>-clique 可以通过一系列<math>k</math>-clique邻接，从任意<math>k</math>-clique 到达其他<math>k</math>-clique 。也就是说，社团只是团图中的连通组件。由于一个节点可以同时属于几个不同的<math>k</math>-clique渗透簇，故社团之间可以相互重叠。

==社团检测算法的评价标准==

薄荷

7,129

个编辑