用户讨论:木子二月鸟
模体发现算法 Motif discovery algorithms
Various solutions have been proposed for the challenging problem of motif discovery. These algorithms can be classified under various paradigms such as exact counting methods, sampling methods, pattern growth methods and so on. However, motif discovery problem comprises two main steps: first, calculating the number of occurrences of a sub-graph and then, evaluating the sub-graph significance. The recurrence is significant if it is detectably far more than expected. Roughly speaking, the expected number of appearances of a sub-graph can be determined by a Null-model, which is defined by an ensemble of random networks with some of the same properties as the original network.
Here, a review on computational aspects of major algorithms is given and their related benefits and drawbacks from an algorithmic perspective are discussed.
针对模体发现这一问题存在多种解决方案。这些算法可以归纳为不同的范式:例如精确计数方法,采样方法,模式增长方法等。但模体发现问题包括两个主要步骤:首先,计算子图的出现次数,然后评估子图的重要性。如果检测到的重现性远超过预期,那么这种重现性是很显著的。粗略地讲,子图的预期出现次数可以由零模型 Null-model 确定,该模型定义为具有与原始网络某些属性相同的随机网络的集合。
接下来,对下列算法的计算原理进行简要回顾,并从算法的角度讨论了它们的优缺点。
mfinder 算法
mfinder, the first motif-mining tool, implements two kinds of motif finding algorithms: a full enumeration and a sampling method. Until 2004, the only exact counting method for NM (network motif) detection was the brute-force one proposed by Milo et al..[3] This algorithm was successful for discovering small motifs, but using this method for finding even size 5 or 6 motifs was not computationally feasible. Hence, a new approach to this problem was needed.
mfinder是第一个模体挖掘工具,它主要有两种模体查找算法:完全枚举 full enumeration 和采样方法 sampling method。直到2004年,用于NM(网络模体 networkmotif)检测的唯一精确计数方法是Milo等人提出的暴力穷举方法。[3]该算法成功地发现了小规模的模体,但是这种方法甚至对于发现规模为5个或6个的模体在计算上都不可行的。因此,需要一种解决该问题的新方法。
Kashtan et al. [9] presented the first sampling NM discovery algorithm, which was based on edge sampling throughout the network. This algorithm estimates concentrations of induced sub-graphs and can be utilized for motif discovery in directed or undirected networks. The sampling procedure of the algorithm starts from an arbitrary edge of the network that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. After that, it continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. When an algorithm uses a sampling approach, taking unbiased samples is the most important issue that the algorithm might address. The sampling procedure, however, does not take samples uniformly and therefore Kashtan et al. proposed a weighting scheme that assigns different weights to the different sub-graphs within network.[9] The underlying principle of weight allocation is exploiting the information of the sampling probability for each sub-graph, i.e. the probable sub-graphs will obtain comparatively less weights in comparison to the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists mfinder to determine sub-graph concentrations impartially.
Kashtan 等人[9]首次提出了一种基于边缘采样的网络模体(NM)采样发现算法。该算法估计了诱导子图的集中程度 concentrations ,可用于有向或无向网络中的模体发现。该算法的采样过程从网络的任意一条边开始,该边连向大小为2的子图,然后选择一条与当前子图相关的随机边对子图进行扩展。之后,它将继续选择随机的相邻边,直到获得大小为n的子图为止。最后,采样得到的子图扩展为包括这n个节点在内的网络中存在的所有边。当使用采样方法时,获取无偏样本是这类算法可能面临的最重要问题。而且,采样过程并不能保证采到所有的样本(也就是不能保证得到所有的子图,译者注),因此,Kashtan 等人又提出了一种加权方案,为网络中的不同子图分配不同的权重。[9]权重分配的基本原理是利用每个子图的抽样概率信息,即,与不可能的子图相比,可能的子图将获得相对较少的权重;因此,该算法必须计算已采样的每个子图的采样概率。这种加权技术有助于mfinder公正地确定子图的浓度。
在扩展到与穷举搜索形成鲜明对比的过程中,算法的计算时间令人惊讶地渐近地与网络大小无关。对算法的计算时间的分析表明,对于网络中大小为n的子图的每个样本,它的取值为O(n n)。另一方面,在[9]中没有对需要解决图形同构性的采样子图的分类时间进行分析。每个子图样本的问题。另外,通过子图权重计算将额外的计算工作强加于该算法。但是不可避免的是,该算法可能会多次采样相同的子图–花时间而不收集任何信息。[10]总之,通过利用采样的优势,该算法的性能比穷举搜索算法更有效;但是,它只能大致确定子图的浓度。由于该算法的主要实现方式,它可以找到最大为6的主题,因此它给出的是最重要的主题,而其他所有主题都不是。另外,有必要提到此工具没有视觉呈现的选项。采样算法简要显示:
[3]Milo R, Shen-Orr SS, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002). "Network motifs: simple building blocks of complex networks". Science. 298 (5594): 824–827. Bibcode:2002Sci...298..824M. CiteSeerX 10.1.1.225.8750. doi:10.1126/science.298.5594.824. PMID 12399590.
[9]Kashtan N, Itzkovitz S, Milo R, Alon U (2004). "Efficient sampling algorithm for estimating sub-graph concentrations and detecting network motifs". Bioinformatics. 20 (11): 1746–1758. doi:10.1093/bioinformatics/bth163. PMID 15001476.