模体发现算法 Motif discovery algorithms

Various solutions have been proposed for the challenging problem of motif discovery. These algorithms can be classified under various paradigms such as exact counting methods, sampling methods, pattern growth methods and so on. However, motif discovery problem comprises two main steps: first, calculating the number of occurrences of a sub-graph and then, evaluating the sub-graph significance. The recurrence is significant if it is detectably far more than expected. Roughly speaking, the expected number of appearances of a sub-graph can be determined by a Null-model, which is defined by an ensemble of random networks with some of the same properties as the original network.

Here, a review on computational aspects of major algorithms is given and their related benefits and drawbacks from an algorithmic perspective are discussed.

针对模体发现这一问题存在多种解决方案。这些算法可以归纳为不同的范式:例如精确计数方法,采样方法,模式增长方法等。但模体发现问题包括两个主要步骤:首先,计算子图的出现次数,然后评估子图的重要性。如果检测到的重现性远超过预期,那么这种重现性是很显著的。粗略地讲,子图的预期出现次数可以由零模型 Null-model 确定,该模型定义为具有与原始网络某些属性相同的随机网络的集合。

接下来,对下列算法的计算原理进行简要回顾,并从算法的角度讨论了它们的优缺点。

mfinder 算法

mfinder, the first motif-mining tool, implements two kinds of motif finding algorithms: a full enumeration and a sampling method. Until 2004, the only exact counting method for NM (network motif) detection was the brute-force one proposed by Milo et al..[3] This algorithm was successful for discovering small motifs, but using this method for finding even size 5 or 6 motifs was not computationally feasible. Hence, a new approach to this problem was needed.

mfinder是第一个模体挖掘工具,它主要有两种模体查找算法:完全枚举 full enumeration 和采样方法 sampling method。直到2004年,用于NM(网络模体 networkmotif)检测的唯一精确计数方法是Milo等人提出的暴力穷举方法。[3]该算法成功地发现了小规模的模体,但是这种方法甚至对于发现规模为5个或6个的模体在计算上都不可行的。因此,需要一种解决该问题的新方法。

Kashtan et al. [9] presented the first sampling NM discovery algorithm, which was based on edge sampling throughout the network. This algorithm estimates concentrations of induced sub-graphs and can be utilized for motif discovery in directed or undirected networks. The sampling procedure of the algorithm starts from an arbitrary edge of the network that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. After that, it continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. When an algorithm uses a sampling approach, taking unbiased samples is the most important issue that the algorithm might address. The sampling procedure, however, does not take samples uniformly and therefore Kashtan et al. proposed a weighting scheme that assigns different weights to the different sub-graphs within network.[9] The underlying principle of weight allocation is exploiting the information of the sampling probability for each sub-graph, i.e. the probable sub-graphs will obtain comparatively less weights in comparison to the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists mfinder to determine sub-graph concentrations impartially.

Kashtan 等人[9]首次提出了一种基于边缘采样的网络模体(NM)采样发现算法。该算法估计了所含子图 induced sub-graphs 的集中度 concentrations ,可用于有向或无向网络中的模体发现。该算法的采样过程从网络的任意一条边开始,该边连向大小为2的子图,然后选择一条与当前子图相关的随机边对子图进行扩展。之后,它将继续选择随机的相邻边,直到获得大小为n的子图为止。最后,采样得到的子图扩展为包括这n个节点在内的网络中存在的所有边。当使用采样方法时,获取无偏样本是这类算法可能面临的最重要问题。而且,采样过程并不能保证采到所有的样本(也就是不能保证得到所有的子图,译者注),因此,Kashtan 等人又提出了一种加权方案,为网络中的不同子图分配不同的权重。[9]权重分配的基本原理是利用每个子图的抽样概率信息,即,与不可能的子图相比,可能的子图将获得相对较少的权重;因此,该算法必须计算已采样的每个子图的采样概率。这种加权技术有助于mfinder公正地确定子图的集中度 concentrations

In expanded to include sharp contrast to exhaustive search, the computational time of the algorithm surprisingly is asymptotically independent of the network size. An analysis of the computational time of the algorithm has shown that it takes O(nn) for each sample of a sub-graph of size n from the network. On the other hand, there is no analysis in [9] on the classification time of sampled sub-graphs that requires solving the graph isomorphism problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation. But it is unavoidable to say that the algorithm may sample the same sub-graph multiple times – spending time without gathering any information.[10] In conclusion, by taking the advantages of sampling, the algorithm performs more efficiently than an exhaustive search algorithm; however, it only determines sub-graphs concentrations approximately. This algorithm can find motifs up to size 6 because of its main implementation, and as result it gives the most significant motif, not all the others too. Also, it is necessary to mention that this tool has no option of visual presentation. The sampling algorithm is shown briefly:

与穷举搜索形成鲜明对比的是,该算法的计算时间竟然与网络大小渐近无关。对算法时间复杂度的分析表明,对于网络中大小为n的子图的每个样本,它的时间复杂度为[math]\displaystyle{ O(n^n) }[/math]。另一方面,并没有对已采样子图的每一个子图样本判断图同构问题的分类时间进行分析[9]。另外,子图权重计算将额外增加该算法的计算负担。但是不得不指出的是,该算法可能会多次采样相同的子图——花费时间而不收集任何有用信息。[10]总之,通过利用采样的优势,该算法的性能比穷举搜索算法更有效;但是,它只能大致确定子图的集中度 concentrations 。由于该算法的实现方式,使得它可以找到最大为6的模体,并且它会给出的最重要的模体,而不是其他所有模体。另外,有必要提到此工具没有可视化的呈现。采样算法简要显示如下:


mfinder
Definitions: Esis the set of picked edges. Vs is the set of all nodes that are touched by the edges in E.
Init Vs and Es to be empty sets.

1. Pick a random edge e1 = (vi, vj). Update Es = {e1}, Vs = {vi, vj}
2. Make a list L of all neighbor edges of Es. Omit from L all edges between members of Vs.
3. Pick a random edge e = {vk,vl} from L. Update Es = Es ⋃ {e}, Vs = Vs ⋃ {vk, vl}.
4. Repeat steps 2-3 until completing an n-node subgraph (until |Vs| = n).
5. Calculate the probability to sample the picked n-node subgraph.


mfinder
定义:[math]\displaystyle{ E_{s} }[/math]是采集的边集合。[math]\displaystyle{ V_{s} }[/math][math]\displaystyle{ E }[/math]中所有边的顶点的集合。
初始化[math]\displaystyle{ V_{s} }[/math][math]\displaystyle{ E_{s} }[/math]为空集。

1. 随机选择一条边[math]\displaystyle{ e_{1} = (v_{i}, v_{j}) }[/math],更新 [math]\displaystyle{ E_{s} = \{e_{1}\}, V{s} = \{v_{i}, v_{j}\} }[/math]
2. 列出[math]\displaystyle{ E{s} }[/math]的所有邻边列表[math]\displaystyle{ L }[/math],从[math]\displaystyle{ L }[/math]中删除[math]\displaystyle{ V{s} }[/math]中所有元素组成的边。
3. 从[math]\displaystyle{ L }[/math]中随机选择一条边[math]\displaystyle{ e = \{v_{k},v_{l}\} }[/math], 更新[math]\displaystyle{ E_{s} = E_{s} \cup \{e\} , V_{s} = V_{s} \cup \{v_{k}, v_{l}\} }[/math]
4. 重复步骤2-3,直到完成包含n个节点的子图 ([math]\displaystyle{ \left | V_{s} \right | = n }[/math])。
5. 计算对选取的n节点子图进行采样的概率。


FPF (Mavisto)算法

Schreiber and Schwöbbermeyer [12] proposed an algorithm named flexible pattern finder (FPF) for extracting frequent sub-graphs of an input network and implemented it in a system named Mavisto.[27] Their algorithm exploits the downward closure property which is applicable for frequency concepts F2 and F3. The downward closure property asserts that the frequency for sub-graphs decrease monotonically by increasing the size of sub-graphs; however, this property does not hold necessarily for frequency concept F1. FPF is based on a pattern tree (see figure) consisting of nodes that represents different graphs (or patterns), where the parent of each node is a sub-graph of its children nodes; in other words, the corresponding graph of each pattern tree's node is expanded by adding a new edge to the graph of its parent node.

Schreiber和Schwöbbermeyer [12]提出了一种称为灵活模式查找器(FPF)的算法,用于提取输入网络的频繁子图,并将其在名为Mavisto的系统中加以实现。[27]他们的算法利用了向下闭包特性,该特性适用于频率概念[math]\displaystyle{ F_{2} }[/math][math]\displaystyle{ F_{3} }[/math]。向下闭包性质表明,子图的频率随着子图的大小而单调下降;但这一性质并不一定适用于频率概念[math]\displaystyle{ F_{1} }[/math]。FPF算法基于模式树(见右图),由代表不同图形(或模式)的节点组成,其中每个节点的父节点是其子节点的子图;换句话说,每个模式树节点的对应图通过向其父节点图添加新边来扩展。

 
Illustration of the pattern tree in FPF algorithm..

At first, the FPF algorithm enumerates and maintains the information of all matches of a sub-graph located at the root of the pattern tree. Then, one-by-one it builds child nodes of the previous node in the pattern tree by adding one edge supported by a matching edge in the target graph, and tries to expand all of the previous information about matches to the new sub-graph (child node). In next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and if downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure is continued until there is no remaining path to traverse.

首先,FPF算法枚举并维护位于模式树根部的子图的所有匹配信息。然后,它通过在目标图中添加匹配边缘支持的一条边缘,在模式树中一一建立前一节点的子节点,然后通过在目标图中添加匹配边支持的一条边,逐个构建模式树中前一个节点的子节点,并尝试将以前关于匹配的所有信息拓展到新的子图(子节点)中。下一步,它判断当前模式的频率是否低于预定义的阈值。如果它低于阈值且保持向下闭包,则FPF算法会放弃该路径,而不会在树的此部分进一步遍历;这样就避免了不必要的计算。重复此过程,直到没有剩余可遍历的路径为止。

The advantage of the algorithm is that it does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time for promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern tree notion permits FPF to be implemented and executed in a parallel manner since it is possible to traverse each path of the pattern tree independently. However, FPF is most useful for frequency concepts F2 and F3, because downward closure is not applicable to F1. Nevertheless, the pattern tree is still practical for F1 if the algorithm runs in parallel. Another advantage of the algorithm is that the implementation of this algorithm has no limitation on motif size, which makes it more amenable to improvements. The pseudocode of FPF (Mavisto) is shown below:

该算法的优点是它不会考虑不频繁的子图,并尝试尽快完成枚举过程;因此,它只花时间在模式树中用于有希望的节点上,而放弃所有其他节点。还有一点额外的好处,模式树概念允许 FPF 以并行方式实现和执行,因为它可以独立地遍历模式树的每个路径。但是,FPF对于频率概念[math]\displaystyle{ F_{2} }[/math][math]\displaystyle{ F_{3} }[/math]最为有用,因为向下闭包不适用于[math]\displaystyle{ F_{1} }[/math]。尽管如此,如果算法并行运行,那么模式树对于[math]\displaystyle{ F_{1} }[/math]仍然是可行的。该算法的另一个优点是它的实现对模体大小没有限制,这使其更易于改进。FPF(Mavisto)的伪代码如下所示:


Mavisto
Data: Graph G, target pattern size t, frequency concept F.

Result: Set R of patterns of size t with maximum frequency.

R ← φ, fmax ← 0

P ←start pattern p1 of size 1
Mp1 ←all matches of p1 in G
While P ≠ φ do
    Pmax ←select all patterns from P with maximum size.
    P ← select pattern with maximum frequency from Pmax
    Ε = ExtensionLoop(G, p, Mp)
    Foreach pattern p ∈ E
        If F = F1 then f ← size(Mp)
        Else f ← Maximum Independent set(F, Mp)
        End
        If size(p) = t then
            If f = fmax then R ← R ⋃ {p}
            Else if f > fmax then R ← {p}; fmax ← f
            End
        Else
            If F = F1 or f ≥ fmax then P ← P ⋃ {p}
            End
        End
    End
End


Mavisto
数据: 图 [math]\displaystyle{ G }[/math], 目标模式规模 [math]\displaystyle{ t }[/math], 频率概念 [math]\displaystyle{ F }[/math]

结果: 以最大频率设置大小为 [math]\displaystyle{ t }[/math]的模式 [math]\displaystyle{ R }[/math].

[math]\displaystyle{ R \leftarrow \Phi , f_{max}\leftarrow 0 }[/math]

[math]\displaystyle{ P \leftarrow }[/math] 开始于大小为1的模式 [math]\displaystyle{ p_{1} }[/math]
[math]\displaystyle{ M_{p_{1}} \leftarrow }[/math][math]\displaystyle{ G }[/math] 中模式 [math]\displaystyle{ p_{1} }[/math] 的所有匹配
[math]\displaystyle{ P \neq \Phi }[/math] 时,执行:
    [math]\displaystyle{ P_{max} \leftarrow }[/math][math]\displaystyle{ P }[/math] 中选择最大规模的所有模式。
    [math]\displaystyle{ P\leftarrow }[/math][math]\displaystyle{ P_{max} }[/math] 中选择最大频率的模式
    [math]\displaystyle{ E = ExtensionLoop(G, p, M_{p}) }[/math]
    对于 [math]\displaystyle{ p \in E }[/math] 的所有模式:
        如果 [math]\displaystyle{ F = F_{1} }[/math] ,那么 [math]\displaystyle{ f \leftarrow size(M_{p}) }[/math]
        其他[math]\displaystyle{ f \leftarrow }[/math] 最大独立集 [math]\displaystyle{ (F, M_{p}) }[/math]
        结束
        如果 [math]\displaystyle{ size(p) = t }[/math] ,那么
            如果 [math]\displaystyle{ f = f_{max} }[/math] ,那么 [math]\displaystyle{ R \leftarrow R \cup \{p\} }[/math]
            其他 如果 [math]\displaystyle{ f \gt f_{max} }[/math] ,那么 [math]\displaystyle{ R \leftarrow \{p\} }[/math][math]\displaystyle{ f_{max} \leftarrow f }[/math]
            结束
        其他
            如果 [math]\displaystyle{ F = F_{1} or f \geq f_{max} }[/math] ,那么 [math]\displaystyle{ P \leftarrow P \cup \{p\} }[/math]
            结束
        结束
    结束
结束


[3]Milo R, Shen-Orr SS, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002). "Network motifs: simple building blocks of complex networks". Science. 298 (5594): 824–827. Bibcode:2002Sci...298..824M. CiteSeerX 10.1.1.225.8750. doi:10.1126/science.298.5594.824. PMID 12399590.

[9]Kashtan N, Itzkovitz S, Milo R, Alon U (2004). "Efficient sampling algorithm for estimating sub-graph concentrations and detecting network motifs". Bioinformatics. 20 (11): 1746–1758. doi:10.1093/bioinformatics/bth163. PMID 15001476.

[10]Wernicke S (2006). "Efficient detection of network motifs". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 3 (4): 347–359. CiteSeerX 10.1.1.304.2576. doi:10.1109/tcbb.2006.51. PMID 17085844.

[12]Schreiber F, Schwöbbermeyer H (2005). Frequency concepts and pattern detection for the analysis of motifs in networks. Transactions on Computational Systems Biology III. Lecture Notes in Computer Science. 3737. pp. 89–104. CiteSeerX 10.1.1.73.1130. doi:10.1007/11599128_7. ISBN 978-3-540-30883-6.

[27]Schreiber F, Schwobbermeyer H (2005). "MAVisto: a tool for the exploration of network motifs". Bioinformatics. 21 (17): 3572–3574. doi:10.1093/bioinformatics/bti556. PMID 16020473.