用户讨论:木子二月鸟

模体发现算法 Motif discovery algorithms

Various solutions have been proposed for the challenging problem of motif discovery. These algorithms can be classified under various paradigms such as exact counting methods, sampling methods, pattern growth methods and so on. However, motif discovery problem comprises two main steps: first, calculating the number of occurrences of a sub-graph and then, evaluating the sub-graph significance. The recurrence is significant if it is detectably far more than expected. Roughly speaking, the expected number of appearances of a sub-graph can be determined by a Null-model, which is defined by an ensemble of random networks with some of the same properties as the original network.

Here, a review on computational aspects of major algorithms is given and their related benefits and drawbacks from an algorithmic perspective are discussed.

mfinder 算法

mfinder, the first motif-mining tool, implements two kinds of motif finding algorithms: a full enumeration and a sampling method. Until 2004, the only exact counting method for NM (network motif) detection was the brute-force one proposed by Milo et al..[3] This algorithm was successful for discovering small motifs, but using this method for finding even size 5 or 6 motifs was not computationally feasible. Hence, a new approach to this problem was needed.

mfinder是第一个模体挖掘工具，它主要有两种模体查找算法：完全枚举 full enumeration 和采样方法 sampling method。直到2004年，用于NM（网络模体 networkmotif）检测的唯一精确计数方法是Milo等人提出的暴力穷举方法。[3]该算法成功地发现了小规模的模体，但是这种方法甚至对于发现规模为5个或6个的模体在计算上都不可行的。因此，需要一种解决该问题的新方法。

Kashtan et al. [9] presented the first sampling NM discovery algorithm, which was based on edge sampling throughout the network. This algorithm estimates concentrations of induced sub-graphs and can be utilized for motif discovery in directed or undirected networks. The sampling procedure of the algorithm starts from an arbitrary edge of the network that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. After that, it continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. When an algorithm uses a sampling approach, taking unbiased samples is the most important issue that the algorithm might address. The sampling procedure, however, does not take samples uniformly and therefore Kashtan et al. proposed a weighting scheme that assigns different weights to the different sub-graphs within network.[9] The underlying principle of weight allocation is exploiting the information of the sampling probability for each sub-graph, i.e. the probable sub-graphs will obtain comparatively less weights in comparison to the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists mfinder to determine sub-graph concentrations impartially.

Kashtan 等人[9]首次提出了一种基于边缘采样的网络模体（NM）采样发现算法。该算法估计了所含子图 induced sub-graphs 的集中度 concentrations ，可用于有向或无向网络中的模体发现。该算法的采样过程从网络的任意一条边开始，该边连向大小为2的子图，然后选择一条与当前子图相关的随机边对子图进行扩展。之后，它将继续选择随机的相邻边，直到获得大小为n的子图为止。最后，采样得到的子图扩展为包括这n个节点在内的网络中存在的所有边。当使用采样方法时，获取无偏样本是这类算法可能面临的最重要问题。而且，采样过程并不能保证采到所有的样本（也就是不能保证得到所有的子图，译者注），因此，Kashtan 等人又提出了一种加权方案，为网络中的不同子图分配不同的权重。[9]权重分配的基本原理是利用每个子图的抽样概率信息，即，与不可能的子图相比，可能的子图将获得相对较少的权重；因此，该算法必须计算已采样的每个子图的采样概率。这种加权技术有助于mfinder公正地确定子图的集中度 concentrations

In expanded to include sharp contrast to exhaustive search, the computational time of the algorithm surprisingly is asymptotically independent of the network size. An analysis of the computational time of the algorithm has shown that it takes O(nn) for each sample of a sub-graph of size n from the network. On the other hand, there is no analysis in [9] on the classification time of sampled sub-graphs that requires solving the graph isomorphism problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation. But it is unavoidable to say that the algorithm may sample the same sub-graph multiple times – spending time without gathering any information.[10] In conclusion, by taking the advantages of sampling, the algorithm performs more efficiently than an exhaustive search algorithm; however, it only determines sub-graphs concentrations approximately. This algorithm can find motifs up to size 6 because of its main implementation, and as result it gives the most significant motif, not all the others too. Also, it is necessary to mention that this tool has no option of visual presentation. The sampling algorithm is shown briefly:

mfinder
Definitions: Esis the set of picked edges. Vs is the set of all nodes that are touched by the edges in E.
Init Vs and Es to be empty sets.

1. Pick a random edge e1 = (vi, vj). Update Es = {e1}, Vs = {vi, vj}
2. Make a list L of all neighbor edges of Es. Omit from L all edges between members of Vs.
3. Pick a random edge e = {vk,vl} from L. Update Es = Es ⋃ {e}, Vs = Vs ⋃ {vk, vl}.
4. Repeat steps 2-3 until completing an n-node subgraph (until |Vs| = n).
5. Calculate the probability to sample the picked n-node subgraph.

mfinder

1. 随机选择一条边$\displaystyle{ e_{1} = (v_{i}, v_{j}) }$，更新 $\displaystyle{ E_{s} = \{e_{1}\}, V{s} = \{v_{i}, v_{j}\} }$
2. 列出$\displaystyle{ E{s} }$的所有邻边列表$\displaystyle{ L }$，从$\displaystyle{ L }$中删除$\displaystyle{ V{s} }$中所有元素组成的边。
3. 从$\displaystyle{ L }$中随机选择一条边$\displaystyle{ e = \{v_{k},v_{l}\} }$， 更新$\displaystyle{ E_{s} = E_{s} \cup \{e\} , V_{s} = V_{s} \cup \{v_{k}, v_{l}\} }$
4. 重复步骤2-3，直到完成包含n个节点的子图 ($\displaystyle{ \left | V_{s} \right | = n }$)。
5. 计算对选取的n节点子图进行采样的概率。

FPF (Mavisto)算法

Schreiber and Schwöbbermeyer [12] proposed an algorithm named flexible pattern finder (FPF) for extracting frequent sub-graphs of an input network and implemented it in a system named Mavisto.[27] Their algorithm exploits the downward closure property which is applicable for frequency concepts F2 and F3. The downward closure property asserts that the frequency for sub-graphs decrease monotonically by increasing the size of sub-graphs; however, this property does not hold necessarily for frequency concept F1. FPF is based on a pattern tree (see figure) consisting of nodes that represents different graphs (or patterns), where the parent of each node is a sub-graph of its children nodes; in other words, the corresponding graph of each pattern tree's node is expanded by adding a new edge to the graph of its parent node.

Schreiber和Schwöbbermeyer [12]提出了一种称为灵活模式查找器（FPF）的算法，用于提取输入网络的频繁子图，并将其在名为Mavisto的系统中加以实现。[27]他们的算法利用了向下闭包特性，该特性适用于频率概念$\displaystyle{ F_{2} }$$\displaystyle{ F_{3} }$。向下闭包性质表明，子图的频率随着子图的大小而单调下降；但这一性质并不一定适用于频率概念$\displaystyle{ F_{1} }$。FPF算法基于模式树（见右图），由代表不同图形（或模式）的节点组成，其中每个节点的父节点是其子节点的子图；换句话说，每个模式树节点的对应图通过向其父节点图添加新边来扩展。

Illustration of the pattern tree in FPF algorithm..

At first, the FPF algorithm enumerates and maintains the information of all matches of a sub-graph located at the root of the pattern tree. Then, one-by-one it builds child nodes of the previous node in the pattern tree by adding one edge supported by a matching edge in the target graph, and tries to expand all of the previous information about matches to the new sub-graph (child node). In next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and if downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure is continued until there is no remaining path to traverse.

The advantage of the algorithm is that it does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time for promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern tree notion permits FPF to be implemented and executed in a parallel manner since it is possible to traverse each path of the pattern tree independently. However, FPF is most useful for frequency concepts F2 and F3, because downward closure is not applicable to F1. Nevertheless, the pattern tree is still practical for F1 if the algorithm runs in parallel. Another advantage of the algorithm is that the implementation of this algorithm has no limitation on motif size, which makes it more amenable to improvements. The pseudocode of FPF (Mavisto) is shown below:

Mavisto
Data: Graph G, target pattern size t, frequency concept F.

Result: Set R of patterns of size t with maximum frequency.

R ← φ, fmax ← 0

P ←start pattern p1 of size 1
Mp1 ←all matches of p1 in G
While P ≠ φ do
Pmax ←select all patterns from P with maximum size.
P ← select pattern with maximum frequency from Pmax
Ε = ExtensionLoop(G, p, Mp)
Foreach pattern p ∈ E
If F = F1 then f ← size(Mp)
Else f ← Maximum Independent set(F, Mp)
End
If size(p) = t then
If f = fmax then R ← R ⋃ {p}
Else if f > fmax then R ← {p}; fmax ← f
End
Else
If F = F1 or f ≥ fmax then P ← P ⋃ {p}
End
End
End
End

Mavisto

$\displaystyle{ R \leftarrow \Phi , f_{max}\leftarrow 0 }$

$\displaystyle{ P \leftarrow }$ 开始于大小为1的模式 $\displaystyle{ p_{1} }$
$\displaystyle{ M_{p_{1}} \leftarrow }$$\displaystyle{ G }$ 中模式 $\displaystyle{ p_{1} }$ 的所有匹配
$\displaystyle{ P \neq \Phi }$ 时，执行：
$\displaystyle{ P_{max} \leftarrow }$$\displaystyle{ P }$ 中选择最大规模的所有模式。
$\displaystyle{ P\leftarrow }$$\displaystyle{ P_{max} }$ 中选择最大频率的模式
$\displaystyle{ E = ExtensionLoop(G, p, M_{p}) }$
对于 $\displaystyle{ p \in E }$ 的所有模式：
如果 $\displaystyle{ F = F_{1} }$ ，那么 $\displaystyle{ f \leftarrow size(M_{p}) }$
其他$\displaystyle{ f \leftarrow }$ 最大独立集 $\displaystyle{ (F, M_{p}) }$
结束
如果 $\displaystyle{ size(p) = t }$ ，那么
如果 $\displaystyle{ f = f_{max} }$ ，那么 $\displaystyle{ R \leftarrow R \cup \{p\} }$
其他 如果 $\displaystyle{ f \gt f_{max} }$ ，那么 $\displaystyle{ R \leftarrow \{p\} }$$\displaystyle{ f_{max} \leftarrow f }$
结束
其他
如果 $\displaystyle{ F = F_{1} or f \geq f_{max} }$ ，那么 $\displaystyle{ P \leftarrow P \cup \{p\} }$
结束
结束
结束

[3]Milo R, Shen-Orr SS, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002). "Network motifs: simple building blocks of complex networks". Science. 298 (5594): 824–827. Bibcode:2002Sci...298..824M. CiteSeerX 10.1.1.225.8750. doi:10.1126/science.298.5594.824. PMID 12399590.

[9]Kashtan N, Itzkovitz S, Milo R, Alon U (2004). "Efficient sampling algorithm for estimating sub-graph concentrations and detecting network motifs". Bioinformatics. 20 (11): 1746–1758. doi:10.1093/bioinformatics/bth163. PMID 15001476.

[10]Wernicke S (2006). "Efficient detection of network motifs". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 3 (4): 347–359. CiteSeerX 10.1.1.304.2576. doi:10.1109/tcbb.2006.51. PMID 17085844.

[12]Schreiber F, Schwöbbermeyer H (2005). Frequency concepts and pattern detection for the analysis of motifs in networks. Transactions on Computational Systems Biology III. Lecture Notes in Computer Science. 3737. pp. 89–104. CiteSeerX 10.1.1.73.1130. doi:10.1007/11599128_7. ISBN 978-3-540-30883-6.

[27]Schreiber F, Schwobbermeyer H (2005). "MAVisto: a tool for the exploration of network motifs". Bioinformatics. 21 (17): 3572–3574. doi:10.1093/bioinformatics/bti556. PMID 16020473.