更改

添加6字节 、 2020年5月7日 (四) 00:49
无编辑摘要
第100行: 第100行:     
Various solutions have been proposed for the challenging problem of motif discovery. These algorithms can be classified under various paradigms such as exact counting methods, sampling methods, pattern growth methods and so on. However, motif discovery problem comprises two main steps: first, calculating the number of occurrences of a sub-graph and then, evaluating the sub-graph significance. The recurrence is significant if it is detectably far more than expected. Roughly speaking, the expected number of appearances of a sub-graph can be determined by a Null-model, which is defined by an ensemble of random networks with some of the same properties as the original network.
 
Various solutions have been proposed for the challenging problem of motif discovery. These algorithms can be classified under various paradigms such as exact counting methods, sampling methods, pattern growth methods and so on. However, motif discovery problem comprises two main steps: first, calculating the number of occurrences of a sub-graph and then, evaluating the sub-graph significance. The recurrence is significant if it is detectably far more than expected. Roughly speaking, the expected number of appearances of a sub-graph can be determined by a Null-model, which is defined by an ensemble of random networks with some of the same properties as the original network.
 +
 +
针对模体发现这一问题存在多种解决方案。这些算法可以归纳为不同的范式:例如精确计数方法,采样方法,模式增长方法等。但模体发现问题包括两个主要步骤:首先,计算子图的出现次数,然后评估子图的重要性。如果检测到的重现性远超过预期,那么这种重现性是很显著的。粗略地讲,子图的预期出现次数可以由'''零模型 Null-model''' 确定,该模型定义为具有与原始网络某些属性相同的随机网络的集合。
 +
    
Here, a review on computational aspects of major algorithms is given and their related benefits and drawbacks from an algorithmic perspective are discussed.
 
Here, a review on computational aspects of major algorithms is given and their related benefits and drawbacks from an algorithmic perspective are discussed.
  −
针对模体发现这一问题存在多种解决方案。这些算法可以归纳为不同的范式:例如精确计数方法,采样方法,模式增长方法等。但模体发现问题包括两个主要步骤:首先,计算子图的出现次数,然后评估子图的重要性。如果检测到的重现性远超过预期,那么这种重现性是很显著的。粗略地讲,子图的预期出现次数可以由'''零模型 Null-model''' 确定,该模型定义为具有与原始网络某些属性相同的随机网络的集合。
      
接下来,对下列算法的计算原理进行简要回顾,并从算法的角度讨论了它们的优缺点。
 
接下来,对下列算法的计算原理进行简要回顾,并从算法的角度讨论了它们的优缺点。
第112行: 第113行:     
'''mfinder'''是第一个模体挖掘工具,它主要有两种模体查找算法:完全枚举 full enumeration 和采样方法 sampling method。直到2004年,用于NM('''网络模体 networkmotif''')检测的唯一精确计数方法是'''Milo'''等人提出的暴力穷举方法。<ref name="mil1" />该算法成功地发现了小规模的模体,但是这种方法甚至对于发现规模为5个或6个的模体在计算上都不可行的。因此,需要一种解决该问题的新方法。
 
'''mfinder'''是第一个模体挖掘工具,它主要有两种模体查找算法:完全枚举 full enumeration 和采样方法 sampling method。直到2004年,用于NM('''网络模体 networkmotif''')检测的唯一精确计数方法是'''Milo'''等人提出的暴力穷举方法。<ref name="mil1" />该算法成功地发现了小规模的模体,但是这种方法甚至对于发现规模为5个或6个的模体在计算上都不可行的。因此,需要一种解决该问题的新方法。
 +
    
Kashtan ''et al.'' <ref name="kas1" /> presented the first sampling NM discovery algorithm, which was based on ''edge sampling'' throughout the network. This algorithm estimates concentrations of induced sub-graphs and can be utilized for motif discovery in directed or undirected networks. The sampling procedure of the algorithm starts from an arbitrary edge of the network that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. After that, it continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. When an algorithm uses a sampling approach, taking unbiased samples is the most important issue that the algorithm might address. The sampling procedure, however, does not take samples uniformly and therefore Kashtan ''et al.'' proposed a weighting scheme that assigns different weights to the different sub-graphs within network.<ref name="kas1" /> The underlying principle of weight allocation is exploiting the information of the [[sampling probability]] for each sub-graph, i.e. the probable sub-graphs will obtain comparatively less weights in comparison to the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists ''mfinder'' to determine sub-graph concentrations impartially.
 
Kashtan ''et al.'' <ref name="kas1" /> presented the first sampling NM discovery algorithm, which was based on ''edge sampling'' throughout the network. This algorithm estimates concentrations of induced sub-graphs and can be utilized for motif discovery in directed or undirected networks. The sampling procedure of the algorithm starts from an arbitrary edge of the network that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. After that, it continues choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. When an algorithm uses a sampling approach, taking unbiased samples is the most important issue that the algorithm might address. The sampling procedure, however, does not take samples uniformly and therefore Kashtan ''et al.'' proposed a weighting scheme that assigns different weights to the different sub-graphs within network.<ref name="kas1" /> The underlying principle of weight allocation is exploiting the information of the [[sampling probability]] for each sub-graph, i.e. the probable sub-graphs will obtain comparatively less weights in comparison to the improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists ''mfinder'' to determine sub-graph concentrations impartially.
    
'''Kashtan''' 等人<ref name="kas1" />首次提出了一种基于边缘采样的网络模体(NM)采样发现算法。该算法估计了<font color="red">所含子图 induced sub-graphs 的集中度 concentrations </font>,可用于有向或无向网络中的模体发现。该算法的采样过程从网络的任意一条边开始,该边连向大小为2的子图,然后选择一条与当前子图相关的随机边对子图进行扩展。之后,它将继续选择随机的相邻边,直到获得大小为n的子图为止。最后,采样得到的子图扩展为包括这n个节点在内的网络中存在的所有边。当使用采样方法时,获取无偏样本是这类算法可能面临的最重要问题。而且,采样过程并不能保证采到所有的样本(也就是不能保证得到所有的子图,译者注),因此,Kashtan 等人又提出了一种加权方案,为网络中的不同子图分配不同的权重。<ref name="kas1" /> 权重分配的基本原理是利用每个子图的抽样概率信息,即,与不可能的子图相比,可能的子图将获得相对较少的权重;因此,该算法必须计算已采样的每个子图的采样概率。这种加权技术有助于mfinder公正地确定子图的<font color="red">集中度 concentrations </font>。
 
'''Kashtan''' 等人<ref name="kas1" />首次提出了一种基于边缘采样的网络模体(NM)采样发现算法。该算法估计了<font color="red">所含子图 induced sub-graphs 的集中度 concentrations </font>,可用于有向或无向网络中的模体发现。该算法的采样过程从网络的任意一条边开始,该边连向大小为2的子图,然后选择一条与当前子图相关的随机边对子图进行扩展。之后,它将继续选择随机的相邻边,直到获得大小为n的子图为止。最后,采样得到的子图扩展为包括这n个节点在内的网络中存在的所有边。当使用采样方法时,获取无偏样本是这类算法可能面临的最重要问题。而且,采样过程并不能保证采到所有的样本(也就是不能保证得到所有的子图,译者注),因此,Kashtan 等人又提出了一种加权方案,为网络中的不同子图分配不同的权重。<ref name="kas1" /> 权重分配的基本原理是利用每个子图的抽样概率信息,即,与不可能的子图相比,可能的子图将获得相对较少的权重;因此,该算法必须计算已采样的每个子图的采样概率。这种加权技术有助于mfinder公正地确定子图的<font color="red">集中度 concentrations </font>。
 +
    
In expanded to include sharp contrast to exhaustive search, the computational time of the algorithm surprisingly is asymptotically independent of the network size. An analysis of the computational time of the algorithm has shown that it takes {{math|O(n<sup>n</sup>)}} for each sample of a sub-graph of size {{math|n}} from the network. On the other hand, there is no analysis in <ref name="kas1" /> on the classification time of sampled sub-graphs that requires solving the ''graph isomorphism'' problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation. But it is unavoidable to say that the algorithm may sample the same sub-graph multiple times – spending time without gathering any information.<ref name="wer1" /> In conclusion, by taking the advantages of sampling, the algorithm performs more efficiently than an exhaustive search algorithm; however, it only determines sub-graphs concentrations approximately.  This algorithm can find motifs up to size 6 because of its main implementation, and as result it gives the most significant motif, not all the others too. Also, it is necessary to mention that this tool has no option of visual presentation. The sampling algorithm is shown briefly:
 
In expanded to include sharp contrast to exhaustive search, the computational time of the algorithm surprisingly is asymptotically independent of the network size. An analysis of the computational time of the algorithm has shown that it takes {{math|O(n<sup>n</sup>)}} for each sample of a sub-graph of size {{math|n}} from the network. On the other hand, there is no analysis in <ref name="kas1" /> on the classification time of sampled sub-graphs that requires solving the ''graph isomorphism'' problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation. But it is unavoidable to say that the algorithm may sample the same sub-graph multiple times – spending time without gathering any information.<ref name="wer1" /> In conclusion, by taking the advantages of sampling, the algorithm performs more efficiently than an exhaustive search algorithm; however, it only determines sub-graphs concentrations approximately.  This algorithm can find motifs up to size 6 because of its main implementation, and as result it gives the most significant motif, not all the others too. Also, it is necessary to mention that this tool has no option of visual presentation. The sampling algorithm is shown briefly:
第164行: 第167行:     
Schreiber和Schwöbbermeyer <ref name="schr1" />提出了一种称为灵活模式查找器(FPF)的算法,用于提取输入网络的频繁子图,并将其在名为Mavisto的系统中加以实现。<ref name="schr2">{{cite journal |vauthors=Schreiber F, Schwobbermeyer H |title=MAVisto: a tool for the exploration of network motifs |journal=Bioinformatics |volume=21 |issue=17|pages=3572–3574 |year=2005 |doi=10.1093/bioinformatics/bti556|pmid=16020473 |doi-access=free }}</ref> 他们的算法利用了向下闭包特性,该特性适用于频率概念<math>F_{2}</math>和<math>F_{3}</math>。向下闭包性质表明,子图的频率随着子图的大小而单调下降;但这一性质并不一定适用于频率概念<math>F_{1}</math>。FPF算法基于模式树(见右图),由代表不同图形(或模式)的节点组成,其中每个节点的父节点是其子节点的子图;换句话说,每个模式树节点的对应图通过向其父节点图添加新边来扩展。
 
Schreiber和Schwöbbermeyer <ref name="schr1" />提出了一种称为灵活模式查找器(FPF)的算法,用于提取输入网络的频繁子图,并将其在名为Mavisto的系统中加以实现。<ref name="schr2">{{cite journal |vauthors=Schreiber F, Schwobbermeyer H |title=MAVisto: a tool for the exploration of network motifs |journal=Bioinformatics |volume=21 |issue=17|pages=3572–3574 |year=2005 |doi=10.1093/bioinformatics/bti556|pmid=16020473 |doi-access=free }}</ref> 他们的算法利用了向下闭包特性,该特性适用于频率概念<math>F_{2}</math>和<math>F_{3}</math>。向下闭包性质表明,子图的频率随着子图的大小而单调下降;但这一性质并不一定适用于频率概念<math>F_{1}</math>。FPF算法基于模式树(见右图),由代表不同图形(或模式)的节点组成,其中每个节点的父节点是其子节点的子图;换句话说,每个模式树节点的对应图通过向其父节点图添加新边来扩展。
 +
    
[[Image:The pattern tree in FPF algorithm.jpg|right|thumb|''FPF算法中的模式树展示''.<ref name="schr1" />]]
 
[[Image:The pattern tree in FPF algorithm.jpg|right|thumb|''FPF算法中的模式树展示''.<ref name="schr1" />]]
 +
    
At first, the FPF algorithm enumerates and maintains the information of all matches of a sub-graph located at the root of the pattern tree. Then, one-by-one it builds child nodes of the previous node in the pattern tree by adding one edge supported by a matching edge in the target graph, and tries to expand all of the previous information about matches to the new sub-graph (child node). In next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and if downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure is continued until there is no remaining path to traverse.
 
At first, the FPF algorithm enumerates and maintains the information of all matches of a sub-graph located at the root of the pattern tree. Then, one-by-one it builds child nodes of the previous node in the pattern tree by adding one edge supported by a matching edge in the target graph, and tries to expand all of the previous information about matches to the new sub-graph (child node). In next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and if downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure is continued until there is no remaining path to traverse.
    
首先,FPF算法枚举并维护位于模式树根部的子图的所有匹配信息。然后,它通过在目标图中添加匹配边缘支持的一条边缘,在模式树中一一建立前一节点的子节点,然后通过在目标图中添加匹配边支持的一条边,逐个构建模式树中前一个节点的子节点,并尝试将以前关于匹配的所有信息拓展到新的子图(子节点)中。下一步,它判断当前模式的频率是否低于预定义的阈值。如果它低于阈值且保持向下闭包,则FPF算法会放弃该路径,而不会在树的此部分进一步遍历;这样就避免了不必要的计算。重复此过程,直到没有剩余可遍历的路径为止。
 
首先,FPF算法枚举并维护位于模式树根部的子图的所有匹配信息。然后,它通过在目标图中添加匹配边缘支持的一条边缘,在模式树中一一建立前一节点的子节点,然后通过在目标图中添加匹配边支持的一条边,逐个构建模式树中前一个节点的子节点,并尝试将以前关于匹配的所有信息拓展到新的子图(子节点)中。下一步,它判断当前模式的频率是否低于预定义的阈值。如果它低于阈值且保持向下闭包,则FPF算法会放弃该路径,而不会在树的此部分进一步遍历;这样就避免了不必要的计算。重复此过程,直到没有剩余可遍历的路径为止。
 +
    
The advantage of the algorithm is that it does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time for promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern tree notion permits FPF to be implemented and executed in a parallel manner since it is possible to traverse each path of the pattern tree independently. However, FPF is most useful for frequency concepts {{math|F<sub>2</sub>}} and {{math|F<sub>3</sub>}}, because downward closure is not applicable to {{math|F<sub>1</sub>}}. Nevertheless, the pattern tree is still practical for {{math|F<sub>1</sub>}} if the algorithm runs in parallel. Another advantage of the algorithm is that the implementation of this algorithm has no limitation on motif size, which makes it more amenable to improvements. The pseudocode of FPF (Mavisto) is shown below:
 
The advantage of the algorithm is that it does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time for promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern tree notion permits FPF to be implemented and executed in a parallel manner since it is possible to traverse each path of the pattern tree independently. However, FPF is most useful for frequency concepts {{math|F<sub>2</sub>}} and {{math|F<sub>3</sub>}}, because downward closure is not applicable to {{math|F<sub>1</sub>}}. Nevertheless, the pattern tree is still practical for {{math|F<sub>1</sub>}} if the algorithm runs in parallel. Another advantage of the algorithm is that the implementation of this algorithm has no limitation on motif size, which makes it more amenable to improvements. The pseudocode of FPF (Mavisto) is shown below:
106

个编辑