贝叶斯网络

来自集智百科 - 复杂系统|人工智能|复杂科学|复杂网络|自组织
薄荷讨论 | 贡献2021年1月24日 (日) 20:39的版本 →‎推理与学习
跳到导航 跳到搜索

此词条暂由水流心不竞初译,未经审校,带来阅读不便,请见谅。

本词条已由Ricky审校。


贝叶斯网络 Bayesian network(BN)信念网络 belief network决策网络 decision network贝叶斯模型 Bayes(ian) model概率有向无环图模型 probabilistic directed acyclic graphical model是一种概率图模型(一种统计模型),它通过有向无环图无环图 directed acyclic graph(DAG)表示一组随机变量及其条件依赖关系。贝叶斯网络是一种理想的分析工具,用来预测一个事件的发生是由已知原因中的哪一个(些)引起的。例如,贝叶斯网络可以表示疾病和症状之间的概率关系。在给定症状的情况下,该网络可用于计算各种疾病出现的概率。


贝叶斯网络有多种变体。用来建模序列变量(例如,语音信号或蛋白质序列)的贝叶斯网络被称为动态贝叶斯网络 dynamic Bayesian network。贝叶斯网络可以进一步扩展,用来表示和解决在不确定性因素下的决策问题,这种扩展称为影响图 influence diagram。现在已有高效的算法学习出贝叶斯网络的结构,并通过贝叶斯网络做推理。


图模型

在形式上,贝叶斯网络是有向无环图 Directed acyclic graph(DAG),其节点表示随机变量(其概率为贝叶斯概率); 它们可以是可观测量变量、隐变量、未知参数或假设。图中的边表示条件依赖;未连接(没有路径连接一个节点到另一个节点)的节点表示彼此之间条件独立。每个节点都与一个概率函数 Probability function 节点相关联,该函数把所有父节点代表的变量值作为输入,并给出该节点表示的随机变量的概率(或概率分布)。例如,如果 [math]\displaystyle{ m }[/math] 父节点表示 [math]\displaystyle{ m }[/math] 布尔变量,那么概率函数可以用一个包含[math]\displaystyle{ 2^m }[/math]行的表格表示,每一行代表一种(父节点)变量值的组合,以及对应的子节点变量的概率值。类似的想法可以应用于有环无向图,如马尔可夫网络 Markov network


举例

一个简单的贝叶斯网络,及其条件概率表

草地变得湿润,可能有两种原因:主动洒水或者下雨。雨对洒水车的使用有直接的影响(也就是说,当下雨时,洒水车通常是不工作的)。这种情况可以用贝叶斯网络来模拟(如右图所示)。每个变量有两个可能的值,T (表示真)和 F (表示假)。


对应的联合概率函数 Joint probability distribution是:


[math]\displaystyle{ \Pr(G,S,R)=\Pr(G\mid S,R) \Pr(S\mid R)\Pr(R) }[/math]


其中G表示“草地湿了(是/否)”,S表示“洒水器打开(是/否)”,R表示“下雨(是/否)”。


这个模型可以回答在给定一个结果的情况下一个原因是否存在的问题,比如“给定草是湿的,下雨的概率是多少? ”通过使用条件概率公式并对所有干扰变量 nuisance variable的求和:


[math]\displaystyle{ \Pr(R=T\mid G=T) =\frac{\Pr(G=T,R=T)}{\Pr(G=T)} = \frac{\sum_{S \in \{T, F\}}\Pr(G=T, S,R=T)}{\sum_{S, R \in \{T, F\}} \Pr(G=T,S,R)} }[/math]


展开概率函数[math]\displaystyle{ \Pr(G,S,R) }[/math] ,并使用图中列出的条件概率,我们可以算出分子和分母中的各个项。比如说,


[math]\displaystyle{ \begin{align} \Pr(G=T, S=T,R=T) & = \Pr(G=T\mid S=T,R=T)\Pr(S=T\mid R=T)\Pr(R=T) \\ & = 0.99 \times 0.01 \times 0.2 \\ & = 0.00198. \end{align} }[/math]


算出来的结果是


[math]\displaystyle{ \Pr(R=T\mid G=T) = \frac{ 0.00198_{TTT} + 0.1584_{TFT} }{ 0.00198_{TTT} + 0.288_{TTF} + 0.1584_{TFT} + 0.0_{TFF} } = \frac{891}{2491}\approx 35.77 \%. }[/math]


这个模型还回答干预性的问题,比如“现在我们把草弄湿了,那么下雨的可能性有多大? ”答案取决于干预后的联合分布函数:


[math]\displaystyle{ \Pr(S,R\mid\text{do}(G=T)) = \Pr(S\mid R) \Pr(R) }[/math]


该分布通过从干预前的分布中去除因子[math]\displaystyle{ \Pr(G\mid S,R) }[/math]得到,其中do算子强行使 G 的值为真。演算后可知下雨的可能性不受此干预的影响:


[math]\displaystyle{ \Pr(R\mid\text{do}(G=T)) = \Pr(R). }[/math]


现在再预测开启洒水装置的影响:


[math]\displaystyle{ \Pr(R,G\mid\text{do}(S=T)) = \Pr(R)\Pr(G\mid R,S=T) }[/math]


移除[math]\displaystyle{ \Pr(S=T\mid R) }[/math] 这个项表明这种行为影响的是草,而不是雨。


考虑到未观测变量,这些预测可能并不可行,就像大多数策略评估问题一样。但是,只要满足后门准则,仍然可以预测 [math]\displaystyle{ \text{do}(x) }[/math]的效果。[1][2] 如果一组观察到的变量Z能d-分隔(或阻塞)[3]XY 的所有"后门路径 back-door path,则有


[math]\displaystyle{ \Pr(Y,Z\mid\text{do}(x)) = \frac{\Pr(Y,Z,X=x)}{\Pr(X=x\mid Z)}. }[/math]


后门路径是一条箭头指向X的路径。满足后门标准的(观测变量)集合称为“充分的”或“有效的”。例如,集合 Z = R 能有效地预测 S = TG 的影响,因为 R d-分隔了(仅有的)后门路径 S ← R → G。但是,如果 S 没有被观测到,没有其他观测变量集合来 d-分隔这条路径,那就不能从观测数据中预测到“喷头被打开”(S = T)对于草地G的影响。在这种情况下,P(G | do(S = T))就没有被“识别”。这反映了一个事实:在缺乏干预性数据的情况下无法确认观察到的 SG 之间的依赖关系是不是一种因果关系(可能由共同原因引起的强相关,比如辛普森悖论)。


(由共同原因引起的明显依赖关系,r)。(见辛普森悖论 Simpson's paradox


为了确定一个因果关系是否可以从一个含有未观测变量的贝叶斯网络中识别出来,我们可以使用“ do-演算”[1][4] 的三个规则来检验是否所有的 do 项都可以从这个关系的表达式中去掉,从而确认所需的量可以从数据中估计出来。[5]


如果依赖关系在联合分布中是稀疏的(变量间依赖较少,即对应的图模型里的边较少),那么相对于存储一张完整的概率表,使用贝叶斯网络可以节省相当多的内存。例如,将10个二值变量的条件概率存储为一个表的,需要存储[math]\displaystyle{ 2^{10} = 1024 }[/math] 个值。而在每个变量最多依赖3个父变量的情况下,使用贝叶斯网络表示最多只存储[math]\displaystyle{ 10\cdot2^3 = 80 }[/math]个值。


相比于完全版的联合概率分布,理解(一组稀疏的)直接的变量间依赖关系和局部的概率分布对于人类来说要更加直观易懂。这正是贝叶斯网络的一个优点。


推理与学习

贝叶斯网络的推理和学习主要包含三个任务:

推断未观测变量

因为贝叶斯网络是变量及其关系的完整模型,所以它可以用来回答关于变量的概率问题。例如,当网络中某些变量被观测到时,也就是其确切值已知的情况下,我们可以更新对网络中其他未观测变量的认知。这种给定一些证据,然后计算的变量的后验概率的过程被称为概率推理 probabilistic inference。有时我们需要为某些变量赋值的同时最小化期望损失,此时后验概率为一些检测应用,例如检测决策错误的概率,提供了一个通用且充分的统计量。因此,贝叶斯网络可以被看作是一种应用贝叶斯定理解决复杂问题的自动化机制。


最常用的精确推理方法有: 变量消除法,即通过积分或求和的方式将未观察到的非查询变量逐一消除;团树传播算法,它将计算过程缓存,可以同时查询多个变量,并快速传播新的证据; 递归条件化 AND/OR 搜索法,它考虑了时间和空间取舍,并且在使用足够的空间时拥有与变量消除法相比拟的时间效率。所有这些方法的复杂度都随网络树宽的增长呈指数级上升。而最常见的近似推理算法有重要性抽样法 importance sampling随机 MCMC 模拟法 Markov chain Monte Carlo小桶消去法 循环置信传播法 loopy belief propagation广义置信传播法 generalized belief propagation变分法 variational Bayes


参数学习

为了获得一个完整的贝叶斯网络,从而获得一个完整的联合概率分布,需要详知道每个节点 X 的条件概率分布。X 的条件概率分布可以有各种各样的形式,最长见到、用到的分布是离散分布或高斯分布,因为计算简便。有时我们只知道一个分布应该符合哪些约束条件。此时可以使用最大熵原理来确定一个分布,即给定约束条件下熵最大的分布。类似地,在动态贝叶斯网络B的特定上下文中,隐变量的条件分布通常也被认为是能最大化其随机过程的熵率的那一个分布。


通常这些条件分布包括未知的参数,必须从数据中估计出来,例如,使用最大似然估计 maximum likelihood。给定未观测的变量,直接进行最大化似然(或后验概率)往往是很复杂的。这个问题的一个经典解决方案是EM算法。它利用观测数据计算出未观测变量的期望值,并假设先前计算的期望值是正确的,然后最大化完全似然(或后验),求期望和最大化两个步骤交替迭代进行。在某些条件下,这个过程能收敛于待估计参数的最大似然值(或最大后验值)。


一个更彻底的贝叶斯估计方法是将参数视为附加的未观测变量,并根据观测数据计算包含所有节点的完整后验概率分布,然后积分出参数。这种方法可能代价高昂,并得出一个大规模的模型,但能使经典的参数估计方法更易于处理。


结构学习

在最简单的情况下,一个贝叶斯网络可以有领域专家人工构建,然后用它来执行推理。在其他应用程序中,构建网络的任务对于人类来说过于复杂。这种情况就必须从数据中学习网络结构和各个变量的局部分布。


自动化学习出贝叶斯网络的图形结构是机器学习领域的一个挑战。其基本思想可以追溯到由 Rebane 和 Pearl 提出的恢复算法。[6]该算法的基础是三节点有向无环图中的三种可能模式:

Junction patterns
Pattern Model
Chain [math]\displaystyle{ X \rightarrow Y \rightarrow Z }[/math]
Fork [math]\displaystyle{ X \leftarrow Y \rightarrow Z }[/math]
Collider [math]\displaystyle{ X \rightarrow Y \leftarrow Z }[/math]

前2个种模式表示的依赖关系是相同的(在给定[math]\displaystyle{ Y }[/math] 时,[math]\displaystyle{ X }[/math][math]\displaystyle{ Z }[/math]是独立的),因此无法区分。然而,第三种模式确实可以被识别的,因为[math]\displaystyle{ X }[/math][math]\displaystyle{ Z }[/math]是边际独立的,而其他所有变量对都是依赖的。因此,虽然这三个节点的连接方式和前两种模式相同,但这种模式中箭头的方向是部分可识别的。当[math]\displaystyle{ X }[/math][math]\displaystyle{ Z }[/math]有着共同的父母时,这种不同也能被识别。这个算法已经发展到能系统地确定图的框架,然后由从数据中观察到的条件独立性确定所有箭头的方向。[1][7][8][9]


另一种结构学习的方法是基于优化的搜索。它需要一个评分函数和一个搜索策略。一个常见的评分函数是给定训练数据的结构的后验概率 posterior probability,比如 BIC 或 BDeu。穷举搜索可以获得能最大化得分函数的结构,但其时间要求随变量数增长呈超指数增长。局部搜索策略则进行增量改变,一步步迭代式改进结构的得分。像马尔可夫链蒙特卡洛这样的全局搜索算法可以避免陷入局部最优。 Friedman 等人[10][11] 认为可以使用变量之间的互信息,并找到一个讷讷感最大化互信息的结构,方法是通过将候选的父选节点集限制为k个节点并在其中进行穷尽式搜索。


一个特别快速精确学习出贝叶斯网络的方法是把问题转化为一个最优化问题,并使用整数规划 integer program (IP)解决它。以切平面 Cutting-plane method的形式求解整数规划问题时,在整数规划中加入了不规则性约束,[12]这种方法可以处理多达100个变量的问题。


处理成千上万个变量的问题时,则要用到不同的方。一种方法是采样出一个节点的(拓扑)排序,然后根据这种排序找到最优的网络结构。这意味着搜索工作只需要在包含可能排序的空间中进行,这比整个网络结构的搜索空间小,因而更加方便。可以进行多次采样和评估,这种方法在现有文献中已被证明在变量数量巨大的情况下是最佳的。[13]


另一种方法是将重点放在具有可分解模型的子类上,其最大似然估计具有封闭形式。这样就有可能为数百个变量发现一致的结构。[14]


学习树宽有限的贝叶斯网络对于精确的,可追溯的推理是必要的,因为最坏情况下的推理复杂度也只是树宽 k 的指数级(在指数时间假设下)。然而,作为图表的全局属性,它大大增加了学习过程的难度。在这种情况下,可以使用 K 树进行有效的学习。[15]

统计学简介

给定观测数据[math]\displaystyle{ x\,\! }[/math] 和模型参数 [math]\displaystyle{ \theta }[/math],一个简单的贝叶斯分析任务就是用已知的先验概率[math]\displaystyle{ p(\theta) }[/math]和似然 [math]\displaystyle{ p(x\mid\theta) }[/math],计算出后验概率[math]\displaystyle{ p(\theta\mid x) \propto p(x\mid\theta)p(\theta) }[/math]


通常,先验概率[math]\displaystyle{ \theta }[/math]又依赖于并没有在似然中出现的参数[math]\displaystyle{ \varphi }[/math],因此,必须用似然[math]\displaystyle{ p(\theta\mid \varphi) }[/math]代替先验概率[math]\displaystyle{ p(\theta) }[/math],且需要用到新引入的参数[math]\displaystyle{ \varphi }[/math]的先验概率[math]\displaystyle{ p(\varphi) }[/math]。如此一来,新的后验概率就变成了


[math]\displaystyle{ p(\theta,\varphi\mid x) \propto p(x\mid\theta)p(\theta\mid\varphi)p(\varphi). }[/math]


这是 层次贝叶斯模型 hierarchical Bayes model最简单的例子。


这个过程可能会一直重复。例如,参数[math]\displaystyle{ \varphi }[/math]可能依次依赖于其他参数 [math]\displaystyle{ \psi\,\! }[/math],这就又需要[math]\displaystyle{ \psi\,\! }[/math]的先验概率。最终,这个层次化嵌套的过程必须终止,亦即某参数的先验概率不依赖于其他参数。


工业级案例

给定观测到的一组测量数据[math]\displaystyle{ x_1,\dots,x_n\,\! }[/math],每个数据点的误差都服从标准差 standard deviation[math]\displaystyle{ \sigma\,\! }[/math]正态分布 Normal distribution


[math]\displaystyle{ x_i \sim N(\theta_i, \sigma^2) }[/math]


假设我们想要估计[math]\displaystyle{ \theta_i }[/math],一种方法是使用最大似然法。由于每个观测值是独立的,似然分解和最大似然估计很简单:

[math]\displaystyle{ \theta_i = x_i. }[/math]


然而,如果每个数据点相关的,例如每个[math]\displaystyle{ \theta_i }[/math]本身就是从一个潜在的分布中采样出来的出来的,那么这就破坏了独立性,并意味着要用到一个更复杂的模型,例如:


[math]\displaystyle{ x_i \sim N(\theta_i,\sigma^2), }[/math]
[math]\displaystyle{ \theta_i\sim N(\varphi, \tau^2), }[/math]


其中[math]\displaystyle{ \varphi\sim\text{flat} }[/math], [math]\displaystyle{ \tau\sim\text{flat} \in (0,\infty) }[/math]是不准确的先验概率。这是一个确定的模型(即,模型参数存在唯一解),且每个[math]\displaystyle{ \theta_i }[/math]的后验分布会在最大似然估计中被移除,或者说收缩,换成了他们的均值。这种“收缩”是层次贝叶斯模型中的典型处理方法。


先验概率的约束条件

在层次结构模型中选择先验分布时需要特别注意,尤其是身处高层次的变量,比如上述例子中的变量 [math]\displaystyle{ \tau\,\! }[/math] 。常用的的先验分布,例如Jeffreys先验往往不起作用,因为后验概率不是正规化的,而最小化预期损失得出的估计通常也无效。


定义与概念

贝叶斯网络现在有好几个等价的定义,在下文的介绍中我们用到两个符号设 G = (V,E)是一个有向无环图,再设X = (Xv), vV为一组随机变量。


因子分解定义

若一组随机变量X的联合概率分布函数可以写成由几个单独的条件概率函数的乘积,则X是关于图G的贝叶斯网络:


[math]\displaystyle{ p (x) = \prod_{v \in V} p \left(x_v \,\big|\, x_{\operatorname{pa}(v)} \right) }[/math]


其中 pa(v)是 v的父变量的集合(即,这些顶点通过一条边直接指向v)。


对于任何一组随机变量,他们各种取值组合的联合概率都可以通过使用链式规则(给定一个 X 的拓扑排序)从条件概率中计算出来,如下所示:


[math]\displaystyle{ \operatorname P(X_1=x_1, \ldots, X_n=x_n) = \prod_{v=1}^n \operatorname P \left(X_v=x_v \mid X_{v+1}=x_{v+1}, \ldots, X_n=x_n \right) }[/math]


使用上面贝叶斯网络的定义,还可以写成这样:


[math]\displaystyle{ \operatorname P(X_1=x_1, \ldots, X_n=x_n) = \prod_{v=1}^n \operatorname P (X_v=x_v \mid X_j=x_j \text{ for each } X_j\, \text{ that is a parent of } X_v\, ) }[/math]


这两个表达式之间的区别是:给定父变量值,所有变量与他们的非后代变量条件独立。


局部马尔可夫性

若一组随机变量X满足局部马尔可夫性 local Markov property,即每个变量在给定父变量的情况下,条件独立于所有非后代变量,则称 X是关于 G 的一个贝叶斯网络: :


[math]\displaystyle{ X_v \perp\!\!\!\perp X_{V \,\smallsetminus\, \operatorname{de}(v)} \mid X_{\operatorname{pa}(v)} \quad\text{for all }v \in V }[/math]


其中 de(v)是节点v的后代集合,V \ de(v)是节点v的非后代集合。


这可以用类似于第一个定义的术语来表示,如


[math]\displaystyle{ \begin{align} & \operatorname P(X_v=x_v \mid X_i=x_i \text{ for each } X_i \text{ that is not a descendant of } X_v\, ) \\[6pt] = {} & P(X_v=x_v \mid X_j=x_j \text{ for each } X_j \text{ that is a parent of } X_v\, ) \end{align} }[/math]


The set of parents is a subset of the set of non-descendants because the graph is acyclic.

The set of parents is a subset of the set of non-descendants because the graph is acyclic.

因为贝叶斯网络是非循环的 acyclic,所以每个节点的父节点集同时也是该节点的是非后代节点集的一个子集。


生成一个贝叶斯网络

Developing a Bayesian network often begins with creating a DAG G such that X satisfies the local Markov property with respect to G. Sometimes this is a causal DAG. The conditional probability distributions of each variable given its parents in G are assessed. In many cases, in particular in the case where the variables are discrete, if the joint distribution of X is the product of these conditional distributions, then X is a Bayesian network with respect to G.[16]

Developing a Bayesian network often begins with creating a DAG G such that X satisfies the local Markov property with respect to G. Sometimes this is a causal DAG. The conditional probability distributions of each variable given its parents in G are assessed. In many cases, in particular in the case where the variables are discrete, if the joint distribution of X is the product of these conditional distributions, then X is a Bayesian network with respect to G.

生成一个贝叶斯网络通常从创建一个有向无环图G开始,然后使图G里的随机变量X满足的局部马尔可夫性。有时这个图同时还是一个因果图。接下来,图里的每一个随机变量的概率分布都要被估计出来。在许多情况下,特别是在变量都是离散的情况下,如果 X 的联合分布是这些单个随机变量的条件分布的乘积,那么 X 就是 G 的贝叶斯网络。


马尔科夫毯

The Markov blanket of a node is the set of nodes consisting of its parents, its children, and any other parents of its children. The Markov blanket renders the node independent of the rest of the network; the joint distribution of the variables in the Markov blanket of a node is sufficient knowledge for calculating the distribution of the node. X is a Bayesian network with respect to G if every node is conditionally independent of all other nodes in the network, given its Markov blanket.模板:Sfn

The Markov blanket of a node is the set of nodes consisting of its parents, its children, and any other parents of its children. The Markov blanket renders the node independent of the rest of the network; the joint distribution of the variables in the Markov blanket of a node is sufficient knowledge for calculating the distribution of the node. X is a Bayesian network with respect to G if every node is conditionally independent of all other nodes in the network, given its Markov blanket.

一个节点的马尔可夫毯是由其父节点、其子节点和其子节点的所有其他父节点组成的节点集。一个节点的马尔可夫毯可以使该节点独立于网络的其余部分。一个节点的马尔可夫毯中所有变量的联合分布是计算该节点分布的一个充分条件。如果网络中的每个节点再给定其马尔可夫毯的情况下,条件地独立于网络中的所有其他节点,那么 X 就是 G 的贝叶斯网络。


d-分隔

This definition can be made more general by defining the "d"-separation of two nodes, where d stands for directional.[17][18] We first define the "d"-separation of a trail and then we will define the "d"-separation of two nodes in terms of that.

This definition can be made more general by defining the "d"-separation of two nodes, where d stands for directional. We first define the "d"-separation of a trail and then we will define the "d"-separation of two nodes in terms of that.

使用d-分隔的概念,贝叶斯网络的定义可以更加通用,其中 d 代表有向(directed)。我们首先定义一条轨迹的d-分隔的线索,然后我们再定义两个节点间的d-分隔。


Let P be a trail from node u to v. A trail is a loop-free, undirected (i.e. all edge directions are ignored) path between two nodes. Then P is said to be d-separated by a set of nodes Z if any of the following conditions holds:

Let P be a trail from node u to v. A trail is a loop-free, undirected (i.e. all edge directions are ignored) path between two nodes. Then P is said to be d-separated by a set of nodes Z if any of the following conditions holds:

P 是从节点 u 到节点 v 的轨迹。轨迹是一条两个节点之间的无环、无向的(即所有边方向被忽略)的路径。如果下列任何一个条件成立,则称 P 被一组节点 Z 分隔:

  • P contains (but does not need to be entirely) a directed chain, [math]\displaystyle{ u \cdots \leftarrow m \leftarrow \cdots v }[/math] or [math]\displaystyle{ u \cdots \rightarrow m \rightarrow \cdots v }[/math], such that the middle node m is in Z,
  • “P”包含一条有向链[math]\displaystyle{ u \cdots \leftarrow m \leftarrow \cdots v }[/math] or [math]\displaystyle{ u \cdots \rightarrow m \rightarrow \cdots v }[/math], 其中间节点m属于点集Z
  • P contains a fork, [math]\displaystyle{ u \cdots \leftarrow m \rightarrow \cdots v }[/math], such that the middle node m is in Z, or
  • “P”包含一个分叉[math]\displaystyle{ u \cdots \leftarrow m \rightarrow \cdots v }[/math], 其中间节点m位于“Z”中,或
  • P contains an inverted fork (or collider), [math]\displaystyle{ u \cdots \rightarrow m \leftarrow \cdots v }[/math], such that the middle node m is not in Z and no descendant of m is in Z.
  • “P”包含一个倒叉(或称对撞),[math]\displaystyle{ u \cdots \rightarrow m \leftarrow \cdots v }[/math],其中间节点m不在Z中,在Z中也没有m的后代节点。


The nodes u and v are d-separated by Z if all trails between them are d-separated. If u and v are not d-separated, they are d-connected.

The nodes u and v are d-separated by Z if all trails between them are d-separated. If u and v are not d-separated, they are d-connected.

如果节点 uv 之间的所有轨迹都是d-分隔的,则称节点 uvZ 分开。如果 'uv 不是d-分隔的,则称它们是 ‘’d‘’-连通的。


X is a Bayesian network with respect to G if, for any two nodes u, v:

X is a Bayesian network with respect to G if, for any two nodes u, v:

我们称XG的贝叶斯网络,当对于图G中任意两个节点 uv满足:


[math]\displaystyle{ X_u \perp\!\!\!\perp X_v \mid X_Z }[/math]


where Z is a set which d-separates u and v. (The Markov blanket is the minimal set of nodes which d-separates node v from all other nodes.)

where Z is a set which d-separates u and v. (The Markov blanket is the minimal set of nodes which d-separates node v from all other nodes.)

其中 Z 是一个将 uv进行了 d-分隔的集合(马尔可夫毯其实就是将节点 v 与其他所有节点d-分隔的最小节点集合)。


因果网络

Although Bayesian networks are often used to represent causal relationships, this need not be the case: a directed edge from u to v does not require that Xv be causally dependent on Xu. This is demonstrated by the fact that Bayesian networks on the graphs:

Although Bayesian networks are often used to represent causal relationships, this need not be the case: a directed edge from u to v does not require that Xv be causally dependent on Xu. This is demonstrated by the fact that Bayesian networks on the graphs:

虽然贝叶斯网络经常被用来表示因果关系,但这种情形并不是必须的: 从 uv 的有向边并不意味着 Xv一定是Xu导致的结果。下面这两个贝叶斯网络是等价的:


[math]\displaystyle{ a \rightarrow b \rightarrow c \qquad \text{and} \qquad a \leftarrow b \leftarrow c }[/math]


are equivalent: that is they impose exactly the same conditional independence requirements.

are equivalent: that is they impose exactly the same conditional independence requirements.

也就是说,它们表示的条件独立要求完全相同。


A causal network is a Bayesian network with the requirement that the relationships be causal. The additional semantics of causal networks specify that if a node X is actively caused to be in a given state x (an action written as do(X = x)), then the probability density function changes to that of the network obtained by cutting the links from the parents of X to X, and setting X to the caused value x.[1] Using these semantics, the impact of external interventions from data obtained prior to intervention can be predicted.

A causal network is a Bayesian network with the requirement that the relationships be causal. The additional semantics of causal networks specify that if a node X is actively caused to be in a given state x (an action written as do(X = x)), then the probability density function changes to that of the network obtained by cutting the links from the parents of X to X, and setting X to the caused value x. Using these semantics, the impact of external interventions from data obtained prior to intervention can be predicted.

当一个贝叶斯网络中的节点关系全都是因果关系时,这网络才能被称为时因果网络。因果网络还规定,如果一个节点 X再被干预的情况下变成了状态 x (写作do(X = x)) ,那么概率密度函数就会发生改变,表示成一个从 X 的父节点到 X的链接被切断的网络,并将变量X的值设置为x。利用这些规则,我们可以在真正实施外部干预前,就能从数据中预测到实施干预后会产生的影响。


推理复杂度与近似算法

In 1990, while working at Stanford University on large bioinformatic applications, Cooper proved that exact inference in Bayesian networks is NP-hard.[19] This result prompted research on approximation algorithms with the aim of developing a tractable approximation to probabilistic inference. In 1993, Dagum and Luby proved two surprising results on the complexity of approximation of probabilistic inference in Bayesian networks.[20] First, they proved that no tractable deterministic algorithm can approximate probabilistic inference to within an absolute error ɛ < 1/2. Second, they proved that no tractable randomized algorithm can approximate probabilistic inference to within an absolute error ɛ < 1/2 with confidence probability greater than 1/2.

1990年,Cooper在斯坦福大学研究大规模生物信息学应用时,证明了贝叶斯网络中的精确推理是NP-hard(NP难问题,NP 是指非确定性多项式( non-deterministic polynomial ,缩写 NP ) 。 这个结果促进了近似算法的研究,目的是开发出一种易于处理的近似方法来进行概率推理。1993年,Dagum 和 Luby 证明了贝叶斯网络中近似概率推理复杂度的两个令人惊讶的结果。首先,他们证明了任何易于处理的确定性近似概率推断算法的绝对误差都不可能小于1/2。其次,他们证明了没有任何易于处理的随机化的近似概率推断算法可以在置信度大于1 / 2的情况下,误差小于1 / 2。


At about the same time, Roth proved that exact inference in Bayesian networks is in fact #P-complete (and thus as hard as counting the number of satisfying assignments of a conjunctive normal form formula (CNF) and that approximate inference within a factor 2n1−ɛ for every ɛ > 0, even for Bayesian networks with restricted architecture, is NP-hard.[21][22]

At about the same time, Roth proved that exact inference in Bayesian networks is in fact #P-complete (and thus as hard as counting the number of satisfying assignments of a conjunctive normal form formula (CNF) and that approximate inference within a factor 2n1−ɛ for every ɛ > 0, even for Bayesian networks with restricted architecture, is NP-hard.

与此同时,Roth 证明了贝叶斯网络中的精确推理实际上是 #P-完全的(因此就像计数使得一个合取范式表达式(CNF)为真的赋值的数量一样困难) 。即使对于受限结构的贝叶斯网络来说,在对于任意ɛ > 0,2n1−ɛ 的近似推理也是 NP-hard的。


In practical terms, these complexity results suggested that while Bayesian networks were rich representations for AI and machine learning applications, their use in large real-world applications would need to be tempered by either topological structural constraints, such as naïve Bayes networks, or by restrictions on the conditional probabilities. The bounded variance algorithm[23] was the first provable fast approximation algorithm to efficiently approximate probabilistic inference in Bayesian networks with guarantees on the error approximation. This powerful algorithm required the minor restriction on the conditional probabilities of the Bayesian network to be bounded away from zero and one by 1/p(n) where p(n) was any polynomial on the number of nodes in the network n.

In practical terms, these complexity results suggested that while Bayesian networks were rich representations for AI and machine learning applications, their use in large real-world applications would need to be tempered by either topological structural constraints, such as naïve Bayes networks, or by restrictions on the conditional probabilities. The bounded variance algorithm was the first provable fast approximation algorithm to efficiently approximate probabilistic inference in Bayesian networks with guarantees on the error approximation. This powerful algorithm required the minor restriction on the conditional probabilities of the Bayesian network to be bounded away from zero and one by 1/p(n) where p(n) was any polynomial on the number of nodes in the network n.

实际上,这些复杂性结果表明,虽然贝叶斯网络在人工智能和机器学习应用中是一种强大的表示,但它们在大型实际应用中需要对拓扑结构施加约束(如朴素贝叶斯网络)或对条件概率施加约束。有界方差算法 Bounded variance algorithm 是贝叶斯网络中第一个被证明在保证误差的情况下还能进行快速近似概率推理的算法。这个强大的算法需要对贝叶斯网络的条件概率进行细微的限制,使其处于[0 + 1/p(n), 1 - 1/p(n)]的区间内 ,其中 p (n)是网络中节点数n的任意多项式。


Software软件

! ——这个列表中的条目与来源于维基百科的文章同样“值得注意”。参见 WP: GNG 和 WP: WTAF。-->

Notable software for Bayesian networks include:

Notable software for Bayesian networks include:

著名的贝叶斯网络软件包括:

  • Just another Gibbs sampler (JAGS) – Open-source alternative to WinBUGS. Uses Gibbs sampling.
  • Just another Gibbs sampler (JAGS) WinBUGS的开源替代品。使用了吉布斯采样。
  • OpenBUGS – Open-source development of WinBUGS.
  • OpenBUGS –WinBUGS的开源版本。
  • SPSS Modeler – Commercial software that includes an implementation for Bayesian networks.
  • SPSS Modeler –一个包括了贝叶斯网络实现的商业软件。
  • Stan (software) – Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler (NUTS),[24] a variant of Hamiltonian Monte Carlo.
  • Stan (software) – stan是一个开源软件包,用于使用No-U-Turn取样器(NUTS),[25] 是汉密尔顿蒙特卡洛方法的一个变体。
  • PyMC3 – A Python library implementing an embedded domain specific language to represent bayesian networks, and a variety of samplers (including NUTS)
  • PyMC3 – 一个python库,它实现了一个能用来表示贝叶斯网络的微型语言,以及各种采样器(包括No-U-Turn取样器)。
  • WinBUGS – One of the first computational implementations of MCMC samplers. No longer maintained.
  • WinBUGS –马尔可夫链蒙特卡罗采样器最早的实现之一,但这个软件已经不再维护。


History历史

The term Bayesian network was coined by Judea Pearl in 1985 to emphasize:

1985年,朱迪亚 · 珀尔创造了贝叶斯网络一词来强调:

  • the often subjective nature of the input information
  • 输入信息通常是主观的
  • the reliance on Bayes' conditioning as the basis for updating information
  • 依赖贝叶斯条件作为信息更新的基础
  • the distinction between causal and evidential modes of reasoning[26]
  • 因果推理和相关推理是有区别的

In the late 1980s Pearl's Probabilistic Reasoning in Intelligent Systems[27] and Neapolitan's Probabilistic Reasoning in Expert Systems[28] summarized their properties and established them as a field of study.

In the late 1980s Pearl's Probabilistic Reasoning in Intelligent Systems and Neapolitan's Probabilistic Reasoning in Expert Systems summarized their properties and established them as a field of study.

20世纪80年代后期,珀尔的《智能系统中的概率推理》和那不勒斯的《专家系统中的概率推理》总结了它们的性质,并将它们确立为一个研究领域。

更多

模板:Portal

模板:Portal

Notes注释

  1. 1.0 1.1 1.2 1.3 Pearl, Judea (2000). [[[:模板:Google books]] Causality: Models, Reasoning, and Inference]. Cambridge University Press. ISBN 978-0-521-77362-1. OCLC 42291253. 模板:Google books. 
  2. "The Back-Door Criterion" (PDF). Retrieved 2014-09-18.
  3. "d-Separation without Tears" (PDF). Retrieved 2014-09-18.
  4. Pearl J (1994). Lopez de Mantaras R, Poole D (eds.). A Probabilistic Calculus of Actions. San Mateo CA: Morgan Kaufmann. pp. 454–462. arXiv:1302.6835. Bibcode:2013arXiv1302.6835P. ISBN 1-55860-332-8. {{cite conference}}: Unknown parameter |booktitle= ignored (help)
  5. "Identification of Conditional Interventional Distributions". Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. Corvallis, OR: AUAI Press. 2006. pp. 437–444. arXiv:1206.6876. 
  6. "The Recovery of Causal Poly-trees from Statistical Data". Proceedings, 3rd Workshop on Uncertainty in AI. Seattle, WA. 1987. pp. 222–228. arXiv:1304.2736. 
  7. Spirtes P, Glymour C (1991). "An algorithm for fast recovery of sparse causal graphs" (PDF). Social Science Computer Review. 9 (1): 62–72. doi:10.1177/089443939100900106.
  8. Spirtes, Peter; Glymour, Clark N.; Scheines, Richard (1993). [[[:模板:Google books]] Causation, Prediction, and Search] (1st ed.). Springer-Verlag. ISBN 978-0-387-97979-3. 模板:Google books. 
  9. Verma T, Pearl J (1991). Bonissone P, Henrion M, Kanal LN, Lemmer JF (eds.). [[[:模板:Google books]] Equivalence and synthesis of causal models]. Elsevier. pp. 255–270. ISBN 0-444-89264-8. {{cite conference}}: Check |url= value (help); Unknown parameter |booktitle= ignored (help)
  10. Friedman, Nir; Geiger, Dan; Goldszmidt, Moises (November 1997). "Bayesian Network Classifiers". Machine Learning. 29 (2–3): 131–163. doi:10.1023/A:1007465528199. {{cite journal}}: Unknown parameter |name-list-format= ignored (help)
  11. Friedman N, Linial M, Nachman I, Pe'er D (August 2000). "Using Bayesian networks to analyze expression data". Journal of Computational Biology. 7 (3–4): 601–20. CiteSeerX 10.1.1.191.139. doi:10.1089/106652700750050961. PMID 11108481.
  12. Cussens, James (2011). "Bayesian network learning with cutting planes" (PDF). Proceedings of the 27th Conference Annual Conference on Uncertainty in Artificial Intelligence: 153–160. arXiv:1202.3713. Bibcode:2012arXiv1202.3713C. {{cite journal}}: Unknown parameter |name-list-format= ignored (help)
  13. "Learning Bayesian Networks with Thousands of Variables". NIPS-15: Advances in Neural Information Processing Systems. 28. 2015. pp. 1855–1863. https://papers.nips.cc/paper/5803-learning-bayesian-networks-with-thousands-of-variables. 
  14. Petitjean F, Webb GI, Nicholson AE (2013). Scaling log-linear analysis to high-dimensional data (PDF). International Conference on Data Mining. Dallas, TX, USA: IEEE.
  15. M. Scanagatta, G. Corani, C. P. de Campos, and M. Zaffalon. Learning Treewidth-Bounded Bayesian Networks with Thousands of Variables. In NIPS-16: Advances in Neural Information Processing Systems 29, 2016.
  16. Neapolitan, Richard E. (2004). [[[:模板:Google books]] Learning Bayesian networks]. Prentice Hall. ISBN 978-0-13-012534-7. 模板:Google books. 
  17. Geiger, Dan; Verma, Thomas; Pearl, Judea (1990). "Identifying independence in Bayesian Networks" (PDF). Networks. 20: 507–534. doi:10.1177/089443939100900106. {{cite journal}}: Unknown parameter |name-list-format= ignored (help)
  18. Richard Scheines, D-separation
  19. Cooper GF (1990). "The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks" (PDF). Artificial Intelligence. 42 (2–3): 393–405. doi:10.1016/0004-3702(90)90060-d.
  20. Dagum P, Luby M (1993). "Approximating probabilistic inference in Bayesian belief networks is NP-hard". Artificial Intelligence. 60 (1): 141–153. CiteSeerX 10.1.1.333.1586. doi:10.1016/0004-3702(93)90036-b.
  21. D. Roth, On the hardness of approximate reasoning, IJCAI (1993)
  22. D. Roth, On the hardness of approximate reasoning, Artificial Intelligence (1996)
  23. Dagum P, Luby M (1997). "An optimal approximation algorithm for Bayesian inference". Artificial Intelligence. 93 (1–2): 1–27. CiteSeerX 10.1.1.36.7946. doi:10.1016/s0004-3702(97)00013-1. Archived from the original on 2017-07-06. Retrieved 2015-12-19.
  24. 模板:Cite document
  25. 模板:Cite document
  26. Bayes, T.; Price (1763). "An Essay towards solving a Problem in the Doctrine of Chances". Philosophical Transactions of the Royal Society. 53: 370–418. doi:10.1098/rstl.1763.0053. {{cite journal}}: Unknown parameter |name-list-format= ignored (help)
  27. [[[:模板:Google books]] Probabilistic Reasoning in Intelligent Systems]. San Francisco CA: Morgan Kaufmann. 1988-09-15. pp. 1988. ISBN 978-1558604797. 模板:Google books. 
  28. Neapolitan, Richard E. (1989). [[[:模板:Google books]] Probabilistic reasoning in expert systems: theory and algorithms]. Wiley. ISBN 978-0-471-61840-9. 模板:Google books. 

References 参考文献

  • Castillo, Enrique; Gutiérrez, José Manuel; Hadi, Ali S. (1997). "Learning Bayesian Networks". Expert Systems and Probabilistic Network Models. Monographs in computer science. New York: Springer-Verlag. pp. 481–528. ISBN 978-0-387-94858-4. 
  • Gelman, Andrew; Carlin, John B; Stern, Hal S; Rubin, Donald B (2003). [[[:模板:Google books]] "Part II: Fundamentals of Bayesian Data Analysis: Ch.5 Hierarchical models"]. [[[:模板:Google books]] Bayesian Data Analysis]. CRC Press. pp. 120–. ISBN 978-1-58488-388-3. 模板:Google books. 
Also appears as Heckerman, David (March 1997). "Bayesian Networks for Data Mining". Data Mining and Knowledge Discovery. 1 (1): 79–119. doi:10.1023/A:1009730122752.

Also appears as

也显示为

An earlier version appears as Technical Report MSR-TR-95-06, Microsoft Research, March 1, 1995. The paper is about both parameter and structure learning in Bayesian networks.

An earlier version appears as Technical Report MSR-TR-95-06, Microsoft Research, March 1, 1995. The paper is about both parameter and structure learning in Bayesian networks.

早期的版本是[ https://web.archive.org/web/20060719171558/http://Research.Microsoft.com/Research/pubs/view.aspx?msr_tr_id=MSR-TR-95-06技术报告 MSR-TR-95-06] ,微软研究院,1995年3月1日。本文主要研究贝叶斯网络的参数学习和结构学习。

  • Lunn D, Spiegelhalter D, Thomas A, Best N (November 2009). "The BUGS project: Evolution, critique and future directions". Statistics in Medicine. 28 (25): 3049–67. doi:10.1002/sim.3680. PMID 19630097.


延伸阅读

  • Conrady, Stefan; Jouffe, Lionel (2015-07-01). [[[:模板:Google books]] Bayesian Networks and BayesiaLab – A practical introduction for researchers]. Franklin, Tennessee: Bayesian USA. ISBN 978-0-9965333-0-0. 模板:Google books. 
  • Kruse, Rudolf; Borgelt, Christian; Klawonn, Frank; Moewes, Christian; Steinbrecher, Matthias; Held, Pascal (2013). [[[:模板:Google books]] Computational Intelligence A Methodological Introduction]. London: Springer-Verlag. ISBN 978-1-4471-5012-1. 模板:Google books. 
  • Borgelt, Christian; Steinbrecher, Matthias; Kruse, Rudolf (2009). [[[:模板:Google books]] Graphical Models – Representations for Learning, Reasoning and Data Mining] (Second ed.). Chichester: Wiley. ISBN 978-0-470-74956-2. 模板:Google books. 


外部链接

Category:Graphical models

类别: 图形模型

Category:Causality

分类: 因果关系

Category:Causal inference

类别: 因果推理


This page was moved from wikipedia:en:Bayesian network. Its edit history can be viewed at 贝叶斯网络/edithistory