Bayesian inference


Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Introduction to Bayes' rule

[Figure: Bayes theorem visualisation] A geometric visualisation of Bayes' theorem. In the table, the values 2, 3, 6 and 9 give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded. This shows that P(A|B) P(B) = P(B|A) P(A), i.e. P(A|B) = P(B|A) P(A) / P(B). Similar reasoning can be used to show that P(¬A|B) = P(B|¬A) P(¬A) / P(B), etc.

Formal definition

Contingency table

                   Satisfies hypothesis H          Violates hypothesis ¬H            Total
  Has evidence E   P(H|E)·P(E) = P(E|H)·P(H)       P(¬H|E)·P(E) = P(E|¬H)·P(¬H)      P(E)
  No evidence ¬E   P(H|¬E)·P(¬E) = P(¬E|H)·P(H)    P(¬H|¬E)·P(¬E) = P(¬E|¬H)·P(¬H)   P(¬E) = 1−P(E)
  Total            P(H)                            P(¬H) = 1−P(H)                    1

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

[math]\displaystyle{ P(H\mid E) = \frac{P(E\mid H) \cdot P(H)}{P(E)} }[/math]

where:

  • [math]\displaystyle{ \textstyle H }[/math] stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
  • [math]\displaystyle{ \textstyle P(H) }[/math], the prior probability, is the estimate of the probability of the hypothesis [math]\displaystyle{ \textstyle H }[/math] before the data [math]\displaystyle{ \textstyle E }[/math], the current evidence, is observed.
  • [math]\displaystyle{ \textstyle E }[/math], the evidence, corresponds to new data that were not used in computing the prior probability.
  • [math]\displaystyle{ \textstyle P(H\mid E) }[/math], the posterior probability, is the probability of [math]\displaystyle{ \textstyle H }[/math] given [math]\displaystyle{ \textstyle E }[/math], i.e., after [math]\displaystyle{ \textstyle E }[/math] is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
  • [math]\displaystyle{ \textstyle P(E\mid H) }[/math] is the probability of observing [math]\displaystyle{ \textstyle E }[/math] given [math]\displaystyle{ \textstyle H }[/math], and is called the likelihood. As a function of [math]\displaystyle{ \textstyle E }[/math] with [math]\displaystyle{ \textstyle H }[/math] fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, [math]\displaystyle{ \textstyle E }[/math], while the posterior probability is a function of the hypothesis, [math]\displaystyle{ \textstyle H }[/math].
  • [math]\displaystyle{ \textstyle P(E) }[/math] is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis [math]\displaystyle{ \textstyle H }[/math] does not appear anywhere in the symbol, unlike for all the other factors), so this factor does not enter into determining the relative probabilities of different hypotheses.


For different values of [math]\displaystyle{ \textstyle H }[/math], only the factors [math]\displaystyle{ \textstyle P(H) }[/math] and [math]\displaystyle{ \textstyle P(E\mid H) }[/math], both in the numerator, affect the value of [math]\displaystyle{ \textstyle P(H\mid E) }[/math] – the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

Bayes' rule can also be written as follows:

[math]\displaystyle{ \begin{align}P(H\mid E) &= \frac{P(E\mid H) P(H)}{P(E)} \\ \\ &= \frac{P(E\mid H) P(H)}{P(E\mid H) P(H) + P(E\mid \neg H) P(\neg H)} \\ \\ &= \frac{1}{1 + \left(\frac{1}{P(H)}-1 \right) \frac{P(E\mid \neg H)}{P(E\mid H)} } \\ \end{align} }[/math]

because

[math]\displaystyle{ P(E) = P(E\mid H) P(H) + P(E\mid \neg H) P(\neg H) }[/math]

and

[math]\displaystyle{ P(H) + P(\neg H) = 1 }[/math]

where [math]\displaystyle{ \neg H }[/math] is "not [math]\displaystyle{ \textstyle H }[/math]", the logical negation of [math]\displaystyle{ \textstyle H }[/math].



One quick and easy way to remember the equation is to use the rule of multiplication:

[math]\displaystyle{ P(E\cap H) = P(E\mid H) P(H) = P(H\mid E) P(E) }[/math]
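To see these identities at work, the short sketch below (pure Python; all numbers are hypothetical) computes the posterior P(H|E) from a prior P(H) and the two likelihoods P(E|H) and P(E|¬H), using the expanded denominator shown above.

```python
# A minimal numeric sketch of Bayes' rule; all numbers are hypothetical.
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) using P(E) = P(E|H)P(H) + P(E|~H)P(~H)."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e

# Prior belief 0.5; the evidence is three times as likely under H as under ~H.
print(posterior(p_h=0.5, p_e_given_h=0.75, p_e_given_not_h=0.25))  # 0.75

# The rule of multiplication holds by construction:
# P(E and H) = P(E|H) P(H) = P(H|E) P(E) = 0.375
```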

Alternatives to Bayesian updating

Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote:[1][2] "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian."

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.[3] The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.[4]

Formal description of Bayesian inference

Definitions

  • [math]\displaystyle{ x }[/math], a data point in general. This may in fact be a vector of values.
  • [math]\displaystyle{ \theta }[/math], the parameter of the data point's distribution, i.e., [math]\displaystyle{ x \sim p(x \mid \theta) }[/math] . This may be a vector of parameters.
  • [math]\displaystyle{ \alpha }[/math], the hyperparameter of the parameter distribution, i.e., [math]\displaystyle{ \theta \sim p(\theta \mid \alpha) }[/math] . This may be a vector of hyperparameters.
  • [math]\displaystyle{ \mathbf{X} }[/math] is the sample, a set of [math]\displaystyle{ n }[/math] observed data points, i.e., [math]\displaystyle{ x_1,\ldots,x_n }[/math].
  • [math]\displaystyle{ \tilde{x} }[/math], a new data point whose distribution is to be predicted.

Bayesian inference

  • The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. [math]\displaystyle{ p(\theta \mid \alpha) }[/math] . The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
  • The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. [math]\displaystyle{ p(\mathbf{X} \mid \theta) }[/math] . This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written [math]\displaystyle{ \operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta) }[/math] .
  • The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. [math]\displaystyle{ p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) p(\theta \mid \alpha) \operatorname{d}\!\theta }[/math] .
  • The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:
[math]\displaystyle{ p(\theta \mid \mathbf{X},\alpha) = \frac{p(\theta,\mathbf{X},\alpha)}{p(\mathbf{X},\alpha)} = \frac{p(\mathbf{X}\mid\theta,\alpha)p(\theta,\alpha)}{p(\mathbf{X}\mid\alpha)p(\alpha)} = \frac{p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha) }[/math].

  • In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution [math]\displaystyle{ p(\theta \mid \mathbf{X},\alpha) }[/math] is not obtained in closed form, mainly because the parameter space for [math]\displaystyle{ \theta }[/math] can be very high-dimensional, or because the Bayesian model retains a certain hierarchical structure formulated from the observations [math]\displaystyle{ \mathbf{X} }[/math] and parameter [math]\displaystyle{ \theta }[/math]. In such situations, we need to resort to approximation techniques.[5]
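One common family of such approximation techniques is Markov chain Monte Carlo. The sketch below is a minimal random-walk Metropolis sampler for a one-dimensional posterior known only up to a normalizing constant; the model (Normal likelihood, standard Cauchy prior) and the data are made up for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical model: x_i ~ Normal(theta, 1), with a standard Cauchy prior on theta.
data = [1.2, 0.7, 1.9, 1.4, 0.8]

def log_unnorm_posterior(theta):
    """log p(theta | X) up to an additive constant: log-likelihood + log-prior."""
    log_lik = sum(-0.5 * (x - theta) ** 2 for x in data)
    log_prior = -math.log(1.0 + theta ** 2)
    return log_lik + log_prior

# Random-walk Metropolis needs only the unnormalized density.
theta, samples = 0.0, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.5)
    delta = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
    if random.random() < math.exp(min(0.0, delta)):
        theta = proposal
    samples.append(theta)

kept = samples[5000:]          # discard burn-in
print(sum(kept) / len(kept))   # Monte Carlo estimate of the posterior mean
```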

Bayesian prediction

  • The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:

[math]\displaystyle{ p(\tilde{x} \mid \mathbf{X},\alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) \operatorname{d}\!\theta }[/math]

  • The prior predictive distribution is the distribution of a new data point, marginalized over the prior:

[math]\displaystyle{ p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \alpha) \operatorname{d}\!\theta }[/math]

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s)—e.g., by maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.

(In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics, when constructed from a normal distribution with unknown mean and variance, are constructed using a Student's t-distribution. This correctly estimates the variance, due to the facts that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.)

Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
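As a concrete sketch of the last point, consider a Bernoulli likelihood with a conjugate Beta(a, b) prior; both predictive distributions are then Bernoulli (the Beta-Bernoulli compound), and only the hyperparameters differ. The numbers below are hypothetical.

```python
# Prior vs posterior predictive under Beta-Bernoulli conjugacy.
a, b = 2.0, 2.0                 # hyperparameters of the Beta prior

# Prior predictive: P(x_new = 1) = E[theta] under the prior.
prior_pred = a / (a + b)        # 0.5

# Observe n = 10 trials with s = 7 successes; conjugacy gives Beta(a+s, b+n-s).
n, s = 10, 7
a_post, b_post = a + s, b + n - s

# Posterior predictive: same compound family, updated hyperparameters.
post_pred = a_post / (a_post + b_post)   # 9/14 ~ 0.643

print(prior_pred, post_pred)
```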

Inference over exclusive and exhaustive possibilities

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

[Figure: Bayesian inference event space] Diagram illustrating event space [math]\displaystyle{ \Omega }[/math] in the general formulation of Bayesian inference. Although this diagram shows discrete models and events, the continuous case may be visualized similarly using probability densities.

Suppose a process is generating independent and identically distributed events [math]\displaystyle{ E_n, \,\, n=1,2,3,\ldots }[/math], but the probability distribution is unknown. Let the event space [math]\displaystyle{ \Omega }[/math] represent the current state of belief for this process. Each model is represented by event [math]\displaystyle{ M_m }[/math]. The conditional probabilities [math]\displaystyle{ P(E_n \mid M_m) }[/math] are specified to define the models. [math]\displaystyle{ P(M_m) }[/math] is the degree of belief in [math]\displaystyle{ M_m }[/math]. Before the first inference step, [math]\displaystyle{ \{P(M_m)\} }[/math] is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate [math]\displaystyle{ \textstyle E \in \{E_n\} }[/math]. For each [math]\displaystyle{ M \in \{M_m\} }[/math], the prior [math]\displaystyle{ P(M) }[/math] is updated to the posterior [math]\displaystyle{ P(M \mid E) }[/math]. From Bayes' theorem:[6]

[math]\displaystyle{ P(M \mid E) = \frac{P(E \mid M)}{\sum_m {P(E \mid M_m) P(M_m)}} \cdot P(M) }[/math]

Upon observation of further evidence, this procedure may be repeated.

[Figure: Venn diagram for the fundamental sets frequently used in Bayesian inference and computations][7]

Multiple observations

For a sequence of independent and identically distributed observations [math]\displaystyle{ \mathbf{E} = (e_1, \dots, e_n) }[/math], it can be shown by induction that repeated application of the above is equivalent to

[math]\displaystyle{ P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m {P(\mathbf{E} \mid M_m) P(M_m)}} \cdot P(M) }[/math]


where

[math]\displaystyle{ P(\mathbf{E} \mid M) = \prod_k{P(e_k \mid M)}. }[/math]
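A small sketch of this equivalence (hypothetical models: two coins with different head rates) checks that updating one observation at a time yields the same posterior as a single update with the product likelihood.

```python
# Sequential updating vs a single batch update with the product likelihood.
models = {"fair": 0.5, "biased": 0.8}          # P(heads | M), hypothetical
prior = {"fair": 0.5, "biased": 0.5}
evidence = [True, True, False, True]           # heads, heads, tails, heads

def update(belief, heads):
    """One application of Bayes' theorem over the discrete model set."""
    post = {m: belief[m] * (p if heads else 1 - p) for m, p in models.items()}
    z = sum(post.values())                     # the normalizing constant P(E)
    return {m: v / z for m, v in post.items()}

seq = prior
for e in evidence:                             # posterior becomes the next prior
    seq = update(seq, e)

def product_likelihood(m):
    p = models[m]
    out = 1.0
    for e in evidence:
        out *= p if e else 1 - p
    return out

z = sum(product_likelihood(m) * prior[m] for m in models)
batch = {m: product_likelihood(m) * prior[m] / z for m in models}

print(seq)    # identical to `batch`, up to floating-point rounding
print(batch)
```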

Parametric formulation

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is however equally applicable to discrete distributions.

Let the vector [math]\displaystyle{ \mathbf{\theta} }[/math] span the parameter space. Let the initial prior distribution over [math]\displaystyle{ \mathbf{\theta} }[/math] be [math]\displaystyle{ p(\mathbf{\theta} \mid \mathbf{\alpha}) }[/math], where [math]\displaystyle{ \mathbf{\alpha} }[/math] is a set of parameters to the prior itself, or hyperparameters. Let [math]\displaystyle{ \mathbf{E} = (e_1, \dots, e_n) }[/math] be a sequence of independent and identically distributed event observations, where all [math]\displaystyle{ e_i }[/math] are distributed as [math]\displaystyle{ p(e \mid \mathbf{\theta}) }[/math] for some [math]\displaystyle{ \mathbf{\theta} }[/math]. Bayes' theorem is applied to find the posterior distribution over [math]\displaystyle{ \mathbf{\theta} }[/math]:

[math]\displaystyle{ \begin{align} p(\mathbf{\theta} \mid \mathbf{E},\mathbf{\alpha}) &= \frac{p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha})}{p(\mathbf{E} \mid \mathbf{\alpha})} \cdot p(\mathbf{\theta}\mid\mathbf{\alpha}) \\ &= \frac{p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha})}{\int p(\mathbf{E}|\mathbf{\theta},\mathbf{\alpha}) p(\mathbf{\theta} \mid \mathbf{\alpha}) \, d\mathbf{\theta}} \cdot p(\mathbf{\theta} \mid \mathbf{\alpha}) \end{align} }[/math]


where

[math]\displaystyle{ p(\mathbf{E} \mid \mathbf{\theta},\mathbf{\alpha}) = \prod_k p(e_k \mid \mathbf{\theta}) }[/math]
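When the posterior has no convenient closed form, the parametric update above can be approximated by discretizing [math]\displaystyle{ \theta }[/math] on a grid, as in this sketch (Bernoulli likelihood with a uniform prior; the data are hypothetical).

```python
# Grid approximation of the parametric posterior p(theta | E, alpha).
N = 1000
grid = [(i + 0.5) / N for i in range(N)]   # theta values in (0, 1)
prior = [1.0] * N                          # uniform prior density

evidence = [1, 1, 0, 1, 1, 0, 1]           # hypothetical Bernoulli draws
s = sum(evidence)
f = len(evidence) - s

# Unnormalized posterior: prior density times the product likelihood.
unnorm = [p * t ** s * (1 - t) ** f for p, t in zip(prior, grid)]

z = sum(unnorm) / N                        # Riemann sum approximating p(E | alpha)
posterior = [u / z for u in unnorm]        # densities that integrate to one

post_mean = sum(t * d for t, d in zip(grid, posterior)) / N
print(post_mean)                           # ~(s+1)/(n+2) = 6/9, Laplace's rule of succession
```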

Mathematical properties


Interpretation of factor

[math]\displaystyle{ \textstyle \frac{P(E \mid M)}{P(E)} \gt 1 \Rightarrow \textstyle P(E \mid M) \gt P(E) }[/math]. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, [math]\displaystyle{ \textstyle \frac{P(E \mid M)}{P(E)} = 1 \Rightarrow \textstyle P(E \mid M) = P(E) }[/math]. That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.


Cromwell's rule

If [math]\displaystyle{ P(M)=0 }[/math] then [math]\displaystyle{ P(M \mid E)=0 }[/math]. If [math]\displaystyle{ P(M)=1 }[/math], then [math]\displaystyle{ P(M|E)=1 }[/math]. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.


The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not [math]\displaystyle{ M }[/math]" in place of "[math]\displaystyle{ M }[/math]", yielding "if [math]\displaystyle{ 1 - P(M)=0 }[/math], then [math]\displaystyle{ 1 - P(M \mid E)=0 }[/math]", from which the result immediately follows.


Asymptotic behaviour of posterior

Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein-von Mises theorem gives that in the limit of infinite trials the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who published two seminal research papers, in 1963 [8] and 1965 [9], establishing when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein-von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later, in the 1980s and 1990s, Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.[10] To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.
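The finite case is easy to see in simulation: with a finite model set and many i.i.d. trials, the posterior piles up on the data-generating model regardless of the (non-degenerate) initial prior. The candidate rates below are hypothetical.

```python
import random

random.seed(1)

# Posterior concentration with i.i.d. trials and a finite model set.
models = {"p=0.3": 0.3, "p=0.5": 0.5, "p=0.7": 0.7}  # candidate head rates
true_p = 0.5
belief = {m: 1 / 3 for m in models}                  # uniform initial prior

for _ in range(3000):
    heads = random.random() < true_p
    belief = {m: belief[m] * (p if heads else 1 - p) for m, p in models.items()}
    z = sum(belief.values())
    belief = {m: v / z for m, v in belief.items()}   # renormalize each step

print(belief)   # nearly all mass on "p=0.5" after enough trials
```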

Conjugate priors

In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
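A standard instance is the Normal-Normal pair: with a Normal prior on the mean of a Normal likelihood with known variance, the posterior is again Normal and the update is a two-line closed form. The prior and data below are hypothetical.

```python
# Closed-form conjugate update for the mean of a Normal with known variance.
mu0, tau0_sq = 0.0, 4.0          # prior: mu ~ Normal(mu0, tau0_sq)
sigma_sq = 1.0                   # known observation variance
data = [1.2, 0.7, 1.9, 1.4, 0.8]

n = len(data)
xbar = sum(data) / n

post_precision = 1.0 / tau0_sq + n / sigma_sq        # precisions add
post_var = 1.0 / post_precision
post_mean = (mu0 / tau0_sq + n * xbar / sigma_sq) * post_var

print(post_mean, post_var)   # ~1.143 and ~0.190: a Normal posterior, no integration needed
```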

Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.

For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.[11]

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.[12]

[math]\displaystyle{ \tilde \theta = \operatorname{E}[\theta] = \int \theta \, p(\theta \mid \mathbf{X},\alpha) \, d\theta }[/math]

Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:[13]

[math]\displaystyle{ \{ \theta_{\text{MAP}}\} \subset \arg \max_\theta p(\theta \mid \mathbf{X},\alpha) . }[/math]

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics").[14]

The posterior predictive distribution of a new observation [math]\displaystyle{ \tilde{x} }[/math] (that is independent of previous observations) is determined by[15]

[math]\displaystyle{ p(\tilde{x}|\mathbf{X},\alpha) = \int p(\tilde{x},\theta \mid \mathbf{X},\alpha) \, d\theta = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) \, d\theta . }[/math]
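The sketch below reads the three point estimates discussed above (posterior mean, posterior median, MAP) off a grid approximation of a Beta(6, 3) posterior, i.e. the posterior from 5 successes in 7 trials under a uniform prior; the setup is hypothetical.

```python
# Posterior mean, median and MAP from a grid-approximated Beta(6, 3) posterior.
N = 10000
grid = [(i + 0.5) / N for i in range(N)]
dens = [t ** 5 * (1 - t) ** 2 for t in grid]    # unnormalized Beta(6, 3) density
z = sum(dens)
probs = [d / z for d in dens]                   # normalized grid masses

post_mean = sum(t * p for t, p in zip(grid, probs))   # 6/9 ~ 0.667

acc, post_median = 0.0, None
for t, p in zip(grid, probs):                   # first point where the CDF hits 1/2
    acc += p
    if acc >= 0.5:
        post_median = t
        break

post_map = max(zip(probs, grid))[1]             # argmax of the density, 5/7 ~ 0.714

print(post_mean, post_median, post_map)
```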

Examples

Probability of a hypothesis

Contingency table

              #1 (H1)   #2 (H2)   Total
  Plain, E       30        20       50
  Choc, ¬E       10        20       30
  Total          40        40       80

P(H1|E) = 30 / 50 = 0.6

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?


Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let [math]\displaystyle{ H_1 }[/math] correspond to bowl #1, and [math]\displaystyle{ H_2 }[/math] to bowl #2. It is given that the bowls are identical from Fred's point of view, thus [math]\displaystyle{ P(H_1)=P(H_2) }[/math], and the two must add up to 1, so both are equal to 0.5. The event [math]\displaystyle{ E }[/math] is the observation of a plain cookie. From the contents of the bowls, we know that [math]\displaystyle{ P(E \mid H_1) = 30/40 = 0.75 }[/math] and [math]\displaystyle{ P(E \mid H_2) = 20/40 = 0.5. }[/math] Bayes' formula then yields

[math]\displaystyle{ \begin{align} P(H_1 \mid E) &= \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1)\;+\;P(E \mid H_2)\,P(H_2)} \\ \\ \ & = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\ \\ \ & = 0.6 \end{align} }[/math]



Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, [math]\displaystyle{ P(H_1) }[/math], which was 0.5. After observing the cookie, we must revise the probability to [math]\displaystyle{ P(H_1 \mid E) }[/math], which is 0.6.
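The whole calculation fits in a few lines of Python:

```python
# The cookie example as code: two bowls, equal priors, a plain cookie observed.
priors = {"bowl1": 0.5, "bowl2": 0.5}
p_plain_given = {"bowl1": 30 / 40, "bowl2": 20 / 40}   # P(plain | bowl)

p_plain = sum(p_plain_given[b] * priors[b] for b in priors)   # P(E) = 0.625
posterior = {b: p_plain_given[b] * priors[b] / p_plain for b in priors}

print(posterior["bowl1"])   # 0.6, matching the calculation above
```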

Making a prediction

An archaeologist is working at a site thought to be from the medieval period, between the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?

The degree of belief in the continuous variable [math]\displaystyle{ C }[/math] (century) is to be calculated, with the discrete set of events [math]\displaystyle{ \{GD,G \bar D, \bar G D, \bar G \bar D\} }[/math] as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

[math]\displaystyle{ P(E=GD \mid C=c) = (0.01 + \frac{0.81-0.01}{16-11}(c-11))(0.5 - \frac{0.5-0.05}{16-11}(c-11)) }[/math]
[math]\displaystyle{ P(E=G \bar D \mid C=c) = (0.01 + \frac{0.81-0.01}{16-11}(c-11))(0.5 + \frac{0.5-0.05}{16-11}(c-11)) }[/math]
[math]\displaystyle{ P(E=\bar G D \mid C=c) = ((1-0.01) - \frac{0.81-0.01}{16-11}(c-11))(0.5 - \frac{0.5-0.05}{16-11}(c-11)) }[/math]
[math]\displaystyle{ P(E=\bar G \bar D \mid C=c) = ((1-0.01) - \frac{0.81-0.01}{16-11}(c-11))(0.5 + \frac{0.5-0.05}{16-11}(c-11)) }[/math]

Assume a uniform prior of [math]\displaystyle{ \textstyle f_C(c) = 0.2 }[/math], and that trials are independent and identically distributed. When a new fragment of type [math]\displaystyle{ e }[/math] is discovered, Bayes' theorem is applied to update the degree of belief for each [math]\displaystyle{ c }[/math]:

[math]\displaystyle{ f_C(c \mid E=e) = \frac{P(E=e \mid C=c)}{P(E=e)}f_C(c) = \frac{P(E=e \mid C=c)}{\int_{11}^{16}{P(E=e \mid C=c)f_C(c)dc}}f_C(c) }[/math]

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or [math]\displaystyle{ c=15.2 }[/math]. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein-von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events [math]\displaystyle{ \{GD,G \bar D, \bar G D, \bar G \bar D\} }[/math] is finite (see above section on asymptotic behaviour of the posterior).
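A sketch of that simulation, using the likelihoods defined above on a grid over c (with hypothetical random draws standing in for the 50 unearthed fragments):

```python
import random

random.seed(2)

def glaze(c):                      # P(glazed | c), linear in c as assumed above
    return 0.01 + (0.81 - 0.01) / 5 * (c - 11)

def decor(c):                      # P(decorated | c)
    return 0.50 - (0.50 - 0.05) / 5 * (c - 11)

def likelihood(c, g, d):
    """P(fragment type | c), with glaze and decoration independent."""
    return (glaze(c) if g else 1 - glaze(c)) * (decor(c) if d else 1 - decor(c))

N = 500
grid = [11 + 5 * (i + 0.5) / N for i in range(N)]   # c values in (11, 16)
belief = [1.0 / N] * N                              # uniform prior, f_C(c) = 0.2

true_c = 15.2                                       # the simulated site
for _ in range(50):                                 # unearth 50 fragments
    g = random.random() < glaze(true_c)
    d = random.random() < decor(true_c)
    belief = [b * likelihood(c, g, d) for b, c in zip(belief, grid)]
    z = sum(belief)
    belief = [b / z for b in belief]

for lo in (11, 12, 13, 14, 15):                     # posterior mass per century
    mass = sum(b for b, c in zip(belief, grid) if lo <= c < lo + 1)
    print(f"c in [{lo}, {lo + 1}): {mass:.3f}")
```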

In frequentist statistics and decision theory

A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.[16]

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.[17][18][19] For example:

  • "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."[16]
  • "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."[20]
  • "In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."[21]
  • "A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"[22]
  • "An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."[23]


Model selection

Bayesian methodology also plays a role in model selection, where the aim is to select one model from a set of competing models that represents most closely the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest posterior probability given the data is selected. The posterior probability of a model depends on the evidence, or marginal likelihood, which reflects the probability that the data is generated by the model, and on the prior belief of the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the Bayes factor. Since Bayesian model comparison is aimed at selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule [24] or the MAP probability rule.[25]
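A toy sketch of Bayesian model comparison: 10 coin flips with 9 heads, compared under a "fair coin" model and a model with a uniform prior on the head rate. The marginal likelihood of the second model integrates out the parameter in closed form; all numbers are hypothetical.

```python
from math import comb

n, k = 10, 9        # hypothetical data: 9 heads in 10 flips

# Marginal likelihood under M1 (fair coin, no free parameter).
evidence_m1 = comb(n, k) * 0.5 ** n            # ~0.0098

# Marginal likelihood under M2 (theta ~ Uniform(0, 1)): the integral of
# Binomial(n, theta) over theta equals 1 / (n + 1) for every k.
evidence_m2 = 1 / (n + 1)                      # ~0.0909

bayes_factor = evidence_m2 / evidence_m1
print(bayes_factor)   # ~9.3: the data favour M2

# With equal model priors, posterior odds equal the Bayes factor, and the
# MAP selection rule picks M2.
```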

Probabilistic programming

While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.[26][27][28]
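For illustration, here is a Beta-Bernoulli model written in PyMC, one such PPL; this assumes PyMC version 4 or later is installed and should be read as a sketch of the declarative style rather than a definitive usage guide.

```python
import pymc as pm   # assumes PyMC >= 4; treat this sketch as illustrative

data = [1, 1, 0, 1, 0, 1, 1]                    # hypothetical Bernoulli observations

with pm.Model():
    theta = pm.Beta("theta", alpha=1, beta=1)   # prior on the success rate
    pm.Bernoulli("obs", p=theta, observed=data) # likelihood of the data
    idata = pm.sample()                         # the PPL handles inference

# Model building and inference are separated: the model is three lines,
# and sampling, tuning and diagnostics are automatic.
print(idata.posterior["theta"].mean())
```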

Applications

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s.[29] There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like Gibbs sampling and other Metropolis–Hastings algorithm schemes.[30] Recently, Bayesian inference has gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been used to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naïve Bayes classifier.

Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor.[31] Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.[32][33]

Bioinformatics and healthcare applications

Bayesian inference has been applied in different Bioinformatics applications, including differential gene expression analysis.[34] Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.[35][36]

In the courtroom

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.[37][38][39] Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.
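A sketch of the odds and log-odds bookkeeping described here, with hypothetical likelihood ratios for three pieces of evidence:

```python
import math

# Odds form: each item of evidence multiplies the odds by its likelihood
# ratio P(evidence | guilty) / P(evidence | innocent); logs turn the
# products into sums. All numbers are hypothetical.
prior_odds = 1 / 999                       # e.g. one of 1,000 possible culprits
likelihood_ratios = [50.0, 3.0, 0.8]

log_odds = math.log(prior_odds) + sum(math.log(lr) for lr in likelihood_ratios)
posterior_odds = math.exp(log_odds)
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)                      # ~0.107 for these made-up numbers
```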

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.[40] For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin[41] argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A The known facts and testimony could have arisen if the defendant is guilty
B The known facts and testimony could have arisen if the defendant is innocent
C The defendant is guilty.


Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

Bayesian epistemology

Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes rule to make epistemological inferences:[42] It is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other

The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments.[43] Bayesian inference has also been applied by Cai et al. to stochastic scheduling problems with incomplete information.[44]

  • Bayesian search theory is used to search for lost objects.
  • Bayesian inference in phylogenetics.
  • Bayesian tools for methylation analysis.
  • Bayesian approaches to brain function, which investigate the brain as a Bayesian mechanism.
  • Bayesian inference in ecological studies.
  • Bayesian inference used to estimate parameters of stochastic chemical kinetic models.
  • Bayesian inference in econophysics for currency or stock-market prediction.
  • Bayesian inference in marketing.
  • Bayesian inference in motor learning.

History

The term Bayesian refers to Thomas Bayes (1702–1761), who proved that probabilistic limits could be placed on an unknown event.[50] However, it was Pierre-Simon Laplace (1749–1827) who introduced (as Principle VI) what is now called Bayes' theorem and used it to address problems in celestial mechanics, medical statistics, reliability, and jurisprudence.[51] Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes[52]). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.[52]

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed,[53] and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.[54] Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.[55] Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.[56]


References

Citations

  1. 1.0 1.1 Hacking, Ian (December 1967). "Slightly More Realistic Personal Probability". Philosophy of Science. 34 (4): 316. doi:10.1086/288169. S2CID 14344339.
  2. 2.0 2.1 Hacking (1988, p. 124)
  3. 3.0 3.1 "Bayes' Theorem (Stanford Encyclopedia of Philosophy)". Plato.stanford.edu. Retrieved 2014-01-05.
  4. 4.0 4.1 van Fraassen, B. (1989) Laws and Symmetry, Oxford University Press.
  5. 5.0 5.1 Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review". Communications in Statistics - Theory and Methods: 1–21. arXiv:2008.01006. doi:10.1080/03610926.2021.1921214. S2CID 220935477.
  6. 6.0 6.1 Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.;Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. .
  7. Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review". Communications in Statistics - Theory and Methods: 1–21. arXiv:2008.01006. doi:10.1080/03610926.2021.1921214. S2CID 220935477.
  8. 8.0 8.1 Freedman, DA (1963). "On the asymptotic behavior of Bayes' estimates in the discrete case". The Annals of Mathematical Statistics. 34 (4): 1386–1403. doi:10.1214/aoms/1177703871. JSTOR 2238346.
  9. 9.0 9.1 Freedman, DA (1965). "On the asymptotic behavior of Bayes estimates in the discrete case II". The Annals of Mathematical Statistics. 36 (2): 454–456. doi:10.1214/aoms/1177700155. JSTOR 2238150.
  10. 10.0 10.1 Robins, James; Wasserman, Larry (2000). "Conditioning, likelihood, and coherence: A review of some foundational concepts". JASA. 95 (452): 1340–1346. doi:10.1080/01621459.2000.10474344. S2CID 120767108.
  11. 11.0 11.1 Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM. 
  12. 12.0 12.1 Choudhuri, Nidhan; Ghosal, Subhashis; Roy, Anindya (2005-01-01). Bayesian Methods for Function Estimation. Bayesian Thinking. 25. pp. 373–414. doi:10.1016/s0169-7161(05)25013-7. ISBN 9780444515391. 
  13. 13.0 13.1 "Maximum A Posteriori (MAP) Estimation". www.probabilitycourse.com (in English). Retrieved 2017-06-02.
  14. 14.0 14.1 Yu, Angela. "Introduction to Bayesian Decision Theory" (PDF). cogsci.ucsd.edu/. Archived from the original (PDF) on 2013-02-28.
  15. Hitchcock, David. "Posterior Predictive Distribution Stat Slide" (PDF). stat.sc.edu.
  16. 16.0 16.1 16.2 Bickel & Doksum (2001, p. 32)
  17. 17.0 17.1 Kiefer, J.; Schwartz R. (1965). "Admissible Bayes Character of T2-, R2-, and Other Fully Invariant Tests for Multivariate Normal Problems". Annals of Mathematical Statistics. 36 (3): 747–770. doi:10.1214/aoms/1177700051.
  18. 18.0 18.1 Schwartz, R. (1969). "Invariant Proper Bayes Tests for Exponential Families". Annals of Mathematical Statistics. 40: 270–283. doi:10.1214/aoms/1177697822.
  19. 19.0 19.1 Hwang, J. T. & Casella, George (1982). "Minimax Confidence Sets for the Mean of a Multivariate Normal Distribution" (PDF). Annals of Statistics. 10 (3): 868–881. doi:10.1214/aos/1176345877.
  20. 20.0 20.1 Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.).  (see p. 309 of Chapter 6.7 "Admissibility", and pp. 17–18 of Chapter 1.8 "Complete Classes"
  21. 21.0 21.1 Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN 978-0-387-96307-5.  (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)
  22. 22.0 22.1 Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 432. ISBN 978-0-04-121537-3. 
  23. 23.0 23.1 Cox, D. R.; Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall. p. 433. ISBN 978-0-04-121537-3. )
  24. 24.0 24.1 Stoica, P.; Selen, Y. (2004). "A review of information criterion rules". IEEE Signal Processing Magazine. 21 (4): 36–47. doi:10.1109/MSP.2004.1311138. S2CID 17338979.
  25. 25.0 25.1 Fatermans, J.; Van Aert, S.; den Dekker, A.J. (2019). "The maximum a posteriori probability rule for atom column detection from HAADF STEM images". Ultramicroscopy. 201: 81–91. arXiv:1902.05809. doi:10.1016/j.ultramic.2019.02.003. PMID 30991277. S2CID 104419861.
  26. 26.0 26.1 Bessiere, P., Mazer, E., Ahuactzin, J. M., & Mekhnacha, K. (2013). Bayesian Programming (1 edition) Chapman and Hall/CRC.
  27. 27.0 27.1 Daniel Roy (2015). "Probabilistic Programming". probabilistic-programming.org. Archived from the original on 2016-01-10. Retrieved 2020-01-02.
  28. 28.0 28.1 Ghahramani, Z (2015). "Probabilistic machine learning and artificial intelligence". Nature. 521 (7553): 452–459. Bibcode:2015Natur.521..452G. doi:10.1038/nature14541. PMID 26017444. S2CID 216356.
  29. Fienberg, Stephen E. (2006-03-01). "When did Bayesian inference become "Bayesian"?". Bayesian Analysis. 1 (1). doi:10.1214/06-BA101.
  30. Jim Albert (2009). Bayesian Computation with R, Second edition. New York, Dordrecht, etc.: Springer. ISBN 978-0-387-92297-3. 
  31. Rathmanner, Samuel; Hutter, Marcus; Ormerod, Thomas C (2011). "A Philosophical Treatise of Universal Induction". Entropy. 13 (6): 1076–1136. arXiv:1105.5721. Bibcode:2011Entrp..13.1076R. doi:10.3390/e13061076. S2CID 2499910.
  32. 32.0 32.1 Hutter, Marcus; He, Yang-Hui; Ormerod, Thomas C (2007). "On Universal Prediction and Bayesian Confirmation". Theoretical Computer Science. 384 (2007): 33–48. arXiv:0709.1516. Bibcode:2007arXiv0709.1516H. doi:10.1016/j.tcs.2007.05.016. S2CID 1500830.
  33. 33.0 33.1 Gács, Peter; Vitányi, Paul M. B. (2 December 2010). "Raymond J. Solomonoff 1926-2009". CiteSeerX 10.1.1.186.8268.
  34. 34.0 34.1 Robinson, Mark D & McCarthy, Davis J & Smyth, Gordon K edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics.
  35. 35.0 35.1 "CIRI". ciri.stanford.edu. Retrieved 2019-08-11.
  36. 36.0 36.1 Kurtz, David M.; Esfahani, Mohammad S.; Scherer, Florian; Soo, Joanne; Jin, Michael C.; Liu, Chih Long; Newman, Aaron M.; Dührsen, Ulrich; Hüttmann, Andreas (2019-07-25). "Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction". Cell. 178 (3): 699–713.e19. doi:10.1016/j.cell.2019.06.011. ISSN 1097-4172. PMC 7380118. PMID 31280963.
  37. 37.0 37.1 Dawid, A. P. and Mortera, J. (1996) "Coherent Analysis of Forensic Identification Evidence". Journal of the Royal Statistical Society, Series B, 58, 425–443.
  38. 38.0 38.1 Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". Journal of the Royal Statistical Society, Series A, 160, 429–469.
  39. 39.0 39.1 Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester.
  40. 40.0 40.1 Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries. Archived at the Internet Archive, 2015-07-01.
  41. 41.0 41.1 Gardner-Medwin, A. (2005) "What Probability Should the Jury Address?". Significance, 2 (1), March 2005
  42. Miller, David (1994). Critical Rationalism. Chicago: Open Court. ISBN 978-0-8126-9197-9. https://books.google.com/books?id=bh_yCgAAQBAJ. 
  43. 43.0 43.1 Howson & Urbach (2005), Jaynes (2003)
  44. 44.0 44.1 Cai, X.Q.; Wu, X.Y.; Zhou, X. (2009). "Stochastic scheduling subject to breakdown-repeat breakdowns with incomplete information". Operations Research. 57 (5): 1236–1249. doi:10.1287/opre.1080.0660.
  45. Ogle, Kiona; Tucker, Colin; Cable, Jessica M. (2014-01-01). "Beyond simple linear mixing models: process-based isotope partitioning of ecological processes". Ecological Applications (in English). 24 (1): 181–195. doi:10.1890/1051-0761-24.1.181. ISSN 1939-5582. PMID 24640543.
  46. Evaristo, Jaivime; McDonnell, Jeffrey J.; Scholl, Martha A.; Bruijnzeel, L. Adrian; Chun, Kwok P. (2016-01-01). "Insights into plant water uptake from xylem-water isotope measurements in two tropical catchments with contrasting moisture conditions". Hydrological Processes (in English). 30 (18): 3210–3227. Bibcode:2016HyPr...30.3210E. doi:10.1002/hyp.10841. ISSN 1099-1085.
  47. Gupta, Ankur; Rawlings, James B. (April 2014). "Comparison of Parameter Estimation Methods in Stochastic Chemical Kinetic Models: Examples in Systems Biology". AIChE Journal. 60 (4): 1253–1268. doi:10.1002/aic.14409. ISSN 0001-1541. PMC 4946376. PMID 27429455.
  48. Fornalski, K.W. (2016). "The Tadpole Bayesian Model for Detecting Trend Changes in Financial Quotations" (PDF). R&R Journal of Statistics and Mathematical Sciences. 2 (1): 117–122.
  49. Schütz, N.; Holschneider, M. (2011). "Detection of trend changes in time series using Bayesian inference". Physical Review E. 84 (2): 021120. arXiv:1104.3448. Bibcode:2011PhRvE..84b1120S. doi:10.1103/PhysRevE.84.021120. PMID 21928962. S2CID 11460968.
  50. 50.0 50.1 Reference Needed
  51. 51.0 51.1 Stigler, Stephen M. (1986). "Chapter 3". The History of Statistics. Harvard University Press. ISBN 9780674403406. https://archive.org/details/historyofstatist00stig. 
  52. 52.0 52.1 52.2 Fienberg, Stephen E. (2006). "When did Bayesian Inference Become 'Bayesian'?". Bayesian Analysis. 1 (1): 1–40 [p. 5]. doi:10.1214/06-ba101.
  53. Bernardo, José-Miguel (2005). "Reference analysis". Handbook of statistics. 25. pp. 17–90. 
  54. Wolpert, R. L. (2004). "A Conversation with James O. Berger". Statistical Science. 19 (1): 205–218. CiteSeerX 10.1.1.71.6112. doi:10.1214/088342304000000053. MR 2082155.
  55. Bernardo, José M. (2006). "A Bayesian mathematical statistics primer" (PDF). Icots-7.
  56. Bishop, C. M. (2007). Pattern Recognition and Machine Learning. New York: Springer. ISBN 978-0387310732. 

Sources


  • Aster, Richard; Borchers, Brian; and Thurber, Clifford (2012). Parameter Estimation and Inverse Problems, Second Edition, Elsevier.
  • Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis, Wiley.
  • Jaynes, E. T. (2003). Probability Theory: The Logic of Science, CUP. (Link to Fragmentary Edition of March 1996).



Further reading

  • For a full report on the history of Bayesian statistics and the debates with frequentist approaches, read Vallverdu, Jordi (2016). Bayesians Versus Frequentists: A Philosophical Debate on Statistical Reasoning. New York: Springer. ISBN 978-3-662-48638-2.

Elementary

The following books are listed in ascending order of probabilistic sophistication: