The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It is not obvious that this will work, but it can be proven that in this context it does, and that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a maximum or a [[saddle point]].<ref name="Wu" /> In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have [[Mathematical singularity|singularities]] in them, i.e., nonsensical maxima. For example, one of the ''solutions'' that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points.
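This alternation is easy to make concrete. The following sketch is a minimal, hypothetical illustration (the routine name and all variable names are invented here, not taken from the article): it fits a two-component univariate Gaussian mixture by alternating between estimating the hidden component memberships and re-estimating the parameters, stopping once the log-likelihood has converged to a fixed point.

<syntaxhighlight lang="python">
import numpy as np

def em_two_gaussians(x, n_iter=200, tol=1e-8, seed=0):
    """Minimal EM sketch for a two-component 1-D Gaussian mixture.

    Alternates between estimating the unobserved memberships
    p(Z | X, theta^(t)) and re-estimating the parameters until the
    log-likelihood converges to a fixed point.
    """
    rng = np.random.default_rng(seed)
    # Arbitrary starting values for one set of unknowns (the parameters).
    mu = rng.choice(x, size=2, replace=False)
    var = np.full(2, x.var())
    w = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: estimate the other set of unknowns, the membership
        # probabilities of each point under the current parameters.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the memberships.
        # Note: var can collapse toward zero at a single data point,
        # the degenerate "solution" (singularity) mentioned above.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
        # Stop once the likelihood has (numerically) stopped improving.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, var, w
</syntaxhighlight>

Because only a local maximum (or saddle point) is guaranteed, the usual practice is to restart from several random initializations and keep the solution with the best likelihood.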
The EM algorithm was explained and given its name in a 1977 paper by [[Arthur P. Dempster|A. P. Dempster]], [[Nan Laird|N. M. Laird]], and [[Donald Rubin|D. B. Rubin]]. They pointed out that the method had been "proposed many times in special circumstances" by earlier authors. One of the earliest is the gene-counting method for estimating allele frequencies by [[Cedric Smith (statistician)|Cedric Smith]].<ref>{{cite journal |last1=Ceppelini |first1=R.M. |title=The estimation of gene frequencies in a random-mating population |journal=Ann. Hum. Genet. |volume=20 |issue=2 |pages=97–115 |doi=10.1111/j.1469-1809.1955.tb01360.x |pmid=13268982 |year=1955 |s2cid=38625779}}</ref> A very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers,<ref name="Sundberg1974">{{cite journal |last=Sundberg |first=Rolf |title=Maximum likelihood theory for incomplete data from an exponential family |journal=Scandinavian Journal of Statistics |volume=1 |year=1974 |issue=2 |pages=49–58 |jstor=4615553 |mr=381110}}</ref><ref name="Sundberg1971">Rolf Sundberg. 1971. ''Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable''. Dissertation, Institute for Mathematical Statistics, Stockholm University.</ref><ref name="Sundberg1976">{{cite journal |last=Sundberg |first=Rolf |year=1976 |title=An iterative method for solution of the likelihood equations for incomplete data from exponential families |journal=Communications in Statistics – Simulation and Computation |volume=5 |issue=1 |pages=55–64 |doi=10.1080/03610917608812007 |mr=443190}}</ref> following his collaboration with [[Per Martin-Löf]] and [[Anders Martin-Löf]].<ref>See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.</ref><ref>G. Kulldorff. 1961. ''Contributions to the theory of estimation from grouped and partially grouped samples''. Almqvist & Wiksell.</ref><ref name="Martin-Löf1963">Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")</ref><ref name="Martin-Löf1966">[[Per Martin-Löf]]. 1966. ''Statistics from the point of view of statistical mechanics''. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula" credited to Anders Martin-Löf).</ref><ref name="Martin-Löf1970">[[Per Martin-Löf]]. 1970. ''Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969–1970), with the assistance of Rolf Sundberg.'' Stockholm University. ("Sundberg formula")</ref><ref name="Martin-Löf1974a">Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, [[Arthur P. Dempster|A. P. Dempster]], [[D. Basu]], [[D. R. Cox]], [[A. W. F. Edwards]], D. A. Sprott, [[George A. Barnard|G. A. Barnard]], O. Barndorff-Nielsen, J. D. Kalbfleisch and [[Rasch model|G. Rasch]] and a reply by the author. ''Proceedings of Conference on Foundational Questions in Statistical Inference'' (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.</ref><ref name="Martin-Löf1974b">{{cite journal |last=Martin-Löf |first=Per |title=The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data |journal=Scand. J. Statist. |volume=1 |year=1974 |issue=1 |pages=3–18}}</ref> The Dempster–Laird–Rubin paper of 1977 generalized the method, sketched a convergence analysis for a wider class of problems, and established the EM method as an important tool of statistical analysis.

The convergence analysis of the Dempster–Laird–Rubin paper was flawed; a correct convergence analysis was published by [[C. F. Jeff Wu]] in 1983.<ref name="Wu">{{cite journal |first=C. F. Jeff |last=Wu |title=On the Convergence Properties of the EM Algorithm |journal=[[Annals of Statistics]] |volume=11 |issue=1 |date=Mar 1983 |pages=95–103 |jstor=2240463 |doi=10.1214/aos/1176346060 |mr=684867 |doi-access=free}}</ref> Wu's proof established the EM method's convergence outside of the [[exponential family]], as claimed by Dempster–Laird–Rubin.<ref name="Wu" />
We take the expectation over possible values of the unknown data <math>\mathbf{Z}</math> under the current parameter estimate <math>\boldsymbol\theta^{(t)}</math> by multiplying both sides by <math>p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)})</math> and summing (or integrating) over <math>\mathbf{Z}</math>. The left-hand side is the expectation of a constant, so we get:

<math>
\begin{align}
\log p(\mathbf{X}\mid\boldsymbol\theta) &
= \sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}) \log p(\mathbf{X},\mathbf{Z}\mid\boldsymbol\theta)
- \sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}) \log p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta) \\
& = Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}),
\end{align}
</math>

where <math>H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> is defined by the negated sum it is replacing.
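For readability, the two terms of this decomposition can be written out explicitly (this restates the sums above):

<math>
\begin{align}
Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) &= \sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}) \log p(\mathbf{X},\mathbf{Z}\mid\boldsymbol\theta) = \operatorname{E}_{\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}}\left[\log p(\mathbf{X},\mathbf{Z}\mid\boldsymbol\theta)\right], \\
H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) &= -\sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}) \log p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta),
\end{align}
</math>

where the first term is the expected complete-data log-likelihood that the M-step maximizes.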
This last equation holds for every value of <math>\boldsymbol\theta</math> including <math>\boldsymbol\theta = \boldsymbol\theta^{(t)}</math>,

<math>
\log p(\mathbf{X}\mid\boldsymbol\theta^{(t)})
= Q(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}),
</math>

and subtracting this last equation from the previous equation gives

<math>
\log p(\mathbf{X}\mid\boldsymbol\theta) - \log p(\mathbf{X}\mid\boldsymbol\theta^{(t)})
= Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)})
+ H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}).
</math>

However, [[Gibbs' inequality]] tells us that <math>H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) \ge H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)})</math>, so we can conclude that

<math>
\log p(\mathbf{X}\mid\boldsymbol\theta) - \log p(\mathbf{X}\mid\boldsymbol\theta^{(t)})
\ge Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}).
</math>
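The inequality in the <math>H</math> terms can also be verified directly: their difference is a [[Kullback–Leibler divergence]], which is always nonnegative,

<math>
H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)})
= \sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}) \log \frac{p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)})}{p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta)}
= D_{\mathrm{KL}}\big(p(\cdot\mid\mathbf{X},\boldsymbol\theta^{(t)}) \parallel p(\cdot\mid\mathbf{X},\boldsymbol\theta)\big) \ge 0.
</math>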
In words, choosing <math>\boldsymbol\theta</math> to improve <math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> causes <math>\log p(\mathbf{X}\mid\boldsymbol\theta)</math> to improve at least as much.
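This ascent property is easy to observe numerically. As a minimal, hypothetical check (synthetic data and all names invented here), the loop below runs the same kind of two-component Gaussian-mixture EM as in the earlier sketch and asserts that the recorded log-likelihood values never decrease:

<syntaxhighlight lang="python">
import numpy as np

# Synthetic data from two well-separated Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])
lls = []
for _ in range(50):
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    lls.append(np.log(dens.sum(axis=1)).sum())     # log p(X | theta^(t))
    resp = dens / dens.sum(axis=1, keepdims=True)  # E-step
    nk = resp.sum(axis=0)                          # M-step
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    w = nk / len(x)

# Monotone ascent, up to floating-point tolerance.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
</syntaxhighlight>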
The EM algorithm can be viewed as two alternating maximization steps, that is, as an example of [[coordinate descent]]. Consider the function:

<math> F(q,\theta) := \operatorname{E}_q [ \log L (\theta ; x,Z) ] + H(q), </math>

where ''q'' is an arbitrary probability distribution over the unobserved data ''z'' and ''H''(''q'') is the entropy of the distribution ''q''. This function can be written as

<math> F(q,\theta) = -D_{\mathrm{KL}}\big(q \parallel p_{Z\mid X}(\cdot\mid x;\theta ) \big) + \log L(\theta;x), </math>

where <math>p_{Z\mid X}(\cdot\mid x;\theta )</math> is the conditional distribution of the unobserved data given the observed data <math>x</math> and <math>D_{\mathrm{KL}}</math> is the [[Kullback–Leibler divergence]].
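Read together with the identity above, the two steps of EM can be restated as alternating maximizations of <math>F</math> (a standard reformulation, stated here for completeness):

:E-step: <math> q^{(t)} = \operatorname{arg\,max}_q \, F(q,\theta^{(t)}) </math>, which is attained at <math>q^{(t)} = p_{Z\mid X}(\cdot\mid x;\theta^{(t)})</math>, since that choice makes the Kullback–Leibler term vanish.

:M-step: <math> \theta^{(t+1)} = \operatorname{arg\,max}_\theta \, F(q^{(t)},\theta) </math>, which reduces to maximizing the expected complete-data log-likelihood, because the entropy term <math>H(q^{(t)})</math> does not depend on <math>\theta</math>.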
      
==Description==