Changes

885 bytes removed · 20:19, 25 December 2021 (Saturday)
No edit summary
Line 133: Line 133:
where [math]\displaystyle{ H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) }[/math] is defined by the negated sum it is replacing. This last equation holds for every value of [math]\displaystyle{ \boldsymbol\theta }[/math] including [math]\displaystyle{ \boldsymbol\theta = \boldsymbol\theta^{(t)} }[/math],
 
[math]\displaystyle{ \log p(\mathbf{X}\mid\boldsymbol\theta^{(t)}) = Q(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}), }[/math]

and subtracting this last equation from the previous equation gives
 
[math]\displaystyle{ \log p(\mathbf{X}\mid\boldsymbol\theta) - \log p(\mathbf{X}\mid\boldsymbol\theta^{(t)}) = Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}), }[/math]

However, Gibbs' inequality tells us that [math]\displaystyle{ H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) \ge H(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}) }[/math], so we can conclude that
 
[math]\displaystyle{ \log p(\mathbf{X}\mid\boldsymbol\theta) - \log p(\mathbf{X}\mid\boldsymbol\theta^{(t)}) \ge Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)}\mid\boldsymbol\theta^{(t)}). }[/math]
 
Line 155: Line 151:
In words, choosing [math]\displaystyle{ \boldsymbol\theta }[/math] to improve [math]\displaystyle{ Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) }[/math] causes [math]\displaystyle{ \log p(\mathbf{X}\mid\boldsymbol\theta) }[/math] to improve at least as much.
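A minimal numerical sketch of this argument (an illustration added here, not part of the original article): for a toy two-component 1-D Gaussian mixture it checks that [math]\displaystyle{ \log p(\mathbf{X}\mid\boldsymbol\theta) = Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) }[/math] holds for every [math]\displaystyle{ \boldsymbol\theta }[/math], that Gibbs' inequality holds for [math]\displaystyle{ H }[/math], and hence that the improvement bound above holds. The data, parameter values, and helper functions are assumptions made for the example; [math]\displaystyle{ Q }[/math] is taken as the expected complete-data log-likelihood under [math]\displaystyle{ p(\mathbf{Z}\mid\mathbf{X},\boldsymbol\theta^{(t)}) }[/math], and [math]\displaystyle{ H }[/math] as the negated expected log-posterior, matching the derivation above.

<syntaxhighlight lang="python">
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy observations from a 1-D two-component Gaussian mixture (illustrative values).
x = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(3.0, 1.5, 40)])

def log_lik(theta):
    """Observed-data log-likelihood log p(X | theta)."""
    w, mu, sd = theta
    return np.sum(np.log(w[0] * norm.pdf(x, mu[0], sd[0]) + w[1] * norm.pdf(x, mu[1], sd[1])))

def q_and_h(theta, theta_t):
    """Q(theta | theta_t): expected complete-data log-likelihood under p(Z | X, theta_t).
       H(theta | theta_t): the negated expected log-posterior, -E[log p(Z | X, theta)]."""
    def log_joint(th):
        w, mu, sd = th
        return np.stack([np.log(w[k]) + norm.logpdf(x, mu[k], sd[k]) for k in range(2)], axis=1)
    lj_t = log_joint(theta_t)
    r_t = np.exp(lj_t - logsumexp(lj_t, axis=1, keepdims=True))   # p(Z | X, theta_t)
    lj = log_joint(theta)
    log_post = lj - logsumexp(lj, axis=1, keepdims=True)          # log p(Z | X, theta)
    return np.sum(r_t * lj), -np.sum(r_t * log_post)

theta_t = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))  # current estimate
theta   = (np.array([0.6, 0.4]), np.array([-2.0, 3.0]), np.array([1.0, 1.5]))  # any other value

for th in (theta_t, theta):
    q, h = q_and_h(th, theta_t)
    assert np.isclose(log_lik(th), q + h)       # log p(X | theta) = Q + H for every theta

q_new, h_new = q_and_h(theta, theta_t)
q_old, h_old = q_and_h(theta_t, theta_t)
assert h_new >= h_old - 1e-12                   # Gibbs' inequality
# Hence the improvement in log-likelihood is at least the improvement in Q:
assert log_lik(theta) - log_lik(theta_t) >= (q_new - q_old) - 1e-9
</syntaxhighlight>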
 
In information geometry, the E step and the M step are interpreted as projections under dual affine connections, called the e-connection and the m-connection; the Kullback–Leibler divergence can also be understood in these terms.
==== Gaussian mixture ====
<!-- This section is linked from Matrix calculus -->

[Image: k-means and EM on artificial data visualized with ELKI. Using the variances, the EM algorithm can describe the normal distributions exactly, while k-means splits the data in Voronoi cells. The cluster center is indicated by the lighter, bigger symbol.]

[Animation: the EM algorithm fitting a two-component Gaussian mixture model to the Old Faithful dataset, stepping from a random initialization to convergence.]

Let [math]\displaystyle{ \mathbf{x} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n) }[/math] be a sample from a mixture of two multivariate normal distributions of dimension [math]\displaystyle{ d }[/math], and let [math]\displaystyle{ \mathbf{z} = (z_1,z_2,\ldots,z_n) }[/math] be the latent variables that determine the component from which each observation originates. Special cases of this model include censored or truncated observations from one normal distribution.
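As a concrete sketch of the EM updates for this model (illustrative, not from the original article; a 1-D special case with [math]\displaystyle{ d = 1 }[/math] and made-up data and initialization is used for brevity): the E step computes the responsibilities [math]\displaystyle{ p(z_i = k \mid x_i, \boldsymbol\theta^{(t)}) }[/math], the M step maximizes [math]\displaystyle{ Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) }[/math] in closed form, and, by the argument in the proof above, the observed-data log-likelihood never decreases.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Illustrative 1-D data (d = 1 for brevity); the same scheme extends to d dimensions.
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 0.8, 100)])
n = x.size

# Initial guess for theta = (mixing weights, component means, standard deviations).
w = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
sd = np.array([1.0, 1.0])

prev_ll = -np.inf
for step in range(200):
    # E step: responsibilities r[i, k] = p(z_i = k | x_i, theta^(t)).
    joint = np.stack([w[k] * norm.pdf(x, mu[k], sd[k]) for k in range(2)], axis=1)
    ll = np.sum(np.log(joint.sum(axis=1)))      # log p(X | theta^(t))
    assert ll >= prev_ll - 1e-9                 # monotone improvement, as proved above
    r = joint / joint.sum(axis=1, keepdims=True)
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

    # M step: closed-form maximizer of Q(theta | theta^(t)).
    nk = r.sum(axis=0)
    w = nk / n
    mu = (r * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(step, ll, w, mu, sd)
</syntaxhighlight>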

Alternatives to EM exist, such as the so-called spectral techniques. Moment-based approaches to learning the parameters of a probabilistic model have attracted increasing interest recently, since under certain conditions they enjoy guarantees such as global convergence, unlike EM, which is often plagued by the issue of getting stuck in local optima. Algorithms with learning guarantees can be derived for a number of important models, such as mixture models and HMMs. For these spectral methods, no spurious local optima occur, and the true parameters can be consistently estimated under some regularity conditions.
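To make the contrast concrete, here is a deliberately simplified, hypothetical illustration (much simpler than the spectral methods referenced above, which rely on higher-order moments): if the two component densities are known and only the mixing weight is unknown, a single sample moment already gives a closed-form estimate, with no initialization and no local optima. The component parameters and sample size are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

# Deliberately simplified setting: two *known* unit-variance components,
# N(0, 1) and N(4, 1); only the mixing weight w is unknown.
mu1, mu2, w_true = 0.0, 4.0, 0.3
z = rng.random(5000) < w_true
x = np.where(z, rng.normal(mu1, 1.0, 5000), rng.normal(mu2, 1.0, 5000))

# First moment: E[X] = w*mu1 + (1 - w)*mu2, so w = (mu2 - E[X]) / (mu2 - mu1).
# The estimate is a closed-form function of a sample moment: no iteration,
# no initialization, and no local optima to get stuck in.
w_hat = (mu2 - x.mean()) / (mu2 - mu1)
print(w_true, w_hat)
</syntaxhighlight>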