where ''B''(''t'') is the [[Brownian bridge]]. The cumulative distribution function of ''K'' is given by

:<math>\operatorname{Pr}(K\leq x)=1-2\sum_{k=1}^\infty (-1)^{k-1} e^{-2k^2 x^2}=\frac{\sqrt{2\pi}}{x}\sum_{k=1}^\infty e^{-(2k-1)^2\pi^2/(8x^2)},</math>
which can also be expressed by the Jacobi theta function <math>\vartheta_{01}(z=0;\tau=2ix^{2}/\pi)</math>. Both the form of the Kolmogorov–Smirnov test statistic and its asymptotic distribution under the null hypothesis were published by Andrey Kolmogorov, while a table of the distribution was published by Nikolai Smirnov. Recurrence relations for the distribution of the test statistic in finite samples are available.

Under the null hypothesis that the sample comes from the hypothesized distribution ''F''(''x''),

:<math>\sqrt{n}D_n \xrightarrow{n\to\infty} \sup_t |B(F(t))|</math>

[[convergence of random variables|in distribution]], where ''B''(''t'') is the [[Brownian bridge]].
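
A minimal numerical sketch of the two series for <math>\operatorname{Pr}(K\leq x)</math> given above, assuming Python with NumPy and SciPy; this is an illustration, not one of the implementations cited later in this article:

<syntaxhighlight lang="python">
# Evaluate both series for Pr(K <= x) and cross-check against SciPy's Kolmogorov routine.
# kmax = 100 is an arbitrary truncation point for the infinite sums.
import numpy as np
from scipy.special import kolmogorov  # Pr(K > x), the complementary cdf

def kolmogorov_cdf_alternating(x, kmax=100):
    """Pr(K <= x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)."""
    k = np.arange(1, kmax + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k ** 2 * x ** 2))

def kolmogorov_cdf_theta(x, kmax=100):
    """Pr(K <= x) = sqrt(2*pi)/x * sum_{k>=1} exp(-(2k-1)^2 pi^2 / (8 x^2))."""
    k = np.arange(1, kmax + 1)
    return np.sqrt(2.0 * np.pi) / x * np.sum(np.exp(-(2 * k - 1) ** 2 * np.pi ** 2 / (8.0 * x ** 2)))

x = 1.2
print(kolmogorov_cdf_alternating(x))  # ~0.8878
print(kolmogorov_cdf_theta(x))        # same value from the theta-function form
print(1.0 - kolmogorov(x))            # SciPy agrees: 1 - Pr(K > x)
</syntaxhighlight>
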
If ''F'' is continuous then under the null hypothesis <math>\sqrt{n}D_n</math> converges to the Kolmogorov distribution, which does not depend on ''F''. This result may also be known as the Kolmogorov theorem. The accuracy of this limit as an approximation to the exact cdf of <math>K</math> when <math>n</math> is finite is not very impressive: even when <math>n=1000</math>, the corresponding maximum error is about <math>0.9\%</math>; this error increases to <math>2.6\%</math> when <math>n=100</math> and to a totally unacceptable <math>7\%</math> when <math>n=10</math>.  However, a very simple expedient of replacing <math>x</math> by  
 
:<math>x+\frac{1}{6\sqrt{n}}+\frac{x-1}{4n}</math>
in the argument of the Jacobi theta function reduces these errors to 0.003%, 0.027%, and 0.27%, respectively; such accuracy is sufficient for all practical applications.
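
As a hedged illustration of this correction, assuming a recent SciPy that provides <code>scipy.stats.kstwo</code>; the sample size and observed statistic below are arbitrary:

<syntaxhighlight lang="python">
# The limiting Kolmogorov distribution is evaluated at a shifted argument
# instead of at x = sqrt(n) * D_n; compare with the exact finite-n p-value.
import numpy as np
from scipy.special import kolmogorov   # Pr(K > x) for the limiting distribution
from scipy.stats import kstwo          # exact distribution of D_n for finite n

n, d = 10, 0.35                        # illustrative values, not data from the article
x = np.sqrt(n) * d

p_asymptotic = kolmogorov(x)                                      # plain limiting approximation
x_corrected = x + 1.0 / (6.0 * np.sqrt(n)) + (x - 1.0) / (4.0 * n)
p_corrected = kolmogorov(x_corrected)                             # with the corrected argument
p_exact = kstwo.sf(d, n)                                          # exact finite-n p-value

print(p_asymptotic, p_corrected, p_exact)  # the corrected value tracks the exact one much more closely
</syntaxhighlight>
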
The goodness-of-fit test or the Kolmogorov–Smirnov test can be constructed by using the critical values of the Kolmogorov distribution. The test is asymptotically valid when <math>n\to\infty</math>. It rejects the null hypothesis at level <math>\alpha</math> if

:<math>\sqrt{n}D_n>K_\alpha,</math>

where ''K''<sub>''α''</sub> is found from

:<math>\operatorname{Pr}(K\leq K_\alpha)=1-\alpha.</math>

The asymptotic [[statistical power|power]] of this test is 1.
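
A minimal sketch of this decision rule, assuming SciPy; the Uniform(0,&nbsp;1) null and the simulated sample are arbitrary illustrative choices:

<syntaxhighlight lang="python">
# Reject H0 at level alpha when sqrt(n) * D_n > K_alpha, where Pr(K > K_alpha) = alpha.
import numpy as np
from scipy.special import kolmogi        # inverse of Pr(K > x), so K_alpha = kolmogi(alpha)
from scipy.stats import uniform

rng = np.random.default_rng(0)
x = rng.uniform(size=200)                # sample to be tested
n = x.size

# One-sample KS statistic D_n against the hypothesized cdf F (here: Uniform(0, 1)).
xs = np.sort(x)
F = uniform.cdf(xs)
i = np.arange(1, n + 1)
D_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))

alpha = 0.05
K_alpha = kolmogi(alpha)                 # ~1.358 for alpha = 0.05
reject = np.sqrt(n) * D_n > K_alpha
print(D_n, K_alpha, reject)
</syntaxhighlight>
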
 
    
Fast and accurate algorithms to compute the cdf <math>\operatorname{Pr}(D_n \leq x)</math> or its complement for arbitrary <math>n</math> and <math>x</math>, are available from:
 
* Simard R, L'Ecuyer P (2011), "Computing the Two-Sided Kolmogorov–Smirnov Distribution", ''Journal of Statistical Software'', and Moscovich A, Nadler B (2017), "Fast calculation of boundary crossing probabilities for Poisson processes", ''Statistics & Probability Letters'', for continuous null distributions, with C and Java code to be found in the Simard and L'Ecuyer article.
* Dimitrova DS, Kaishev VK, Tan S (2019), "Computing the Kolmogorov–Smirnov Distribution when the Underlying cdf is Purely Discrete, Mixed or Continuous", ''Journal of Statistical Software'', for a purely discrete, mixed or continuous null distribution, implemented in the KSgeneral package (Dimitrova, Dimitrina; Kaishev, Vladimir; Tan, Senren, "KSgeneral: Computing P-Values of the K-S Test for (Dis)Continuous Null Distribution") of the R project for statistical computing, which for a given sample also computes the KS test statistic and its p-value. An alternative C++ implementation is available from the same article.
===Test with estimated parameters===
         +
If either the form or the parameters of ''F''(''x'') are determined from the data ''X''<sub>''i''</sub> the critical values determined in this way are invalid. In such cases, [[Monte Carlo method|Monte Carlo]] or other methods may be required, but tables have been prepared for some cases. Details for the required modifications to the test statistic and for the critical values for  the [[normal distribution]] and the [[exponential distribution]] have been published,<ref name="Pearson & Hartley">{{cite book |title= Biometrika Tables for Statisticians |editor=Pearson, E. S. |editor2=Hartley, H. O. |year=1972 |volume=2 |publisher=Cambridge University Press |isbn=978-0-521-06937-3 |pages=117–123, Tables 54, 55}}</ref> and later publications also include the [[Gumbel distribution]].<ref name="Shorak & Wellner">{{cite book |title=Empirical Processes with Applications to Statistics |first1=Galen R. |last1=Shorack |first2=Jon A. |last2=Wellner |year=1986 |isbn=978-0471867258 |publisher=Wiley |page=239}}</ref> The [[Lilliefors test]] represents a special case of this for the normal distribution. The logarithm transformation may help to overcome cases where the Kolmogorov test data does not seem to fit the assumption that it came from the normal distribution.
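
A hedged sketch of the Monte Carlo approach for a normal null with estimated parameters (a Lilliefors-style parametric bootstrap), assuming SciPy/NumPy; the replication count and seeds are arbitrary:

<syntaxhighlight lang="python">
# When the parameters of F are estimated from the data, the usual KS critical values are
# invalid, so the null distribution of D_n is simulated with re-estimated parameters.
import numpy as np
from scipy.stats import norm, kstest

def ks_stat_normal(x):
    """KS statistic against a normal cdf whose parameters are the ML estimates from x itself."""
    mu, sigma = x.mean(), x.std(ddof=0)   # ML estimates for the normal model
    return kstest(x, norm(loc=mu, scale=sigma).cdf).statistic

def monte_carlo_p_value(x, n_sim=10_000, seed=0):
    """Simulate the null distribution of D_n, re-estimating the parameters in every replication."""
    rng = np.random.default_rng(seed)
    d_obs = ks_stat_normal(x)
    n = len(x)
    d_sim = np.array([ks_stat_normal(rng.normal(size=n)) for _ in range(n_sim)])
    return np.mean(d_sim >= d_obs)

x = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=50)
print(monte_carlo_p_value(x))   # should be far from significant for genuinely normal data
</syntaxhighlight>
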
Using estimated parameters, the question arises which estimation method should be used. Usually this would be the maximum likelihood method, but for the normal distribution, for example, MLE has a large bias error on sigma. Using a moment fit or KS minimization instead has a large impact on the critical values, and also some impact on test power. If we need to decide via the KS test whether Student-T data with df&nbsp;=&nbsp;2 could be normal or not, then an ML estimate based on H<sub>0</sub> (the data are normal, so using the standard deviation for scale) would give a much larger KS distance than a fit with minimum KS. In this case we should reject H<sub>0</sub>, which is often the case with MLE, because the sample standard deviation might be very large for T-2 data, but with KS minimization we may still get a KS value too low to reject&nbsp;H<sub>0</sub>. In the Student-T case, a modified KS test with a KS estimate instead of the MLE indeed makes the KS test slightly worse. However, in other cases such a modified KS test leads to slightly better test power.
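
A hedged sketch of this comparison, assuming SciPy/NumPy; the sample size, seed and choice of the Nelder–Mead optimizer are arbitrary:

<syntaxhighlight lang="python">
# Fit a normal model to Student-t (df = 2) data once by maximum likelihood and once by
# minimizing the KS distance, then compare the resulting KS statistics.
import numpy as np
from scipy import optimize
from scipy.stats import norm, kstest, t

rng = np.random.default_rng(0)
x = t.rvs(df=2, size=200, random_state=rng)       # heavy-tailed Student-t (df = 2) data

# Normal fit via the ML estimates (sample mean and ML standard deviation).
d_mle = kstest(x, norm(loc=x.mean(), scale=x.std(ddof=0)).cdf).statistic

# Normal fit chosen to minimize the KS distance itself (log-sigma keeps the scale positive).
def ks_distance(params):
    mu, log_sigma = params
    return kstest(x, norm(loc=mu, scale=np.exp(log_sigma)).cdf).statistic

res = optimize.minimize(ks_distance, x0=[np.median(x), np.log(x.std(ddof=0))],
                        method="Nelder-Mead")
d_min = res.fun

print(d_mle, d_min)   # the MLE-based distance is typically much larger for T-2 data
</syntaxhighlight>
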
===Discrete and mixed null distribution===
     −
 
Under the assumption that <math>F(x)</math> is non-decreasing and right-continuous, with a countable (possibly infinite) number of jumps, the KS test statistic can be expressed as:
 
:<math>D_n= \sup_x |F_n(x)-F(x)| = \sup_{0 \leq t \leq 1} |F_n(F^{-1}(t)) - F(F^{-1}(t))|. </math>
 
From the right-continuity of <math>F(x)</math>, it follows that <math>F(F^{-1}(t)) \geq t</math> and <math>F^{-1}(F(x)) \leq x</math> and hence, the distribution of <math>D_{n}</math> depends on the null distribution <math>F(x)</math>, i.e., is no longer distribution-free as in the continuous case. Therefore, a fast and accurate method has been developed to compute the exact and asymptotic distribution of <math>D_{n}</math> when <math>F(x)</math> is purely discrete or mixed<ref name=DKT2019/>, implemented in C++ and in the KSgeneral package<ref name=KSgeneral/> of the [[R (programming language)|R language]]. The functions <code>disc_ks_test()</code>, <code>mixed_ks_test()</code> and <code>cont_ks_test()</code> also compute the KS test statistic and p-values for purely discrete, mixed or continuous null distributions and arbitrary sample sizes. The KS test and its p-values for discrete null distributions and small sample sizes are also computed in<ref name=arnold-emerson>{{Cite journal |first1=Taylor B. |last1=Arnold |first2=John W. |last2=Emerson |year=2011 |title=Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions |journal=The R Journal |volume=3 |issue=2 |pages=34–39 |url=http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Arnold+Emerson.pdf |doi=10.32614/rj-2011-016}}</ref> as part of the dgof package of the R language. Major statistical packages, among which are [[SAS (software)|SAS]] <code>PROC NPAR1WAY</code><ref>{{cite web|url=https://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_npar1way_toc.htm|title=SAS/STAT(R) 14.1 User's Guide|website=support.sas.com|accessdate=14 April 2018}}</ref> and [[Stata]] <code>ksmirnov</code><ref>{{cite web|url=https://www.stata.com/manuals15/rksmirnov.pdf|title=ksmirnov — Kolmogorov–Smirnov equality-of-distributions test|website=stata.com|accessdate=14 April 2018}}</ref>, implement the KS test under the assumption that <math>F(x)</math> is continuous, which is more conservative if the null distribution is actually not continuous (see<ref name=Noether63>{{Cite journal |vauthors=Noether GE |year=1963 |title=Note on the Kolmogorov Statistic in the Discrete Case |journal=Metrika |volume=7 |issue=1 |pages=115–116 |doi=10.1007/bf02613966}}</ref><ref name=Slakter65>{{Cite journal |vauthors=Slakter MJ |year=1965 |title=A Comparison of the Pearson Chi-Square and Kolmogorov Goodness-of-Fit Tests with Respect to Validity |journal=Journal of the American Statistical Association |volume=60 |issue=311 |pages=854–858 |doi=10.2307/2283251 |jstor=2283251}}</ref><ref name=Walsh63>{{Cite journal |vauthors=Walsh JE |year=1963 |title=Bounded Probability Properties of Kolmogorov–Smirnov and Similar Statistics for Discrete Data |journal=Annals of the Institute of Statistical Mathematics |volume=15 |issue=1 |pages=153–158 |doi=10.1007/bf02865912}}</ref>).
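
A hedged sketch of the statistic for a purely discrete null, assuming SciPy/NumPy; the Binomial(5,&nbsp;0.4) null and the simulated sample are arbitrary, and exact p-values would still require a package such as KSgeneral or dgof cited above:

<syntaxhighlight lang="python">
# Compute D_n = sup_x |F_n(x) - F(x)| when F is a purely discrete cdf.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_trials, p = 5, 0.4
x = binom.rvs(n_trials, p, size=100, random_state=rng)   # sample from the (discrete) null
n = x.size
xs = np.sort(x)

support = np.arange(n_trials + 1)
F = binom.cdf(support, n_trials, p)                  # F at each support point
F_left = np.concatenate(([0.0], F[:-1]))             # left limits F(x-)
Fn = np.searchsorted(xs, support, side="right") / n  # empirical cdf at the support points
Fn_left = np.searchsorted(xs, support, side="left") / n

# The supremum of |F_n - F| is attained at a support point or just to its left.
D_n = max(np.max(np.abs(Fn - F)), np.max(np.abs(Fn_left - F_left)))
print(D_n)
</syntaxhighlight>
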
+
==Two-sample Kolmogorov–Smirnov test==
 
''Illustration of the two-sample Kolmogorov–Smirnov statistic: red and blue lines each correspond to an empirical distribution function, and the black arrow is the two-sample KS statistic.''

The Kolmogorov–Smirnov test may also be used to test whether two underlying one-dimensional probability distributions differ. In this case, the Kolmogorov–Smirnov statistic is

:<math>D_{n,m}=\sup_x |F_{1,n}(x)-F_{2,m}(x)|,</math>

where <math>F_{1,n}</math> and <math>F_{2,m}</math> are the [[empirical distribution function]]s of the first and the second sample respectively.

For large samples, the null hypothesis is rejected at level <math>\alpha</math> if

:<math>D_{n,m}>c(\alpha)\sqrt{\frac{n + m}{n\cdot m}},</math>

where ''n'' and ''m'' are the sizes of the first and second sample respectively. The value of <math>c(\alpha)</math> is given in the table below for the most common levels of <math>\alpha</math>

{| class="wikitable"
|-
| <math>\alpha</math> || 0.20 || 0.15 || 0.10 || 0.05 || 0.025 || 0.01 || 0.005 || 0.001
|-
| <math>c(\alpha)</math> || 1.073 || 1.138 || 1.224 || 1.358 || 1.48 || 1.628 || 1.731 || 1.949
|}

and in general by

:<math>c\left(\alpha\right)=\sqrt{-\ln\left(\tfrac{\alpha}{2}\right)\cdot \tfrac{1}{2}}.</math>
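
A minimal sketch of this large-sample rule, assuming SciPy/NumPy; the two normal samples are arbitrary illustrative data:

<syntaxhighlight lang="python">
# Reject H0 at level alpha when D_{n,m} > c(alpha) * sqrt((n + m) / (n * m)).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=300)   # first sample, size n
y = rng.normal(loc=0.5, scale=1.0, size=200)   # second sample, size m

D_nm = ks_2samp(x, y).statistic                # two-sample KS statistic
n, m = len(x), len(y)

c_alpha = 1.358                                # c(alpha) for alpha = 0.05, from the table above
reject = D_nm > c_alpha * np.sqrt((n + m) / (n * m))
print(D_nm, reject)                            # the 0.5 location shift should typically be detected
</syntaxhighlight>
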