更改

数据分析 (查看源代码)

2020年9月1日 (二) 09:27的版本

添加4,886字节、 2020年9月1日 (二) 09:27

第1,164行：第1,164行：

在教育方面，大多数教育工作者都可以使用数据系统来分析学生的数据。这些数据系统以'''“非处方式”数据格式over-the-counter data format'''（嵌入标签、补充文档和帮助系统，并在包装／展示和内容方面作出关键决策）向教育工作者呈现数据，以提高其数据分析的准确性。

−

==Practitioner notes==

+

==Practitioner notes 从业者注意事项==

This section contains rather technical explanations that may assist practitioners but are beyond the typical scope of a Wikipedia article.

第1,170行：第1,170行：

This section contains rather technical explanations that may assist practitioners but are beyond the typical scope of a Wikipedia article.

−

~~这个部分包含了一些技术性的解释，可能对从业者有所帮助，但是超出了维基百科文章的典型范围。~~

+

这个部分包含了一些技术性的解释，它们可能对从业者有所帮助，但是超出了维基百科文章的典型范围。

−

===Initial data analysis===

+

===Initial data analysis 初始数据分析===

The most important distinction between the initial data analysis phase and the main analysis phase, is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:{{sfn|Adèr|2008a|p=337}}

第1,180行：第1,180行：

The most important distinction between the initial data analysis phase and the main analysis phase, is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:

−

在初始数据分析阶段和主要分析阶段之间最重要的区别是，在初始数据分析阶段，人们不进行任何旨在回答原始研究问题的分析。初始数据分析阶段以下列四个问题为指导:

+

在初始数据分析阶段和主要分析阶段之间最重要的区别是，在初始数据分析阶段，人们不进行任何旨在回答原始研究问题的分析。初始数据分析阶段由下列四个问题指导:

−

====Quality of data====

+

====Quality of data 数据质量====

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms), n: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable.

第1,190行：第1,190行：

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms), n: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable.

−

~~应尽早检查数据的质量。数据质量可以通过几种方式进行评估，使用不同类型的分析~~: ~~频率计数、描述统计学(平均值、标准差、中位数)、正态性(偏态、峰度、频率直方图)、~~ n: ~~变量与数据集外部变量的编码方案进行比较，如果编码方案不具有可比性，则可能进行修正。~~

+

应尽早检查数据的质量。数据质量可以通过多种方式、使用不同类型的分析进行评估: 频数统计、描述统计量（平均值、标准差、中位数）、正态性（偏度、峰度、频率直方图）、 n: 将变量与数据集外部变量的编码方案进行比较，如果编码方案不具有可比性，则可能需要进行修正。

*Test for [[common-method variance]].

+

* 检验'''共同方法变异common-method variance'''

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.{{sfn|Adèr|2008a|pp=338-341}}

第1,198行：第1,200行：

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.

−

~~在初始数据分析阶段评估数据质量的分析的选择取决于将在主要分析阶段进行的分析。~~

+

在初始数据分析阶段，评估数据质量的分析方法的选择取决于将在主要分析阶段进行的分析。
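The data-quality checks listed above (frequency counts, descriptive statistics, skewness and kurtosis) can be illustrated with a minimal sketch. It uses only the Python standard library; the `describe` helper and the `scores` values are hypothetical and are not taken from the article or its sources.

```python
from collections import Counter
from statistics import mean, median, pstdev


def describe(values):
    """Return descriptive statistics plus moment-based skewness and excess kurtosis."""
    n = len(values)
    m = mean(values)
    sd = pstdev(values)  # population SD, used for the moment-based shape measures
    # Standardized third and fourth central moments
    skewness = sum((x - m) ** 3 for x in values) / (n * sd ** 3)
    excess_kurtosis = sum((x - m) ** 4 for x in values) / (n * sd ** 4) - 3
    return {
        "n": n,
        "mean": m,
        "median": median(values),
        "sd": sd,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Hypothetical scores with one extreme observation (14)
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]

frequencies = Counter(scores)  # frequency counts per observed value
summary = describe(scores)

print(frequencies.most_common(3))
print(summary)
```

A mean well above the median together with positive skewness, as in this invented sample, is the kind of signal that would prompt a closer look at extreme observations before the main analysis.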

−

====Quality of measurements====

+

====Quality of measurements 测量的质量====

The quality of the [[measuring instrument|measurement instruments]] should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether structure of measurement instruments corresponds to structure reported in the literature.

第1,208行：第1,210行：

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether structure of measurement instruments corresponds to structure reported in the literature.

−

~~测量仪器的质量只能在初始数据分析阶段进行检验，这不是本研究的重点或研究问题。应该检查测量仪器的结构是否符合文献报道的结构。~~

+

只有当测量工具的质量不是研究的重点或研究问题时，才应在初始数据分析阶段对其进行检验。从业者应该检查测量工具的结构是否与文献报告的结构一致。

第1,220行：第1,222行：

*Analysis of homogeneity ([[internal consistency]]), which gives an indication of the [[Reliability (statistics)|reliability]] of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the [[Cronbach's alpha|Cronbach's α]] of the scales, and the change in the Cronbach's alpha when an item would be deleted from a scale{{sfn|Adèr|2008a|pp=341-342}}

+

* 同质性分析（'''内部一致性internal consistency'''），用于反映测量工具的'''信度Reliability'''。在这一分析过程中，需要检查各项目和各量表的方差、量表的'''克隆巴赫α系数Cronbach's alpha'''，以及从量表中删除某一项目时克隆巴赫α系数的变化。
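As a rough illustration of the homogeneity analysis described above, the sketch below computes Cronbach's α from the item variances and the variance of the total score, and then recomputes it with each item removed in turn. It assumes complete data (no missing responses) and uses sample variances; the function names and the three-item example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of item-score lists, one list per item, with respondents
    in the same order in every list (assumes no missing responses).
    """
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def alpha_if_item_deleted(items):
    """Alpha of the remaining scale after dropping each item in turn (needs >= 3 items)."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]


# Hypothetical responses of five people to a three-item scale
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]

alpha = cronbach_alpha(items)
alpha_deleted = alpha_if_item_deleted(items)
print(round(alpha, 3), [round(a, 3) for a in alpha_deleted])
```

An item whose removal raises α noticeably is a candidate for closer inspection; whether to drop it is a substantive decision that the sketch does not make.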

−

====Initial ~~transformations~~====

+

====Initial transformations初始的转换====

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.{{sfn|Adèr|2008a|p=344}}

第1,228行：第1,232行：

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.

−

在对数据和测量数据的质量进行评估之后，人们可能会决定填补缺失的数据，或者对一个或多个变量进行初始转换，尽管这也可以在主要分析阶段进行。

+

在对数据和测量的质量进行评估之后，从业者可能会决定填补缺失的数据，或者对一个或多个变量进行'''初始的转换initial transformations'''，尽管这也可以在主要分析阶段进行。

Possible transformations of variables are:<ref>Tabachnick & Fidell, 2007, p. 87-88.</ref>

第1,234行：第1,238行：

Possible transformations of variables are:

−

~~变量的可能转换如下~~:

+

变量可能的转换如下:

*Square root transformation (if the distribution differs moderately from normal)

+

* 平方根转换（如果数据分布与正态分布有中等程度的偏离）

*Log-transformation (if the distribution differs substantially from normal)

+

* 对数转换（如果数据分布与正态分布大不相同）

*Inverse transformation (if the distribution differs severely from normal)

+

* 倒数转换（如果分布与正态分布严重不同）

*Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)

+

* 分类变量处理（顺序或二元变量）（如果分布与正态分布严重不同，且没有转换方法可以补救）
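The three numeric transformations listed above can be sketched as follows. This is a minimal illustration using the standard library; it assumes non-negative values for the square root and strictly positive values for the logarithm and the inverse, and raises an error otherwise. The function names are hypothetical, and the choice of base 10 for the logarithm is an assumption.

```python
import math


def sqrt_transform(values):
    """Square root transformation; assumes all values are non-negative."""
    if any(x < 0 for x in values):
        raise ValueError("square root transformation needs non-negative values")
    return [math.sqrt(x) for x in values]


def log_transform(values):
    """Base-10 log transformation; assumes all values are strictly positive."""
    if any(x <= 0 for x in values):
        raise ValueError("log transformation needs strictly positive values")
    return [math.log10(x) for x in values]


def inverse_transform(values):
    """Inverse (reciprocal) transformation; assumes all values are strictly positive."""
    if any(x <= 0 for x in values):
        raise ValueError("inverse transformation needs strictly positive values")
    return [1 / x for x in values]


# Hypothetical right-skewed variable
raw = [1, 10, 100]
print(sqrt_transform(raw))
print(log_transform(raw))
print(inverse_transform(raw))
```

Note that the inverse transformation reverses the ordering of the values, which matters when the transformed variable is interpreted.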

−

+

====Did the implementation of the study fulfill the intentions of the research design? 研究的实施是否完成了研究设计的目的？====

−

====Did the implementation of the study fulfill the intentions of the research design?====

One should check the success of the [[randomization]] procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.

第1,252行：第1,259行：

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.

−

~~人们应该检查随机化程序是否成功，例如通过检查背景变量和实质变量是否在组内和组间均匀分布。~~

+

从业者应该检查随机化程序是否成功，例如通过检查'''背景变量background variables'''和'''实质变量substantive variables'''是否在组内和组间均匀分布。

If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in sample.

第1,258行：第1,265行：

If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in sample.

−

~~如果研究不需要或不使用随机化程序，则应检查非随机抽样的成功与否，例如通过检查样本中是否代表了相关人口的所有分组。~~

+

如果研究不需要或不使用随机化程序，则应检查非随机抽样是否成功，例如检查目标总体的所有子群是否都在样本中有所体现。

Other possible data distortions that should be checked are:

第1,264行：第1,271行：

Other possible data distortions that should be checked are:

−

~~应该检查的其他可能的数据扭曲有~~:

+

其他应该检查的可能的数据扭曲有:

*[[Dropout (electronics)|dropout]] (this should be identified during the initial data analysis phase)

第1,271行：第1,278行：

*Treatment quality (using [[manipulation check]]s).{{sfn|Adèr|2008a|pp=344-345}}

+

* '''数据遗失Dropout'''（这应该在初始数据分析阶段进行识别）

+

* 项目'''回收率Response rate'''（是否随机，应在初始数据分析阶段进行评估）

+

* '''操纵质量Treatment quality''' （使用'''操纵检验manipulation check'''）{{ sfn | Adèr | 2008a | pp = 344-345}}

−

====Characteristics of data ~~sample~~====

+

====Characteristics of data sample数据样本的特征====

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.

第1,280行：第1,293行：

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.

−

~~在任何报告或文章中，样品的结构必须被准确描述。在主要分析阶段进行子群分析时，准确确定样本的结构(特别是子群的大小)尤为重要。~~

+

在任何报告或文章中，样本的结构必须被准确描述。在主要分析阶段进行'''子群subgroups'''分析时，准确确定样本的结构（特别是子群的大小）尤为重要。

The characteristics of the data sample can be assessed by looking at:

第1,286行：第1,299行：

The characteristics of the data sample can be assessed by looking at:

−

~~数据样本的特征可以通过以下方式进行评估~~:

+

数据样本的特征可以通过以下几个方面进行评估:

*Basic statistics of important variables

+

* 重要变量的基本统计学特征

*Scatter plots

+

* 散点图

*Correlations and associations

+

* 相关和联系

*Cross-tabulations{{sfn|Adèr|2008a|p=345}}

+

* '''交叉表Cross-tabulations'''

−

+

====Final stage of the initial data analysis 初始数据分析的最后阶段====

−

====Final stage of the initial data analysis====

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.

第1,304行：第1,320行：

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.

−

~~在最后阶段，记录初始数据分析的结果，并采取必要的、可取的和可能的纠正措施。~~

+

在最后阶段，从业者需要记录初始数据分析的结果，并采取必要的、可取的和可能的纠正措施。

Also, the original plan for the main data analyses can and should be specified in more detail or rewritten. In order to do this, several decisions about the main data analyses can and should be made:

第1,313行：第1,329行：

*In the case of non-[[Normal distribution|normal]]s: should one [[Data transformation (statistics)|transform]] variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?

+

* 在非正态分布的情况下：是否应该对变量进行数据转换；将变量转为分类变量（顺序/二分）；调整分析方法？

*In the case of [[missing data]]: should one neglect or impute the missing data; which imputation technique should be used?

+

* 在有缺失数据的情况下：是否应该忽视或插补缺失数据？应该使用哪种插补方法？

*In the case of [[outlier]]s: should one use robust analysis techniques?

+

* 在有异常值的情况下：是否应该使用稳健的分析技术？

*In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?

+

*在项目和测量量尺不匹配的情况下：是否应省略项目以对测量仪器进行调整，还是应确保与其他（用途的）测量仪器具有可比性？

*In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or [[bootstrapping (statistics)|bootstrapping]]?

+

* 在具有（太）小的子群的情况下：是否应该放弃群体间差异的假设，或者使用小样本技术，比如'''精确检验exact tests'''或者bootstrapping？

*In case the [[randomization]] procedure seems to be defective: can and should one calculate [[Propensity score matching|propensity scores]] and include them as covariates in the main analyses?{{sfn|Adèr|2008a|pp=345-346}}

+

* 在随机化程序似乎有缺陷的情况下：能够而且应该计算'''倾向分数propensity scores'''并将其作为协变量包括在主要分析中吗？

−

====Analysis====

+

====Analysis 分析====

Several analyses can be used during the initial data analysis phase:{{sfn|Adèr|2008a|pp=346-347}}

第1,335行：第1,358行：

*Univariate statistics (single variable)

+

* 单变量统计学（单变量）

*Bivariate associations (correlations)

+

* 双变量关联（相关）

*Graphical techniques (scatter plots)

−

+

* 图表技术（散点图）

第1,346行：第1,371行：

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:

−

~~在进行分析时必须考虑到变量的衡量水平，因为每个水平都有专门的统计技术~~:

+

在进行分析时必须考虑到变量的测量水平，因为每个水平都有专门的统计技术:

*Nominal and ordinal variables

+

* 称名和顺序变量

**Frequency counts (numbers and percentages)

+

** 频数（数和百分比）

**Associations

+

** 相关

***circumambulations (crosstabulations)

+

*** '''circumambulations '''（交叉表crosstabulations）

***hierarchical loglinear analysis (restricted to a maximum of 8 variables)

+

*** '''层次对数线性分析hierarchical loglinear analysis'''（最多只能有8个变量）

***loglinear analysis (to identify relevant/important variables and possible confounders)

+

*** '''对数线性分析 loglinear analysis'''（确定相关/重要的变量及可能的混淆因素）

**Exact tests or bootstrapping (in case subgroups are small)

+

** '''精确检验exact tests'''或者bootstrapping（当子群很小时）

**Computation of new variables

+

** 计算新变量

*Continuous variables

+

* 连续变量

**Distribution

+

** 分布

***Statistics (M, SD, variance, skewness, kurtosis)

+

*** 统计学量（均值M，标准差SD，方差，偏度，峰度）

***Stem-and-leaf displays

+

*** '''茎叶图Stem-and-leaf displays'''

***Box plots

+

*** '''箱图Box plots'''

−

====Nonlinear ~~analysis~~====

+

====Nonlinear analysis非线性分析====

Nonlinear analysis is often necessary when the data is recorded from a [[nonlinear system]]. Nonlinear systems can exhibit complex dynamic effects including [[bifurcation theory|bifurcations]], [[chaos theory|chaos]], [[harmonics]] and [[subharmonics]] that cannot be analyzed using simple linear methods. Nonlinear data analysis is closely related to [[nonlinear system identification]].<ref name="SAB1">Billings S.A. "Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains". Wiley, 2013</ref>

第1,384行：第1,422行：

Nonlinear analysis is often necessary when the data is recorded from a nonlinear system. Nonlinear systems can exhibit complex dynamic effects including bifurcations, chaos, harmonics and subharmonics that cannot be analyzed using simple linear methods. Nonlinear data analysis is closely related to nonlinear system identification.

−

非线性分析通常是必要的时候，数据是记录从一个非线性。非线性系统可以表现出复杂的动力学效应，包括分岔、混沌、谐波和次谐波，这些效应不能用简单的线性方法进行分析。非线性数据分析与非线性系统辨识密切相关。

+

'''非线性分析Nonlinear analysis'''通常在数据是从非线性系统中获取的时候是必要的。非线性系统可以表现出复杂的动力学效应，包括'''分岔bifurcations'''、'''混沌chaos'''、'''谐波harmonics'''和'''次谐波subharmonics'''，这些效应不能用简单的线性方法进行分析。非线性数据分析与非线性系统辨识密切相关。

−

+


−

===Main data analysis===

+

===Main data analysis 主要数据分析===

In the main analysis phase analyses aimed at answering the research question are performed as well as any other relevant analysis needed to write the first draft of the research report.{{sfn|Adèr|2008b|p=363}}

第1,394行：第1,432行：

In the main analysis phase analyses aimed at answering the research question are performed as well as any other relevant analysis needed to write the first draft of the research report.

−

~~在主要分析阶段，进行了旨在回答研究问题的分析，以及撰写研究报告初稿所需的其他相关分析。~~

+

主要分析阶段进行旨在回答研究问题的分析，以及撰写研究报告初稿所需的其他相关分析。

−

====Exploratory and confirmatory ~~approaches~~====

+

====Exploratory and confirmatory approaches探索性和验证性方法====

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

第1,404行：第1,442行：

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

−

在主要的分析阶段，可以采用探索性或验证性的方法。通常这种方法是在收集数据之前决定的。在探索性分析中，在分析数据之前没有明确的假设，而是搜索能够很好地描述数据的模型。在验证性分析中，对数据进行了明确的假设检验。

+

在主要的分析阶段，可以采用探索性或验证性的方法。通常方法是在收集数据之前决定的。探索性分析中，分析数据之前没有明确的假设，分析人员搜索能够很好地描述数据的模型。验证性分析对数据进行明确的假设检验。

第1,412行：第1,450行：

Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance on finding at least one of them to be significant, but this can be due to a type 1 error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratory in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type 1 error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.

−

对探索性数据分析的理解应该非常谨慎。当同时测试多个模型时，发现其中至少一个模型有意义的几率很高，但这可能是由于类型1错误。在测试多个模型时，总是调整显著性水平是很重要的，例如，使用邦弗朗尼校正。另外，不应该在同一数据集中进行探索性分析和验证性分析。探索性分析是用来为一个理论寻找想法，但不是用来检验这个理论。当一个模型在一个数据集中被发现是探索性的，然后在同一个数据集中进行验证性分析，这可能仅仅意味着验证性分析的结果是由于同样的1类错误导致了探索性模型放在首位。因此，验证性分析不会比最初的探索性分析更有用。

+

对探索性数据分析的解释应该非常谨慎。当同时检验多个模型时，发现其中至少一个模型具有统计学显著性的几率很高，但这可能是由 Ⅰ 类错误造成的。因此在检验多个模型时，务必调整显著性水平，例如使用Bonferroni校正。'''另外，不应该在同一数据集中先做探索性分析、再做验证性分析。one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. '''探索性分析用来为理论寻找思路，而不是同时用来检验该理论。如果在某个数据集中通过探索性分析发现了一个模型，随后又在同一数据集中进行验证性分析，那么验证性分析的结果可能只是源于最初导致该探索性模型出现的同一个 Ⅰ 类错误。因此，验证性分析不会比最初的探索性分析提供更多信息。
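To make the multiple-testing argument above concrete, the sketch below shows how quickly the chance of at least one type 1 error grows with the number of tests, and how a Bonferroni correction tightens the per-test threshold. The family-wise calculation assumes independent tests, which is a simplification; the function names and p-values are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one type 1 error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted level alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four models tested on the same data
p_values = [0.001, 0.02, 0.04, 0.3]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)

print(familywise_error_rate(0.05, len(p_values)))
print(uncorrected)
print(corrected)
```

With these invented values, three models look significant at the unadjusted 0.05 level, but only the first survives the adjusted threshold of 0.0125.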

−

====Stability of results====

+

====Stability of results 结果的稳定性====

It is important to obtain some indication about how generalizable the results are.{{sfn|Adèr|2008b|pp=361-371}} While this is often difficult to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing that.

第1,422行：第1,460行：

It is important to obtain some indication about how generalizable the results are. While this is often difficult to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing that.

−

重要的是要得到一些指示，说明这些结果是多么普遍。虽然这通常很难检验，但可以看看结果的稳定性。结果是否可靠和重现性好？有两种主要的方法来做到这一点。

+

了解结果在多大程度上可以推广是很重要的。虽然可推广性通常很难检验，但可以考察结果的稳定性：结果是否可靠、可重复？主要有两种方法可以做到这一点。

* ''[[Cross-validation (statistics)|Cross-validation]]''. By splitting the data into multiple parts, we can check if an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well. Cross-validation is generally inappropriate, though, if there are correlations within the data, e.g. with [[panel data]]. Hence other methods of validation sometimes need to be used. For more on this topic, see [[statistical model validation]].

+

* '''交叉验证Cross-validation'''。通过将数据分成多个部分，我们可以检查基于一部分数据的分析（如拟合模型）是否也可以推广到另一部分数据。不过，如果数据内部存在相关性（例如'''面板数据panel data'''），那么交叉验证通常是不适当的。因此，有时需要使用其他验证方法。有关此主题的更多信息，请参阅'''统计模型验证statistical model validation'''。

* ''[[Sensitivity analysis]]''. A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do that is via [[Bootstrapping (statistics)|bootstrapping]].

−

+

* '''敏感度分析Sensitivity analysis'''。一种在全局参数（系统地）变化时研究系统或模型行为的程序。其中一种做法是使用自助法（bootstrapping）。
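The two stability checks above can be sketched with the standard library alone. The first helper splits row indices into k random folds for cross-validation; random splitting assumes independent observations, so it would not suit correlated data such as panel data. The second resamples the data with replacement to gauge how much a statistic (here the mean) varies. All names, the seed values, and the sample data are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def bootstrap_means(values, n_resamples=1000, seed=0):
    """Means of n_resamples samples drawn with replacement from values."""
    rng = random.Random(seed)
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]


def percentile_interval(samples, level=0.95):
    """Simple percentile interval taken from the sorted bootstrap statistics."""
    ordered = sorted(samples)
    lower = int((1 - level) / 2 * len(ordered))
    upper = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lower], ordered[upper]


# Hypothetical sample
data = [2.1, 2.4, 2.9, 3.0, 3.3, 3.8, 4.1, 4.4, 5.0, 5.6]

folds = k_fold_indices(len(data), k=3)
low, high = percentile_interval(bootstrap_means(data))

print(folds)
print(low, high)
```

Each fold would serve once as the held-out part while a model is fitted on the remaining folds; a narrow bootstrap interval suggests that the estimate is not driven by a few observations.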

−

==Free software for data analysis==

嘉树

259

个编辑
