Transformation

转换
# ''Data mining''
Interpretation/evaluation.

解释 / 评估。
It exists, however, in many variations on this theme, such as the Cross-industry standard process for data mining (CRISP-DM) which defines six phases:

然而,它存在于这个主题的许多变体中,例如'''<font color="#ff8000">数据挖掘的跨行业标准流程 Cross-industry standard process for data mining,CRISP-DM</font>'''就定义了以下六个阶段:
Modeling

建模
# Evaluation
or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.

或一个简化的过程,包括:(1)预处理,(2)数据挖掘,(3)结果验证。
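上述"预处理 → 数据挖掘 → 结果验证"三步简化流程可以用一段极简的 Python 草图来示意(以下函数名与数据均为虚构示例,并非来自任何特定的库):

```python
# A toy sketch of the simplified process:
# (1) pre-processing, (2) data mining, (3) results validation.
# All names and data below are illustrative only.

def preprocess(records):
    """Step 1: data cleaning -- drop records with missing fields."""
    return [r for r in records if None not in r.values()]

def mine(records):
    """Step 2: 'mine' a trivial pattern -- the most frequent label."""
    counts = {}
    for r in records:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return max(counts, key=counts.get)

def validate(pattern, held_out):
    """Step 3: check how often the mined pattern holds on held-out data."""
    hits = sum(1 for r in held_out if r["label"] == pattern)
    return hits / len(held_out)

data = [{"label": "spam"}, {"label": "spam"}, {"label": None}, {"label": "ham"}]
clean = preprocess(data)          # the record with a missing label is dropped
pattern = mine(clean)             # most frequent label: "spam"
score = validate(pattern, clean)
```

真实流程中,验证必须在算法未见过的数据上进行,详见下文"结果验证"一节。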
Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.

2002年、2004年、2007年和2014年进行的调查显示,CRISP-DM 方法是数据挖掘者使用的主要方法。在这些调查中,唯一被提及的其他数据挖掘标准是SEMMA,然而使用CRISP-DM的人数是其3-4倍。一些研究小组已经发表了关于数据挖掘过程模型的研究,例如阿泽维多 Azevedo 和桑托斯 Santos 曾在2008年对CRISP-DM和SEMMA这两套数据挖掘流程标准进行了比较。
===预处理 Pre-processing===
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.

在使用数据挖掘算法之前,必须组装目标数据集。由于数据挖掘只能发现数据中实际存在的模式,因此目标数据集必须足够大以包含这些模式,同时保持足够简洁,以便在可接受的时间限制内进行挖掘。数据的常见来源是'''<font color="#ff8000">数据集市 Data Mart</font>'''或'''<font color="#ff8000">数据仓库 Data Warehouse</font>'''。在数据挖掘之前,对'''<font color="#ff8000">多元 Multivariate</font>'''数据集进行预处理是必不可少的,然后对目标集进行清洗。数据清洗将删除包含'''<font color="#ff8000">噪声 Noise</font>'''的观测值和'''<font color="#ff8000">缺失数据 Missing Data</font>'''的观测值。
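上文所述的数据清洗步骤(删除缺失数据与含噪声的观测值)可以用如下 Python 草图来示意。这只是一个假设性的简化示例,其中的阈值与规则均为虚构;真实系统通常会使用更稳健的方法:

```python
import statistics

def clean(values, z_thresh=3.0):
    """Toy data cleaning: drop missing entries (None), then drop 'noisy'
    observations whose z-score exceeds z_thresh. Illustrative only."""
    present = [v for v in values if v is not None]   # remove missing data
    mu = statistics.mean(present)
    sigma = statistics.pstdev(present)
    if sigma == 0:
        return present
    return [v for v in present if abs(v - mu) / sigma <= z_thresh]  # remove noise

# A single extreme value inflates sigma and can mask itself at z_thresh=3,
# so this toy example deliberately uses a tighter threshold.
raw = [10.0, 11.0, None, 9.5, 10.5, 500.0]
cleaned = clean(raw, z_thresh=1.5)
```

基于 z 分数的单遍清洗只是最简单的做法之一;正因为异常值会抬高标准差,实践中常改用中位数/绝对中位差等更稳健的统计量。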
    
===数据挖掘 Data mining===
* [[Anomaly detection]] (outlier/change/deviation detection) – The identification of unusual data records, that might be interesting or data errors that require further investigation.

'''<font color="#ff8000">异常检测 Anomaly detection</font>'''(异常值/变化/偏差检测):识别异常的数据记录,这些记录可能是有趣的模式,也可能是需要进一步调查的数据错误。
    
* [[Association rule learning]] (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

'''<font color="#ff8000">关联规则学习 Association rule learning</font>'''(依赖关系建模):搜索变量之间的关系。例如,超市可能会收集顾客购买习惯的数据。通过使用关联规则学习,超市可以确定哪些产品经常被一起购买,并将这些信息用于营销策略改进。这种分析有时被称为"市场篮子分析"。
    
* [[Cluster analysis|Clustering]] – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

'''<font color="#ff8000">聚类 Clustering</font>''':是在不使用数据中已知结构的情况下,发现数据中在某种意义上"相似"的组和结构的任务。
* [[Statistical classification|Classification]] – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

'''<font color="#ff8000">分类 Classification</font>''':是将已知结构泛化并应用于新数据的任务。例如,电子邮件程序可能会尝试将电子邮件分类为"合法邮件"或"垃圾邮件"。
    
* [[Regression analysis|Regression]] – attempts to find a function that models the data with the least error that is, for estimating the relationships among data or datasets.

'''<font color="#ff8000">回归 Regression</font>''':试图找到一个对数据建模误差最小的函数,也就是说,用于估计数据或数据集之间的关系。
    
* [[Automatic summarization|Summarization]] – providing a more compact representation of the data set, including visualization and report generation.

'''<font color="#ff8000">自动文摘 Automatic summarization</font>''':提供数据集更紧凑、简洁的表示,包括可视化和报告生成。
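上面提到的关联规则学习通常用"支持度"和"置信度"这两个基本指标来量化规则强度。下面是一个极简的市场篮子分析 Python 草图(购物篮数据为虚构示例):

```python
def support_and_confidence(baskets, a, b):
    """Toy market-basket analysis: support of {a, b} and confidence of the
    rule a -> b, over a list of transactions (each a set of items)."""
    n = len(baskets)
    n_a = sum(1 for basket in baskets if a in basket)
    n_ab = sum(1 for basket in baskets if a in basket and b in basket)
    support = n_ab / n                       # fraction of baskets with both a and b
    confidence = n_ab / n_a if n_a else 0.0  # of baskets with a, fraction also with b
    return support, confidence

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]
support, confidence = support_and_confidence(baskets, "bread", "butter")
# support = 0.5 (2 of 4 baskets), confidence = 2/3
```

真实的关联规则算法(如 Apriori)会在全部候选项集上系统地搜索满足最小支持度与置信度的规则,这里仅演示单条规则的计算方式。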
    
===结果验证 Results validation===
[[File:Spurious correlations - spelling bee spiders.svg|thumb|An example of data produced by [[data dredging]] through a bot operated by statistician Tyler Vigen, apparently showing a close link between the best word winning a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence. 一个由统计学家泰勒·维根 Tyler Vigen 操作的机器人通过数据捕捞 data dredging 产生的数据示例,表面上显示在拼字比赛中获胜的最佳单词与美国被毒蜘蛛杀死的人数之间有着密切的联系,但这种趋势上的相似显然只是一个巧合。]]

An example of data produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the best word winning a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.

一个由统计学家泰勒·维根 Tyler Vigen 操作的机器人通过数据捕捞 data dredging 产生的数据示例,表面上显示在拼字比赛中获胜的最佳单词与美国被毒蜘蛛杀死的人数之间有着密切的联系,但这种趋势上的相似显然只是一个巧合。
Data mining can unintentionally be misused, and can then produce results that appear to be significant; but which do not actually predict future behavior and cannot be [[Reproducibility|reproduced]] on a new sample of data and bear little use. Often this results from investigating too many hypotheses and not performing proper [[statistical hypothesis testing]]. A simple version of this problem in [[machine learning]] is known as [[overfitting]], but the same problem can arise at different phases of the process and thus a train/test split—when applicable at all—may not be sufficient to prevent this from happening.<ref name=hawkins>{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | url = | journal = Journal of Chemical Information and Computer Sciences | volume = 44 | issue = 1| pages = 1–12 | doi=10.1021/ci0342472| pmid = 14741005 }}</ref>
数据挖掘可能会在无意中被误用,进而产生看似重要的结果,但这些结果实际上并不能用来预测未来的行为,也不能在新的数据样本上复现,因而用处不大。这通常是由于检验了太多的假设,而没有进行适当的'''<font color="#ff8000">统计假设检验 Statistical Hypothesis Testing</font>'''。在机器学习中,这个问题的一个简单版本被称为'''<font color="#ff8000">过拟合 Overfitting</font>''',但同样的问题可能会在过程的不同阶段出现,因此,即使进行训练/测试数据划分(在适用的情况下),也可能不足以防止这种情况的发生。

{{Missing information|section|non-classification tasks in data mining. It only covers [[machine learning]]|date=September 2011}}
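下面的模拟草图说明了为什么"检验过多假设而不做适当的统计假设检验"会产生看似显著的虚假结果:即使所有候选变量都是纯噪声,只要候选数量足够多,总会有某个变量与目标"高度相关"(数据均为随机生成,仅作示意):

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
target = [rng.random() for _ in range(20)]   # the quantity we try to "explain"
# 200 candidate hypotheses, all pure noise, unrelated to the target.
candidates = [[rng.random() for _ in range(20)] for _ in range(200)]
best = max(abs(pearson(c, target)) for c in candidates)
# 'best' will look like a meaningful correlation, yet every candidate is noise.
```

这正是上图"拼字比赛与毒蜘蛛"式虚假关联的成因:搜索的假设越多,仅凭偶然出现的"强相关"就越多,因此必须配合多重检验校正或在独立数据上复现。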
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.

从数据中发现知识的最后一步是验证数据挖掘算法产生的模式是否存在于更广泛的数据集中。数据挖掘算法发现的模式并非都是有效的:算法在'''<font color="#ff8000">训练集 Training Set</font>'''中发现一般数据集中不存在的模式是很常见的,这叫做'''<font color="#ff8000">过拟合 Overfitting</font>'''。为了克服这个问题,评估会使用一组数据挖掘算法未在其上训练过的'''<font color="#ff8000">测试集 Test Set</font>'''数据:将学习到的模式应用到这个测试集上,并将结果输出与期望的输出进行比较。例如,试图区分"垃圾邮件"和"合法"邮件的数据挖掘算法将在一组电子邮件样本训练集上进行训练;训练完毕后,学到的模式将被应用于未参与训练的电子邮件测试集上,然后可以根据这些模式正确分类的电子邮件数量来衡量其准确性。可以使用几种统计方法来评估算法,如'''<font color="#ff8000">ROC 曲线 ROC Curves</font>'''。
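上文的垃圾邮件例子可以用如下草图示意"在未参与训练的测试集上评估模式准确性"的过程(其中的触发词规则与邮件内容均为虚构,仅代替一个训练好的模型):

```python
def keyword_classifier(email):
    """A toy 'learned pattern': flag an email as spam if it contains
    any trigger word. Stands in for a trained model."""
    triggers = {"prize", "winner", "free"}
    return "spam" if triggers & set(email.lower().split()) else "legitimate"

# Held-out test set that the 'pattern' above was never fitted on.
test_set = [
    ("you are a winner claim your prize", "spam"),
    ("free money inside", "spam"),
    ("meeting moved to 3pm", "legitimate"),
    ("quarterly report attached", "legitimate"),
    ("act now to win big", "spam"),   # spam that the keyword rule misses
]
correct = sum(1 for text, label in test_set if keyword_classifier(text) == label)
accuracy = correct / len(test_set)   # 4 of 5 correct
```

准确率只是一个维度;如正文所述,实践中还会用 ROC 曲线等统计方法综合评估算法。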
 
If the learned patterns do not meet the desired standards, subsequently it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
如果学习的模式不能达到预期的标准,那么就需要重新评估和修改预处理和数据挖掘的步骤。如果所学的模式确实符合所需的标准,那么最后一步就是对习得的模式进行解释并将其转化为知识。
    
==研究 Research==