更改

数据挖掘 (查看源代码)

2020年9月14日 (一) 00:23的版本

添加5,281字节、 2020年9月14日 (一) 00:23

无编辑摘要

第1行：第1行：

此词条由许菁翻译整理。

−

+

此词条由Zengsihang审校。

第9行：第9行：

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is the analysis step of the "knowledge discovery in databases" process or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

−

'''数据挖掘 Data Mining '''是在大型数据集中发现模式的过程，是一种涉及到机器学习、统计学和数据库系统综合使用的方法。数据挖掘是指“在数据库中知识发现KDD”过程中的分析步骤。除了传统的分析步骤，它还涉及数据库和数据管理方面，包括数据预处理、模型和推理考虑、兴趣度量、复杂性考虑、发现结构的后处理、可视化和在线更新等内容。

+

'''数据挖掘 Data Mining '''是在大型数据集中发现模式的过程，是一种涉及到机器学习、统计学和数据库系统综合使用的方法。数据挖掘是指“在数据库中知识发现KDD”过程中的分析步骤。除了传统的分析步骤，它还涉及数据库和数据管理方面，包括数据预处理、模型和推理考虑、'''兴趣权值考量'''、复杂性考量、发现结构的后处理、可视化和在线更新等内容。

−

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“数据挖掘是指“数据库中的知识发现KDD”的过程的分析步骤”一句中的“在数据库中知识发现KDD”处改为“数据库中的知识发现(knowledge discovery in databases,KDD)”

+

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“数据挖掘是指“数据库中的知识发现KDD”的过程的分析步骤”一句中的“在数据库中知识发现KDD”处改为“知识发现(knowledge discovery in databases,KDD)”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】 "是在大型数据集中发现模式的过程，是一种涉及到机器学习、统计学和数据库系统综合使用的方法。"一句改为“是一种在大型数据集中发现模式的过程，用到了机器学习、统计学和数据库系统的交叉方法。”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】 “数据预处理、模型和推理考虑、兴趣度量、复杂性考虑、发现结构的后处理、可视化和在线更新等内容。”一句改为“数据预处理、'''建模'''和推理'''考量'''、兴趣度量、'''复杂性考虑、发现结构的后处理'''、可视化和在线更新等内容。”

第18行：第22行：

The term "data mining" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

−

“数据挖掘”这种形容其实并不十分恰当，因为我们的目标是从大量数据中提取模式和知识，而不是数据本身的提取(挖掘)。它是一个流行语，经常用于任何形式的大规模数据或信息处理（收集、提取、仓储、分析和统计）的场景下，以及''' 计算机决策系统 Decision Support System，DSS'''的任何应用当中，包括人工智能（例如机器学习）和商业智能。《数据挖掘：使用Java的实用机器学习工具和技术》（主要涵盖机器学习材料）一书最初被命名为“实用机器学习”，而数据挖掘一词只是为了营销的原因而增加。经常更一般的术语例如（大规模）数据分析和分析——或当提到实际的方法时使用人工智能和机器学习这样的词语更加合适。

+

“数据挖掘”这种形容其实并不十分恰当，因为我们的目标是从大量数据中提取模式和知识，而不是数据本身的提取(挖掘)。它是一个流行语，经常用于任何形式的大规模数据或信息处理（收集、提取、仓储、分析和统计）的场景下，以及''' 计算机决策系统 Decision Support System，DSS'''的任何应用当中，包括人工智能（例如机器学习）和商业智能。《数据挖掘：使用Java的实用机器学习工具和技术》（主要涵盖机器学习材料）一书最初被命名为《实用机器学习》，而数据挖掘一词只是为了营销的原因而增加。经常更一般的术语例如（大规模）数据分析和分析——或当提到实际的方法时使用人工智能和机器学习这样的词语更加合适。

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“经常更一般的术语例如（大规模）数据分析和分析——或当提到实际的方法时使用人工智能和机器学习这样的词语更加合适”一句改为“经常来说，更一般的术语如（大规模）数据分析，或实际的方法如人工智能和机器学习，是更合适的表达方式”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“'数据挖掘'这种形容其实并不十分恰当”一句改为““数据挖掘”这种形容其实并不'''太'''恰当”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“它是一个流行语，经常用于任何形式的大规模数据或信息处理（收集、提取、仓储、分析和统计）的场景下,以及''' 计算机决策系统 Decision Support System，DSS'''的任何应用当中，包括人工智能（例如机器学习）和商业智能。”一句改为“它是一个经常被用于各种大规模数据或信息处理（收集、提取、存储、分析和统计），以及包括人工智能（例如机器学习）和商业智能的''' 计算机决策系统 Decision Support System，DSS'''等场合的流行语”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“（主要涵盖机器学习材料）”一句改为“主要提供了一些机器学习的资料”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“而数据挖掘一词只是为了营销的原因而增加”改为“而数据挖掘一词只是为了销量更好而增加的”

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records ([[cluster analysis]]), unusual records ([[anomaly detection]]), and dependencies ([[association rule mining]], [[sequential pattern mining]]). This usually involves using database techniques such as [[spatial index|spatial indices]]. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and [[predictive analytics]]. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a [[decision support system]]. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps.

第26行：第38行：

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps.

−

实际上数据挖掘任务是对大量数据进行半自动或全自动分析，以提取出从前未知的且有趣的模式，如数据记录组(数据聚类)、异常记录组(异常检测)和依赖关系(关联规则挖掘，序列挖掘)。这通常涉及使用数据库技术，如空间索引。这些模式可以被看作是输入数据的一种汇总，并且可以用于进一步的分析，或者，例如，机器学习和预测分析。例如，数据挖掘步骤可以识别数据中的多个组，然后可以使用该步骤通过决策支持系统获得更准确的预测结果。数据收集、数据准备、结果解释和报告都不是数据挖掘步骤的一部分，而是作为附加步骤属于整个 KDD 过程。

+

实际的数据挖掘任务是对大量数据进行半自动或全自动分析，以提取出从前未知的且有趣的模式，如数据记录组(数据聚类)、异常记录组(异常检测)和依赖关系(关联规则挖掘，序列挖掘)。这通常涉及使用数据库技术，如空间索引。这些模式可以被看作是输入数据的一种汇总，并且可以用于进一步的分析，例如机器学习和预测分析。例如，数据挖掘步骤可以识别数据中的多个组，然后可以使用该步骤通过决策支持系统获得更准确的预测结果。数据收集、数据准备、结果解释和报告都不是数据挖掘步骤的一部分，而是作为附加步骤属于整个 KDD 过程。

+

如数据记录组（'''聚类分析 Cluster Analysis'''）、异常记录（'''异常检测 Anomaly Detection'''）和依赖关系（'''关联规则挖掘 Association Rule Mining'''、'''序列模式挖掘 Sequential Pattern Mining'''）。这通常涉及到使用数据库技术，如空间索引。这些模式可以被看作是输入数据的一种规律总结，可以用于进一步的分析，例如用于机器学习和预测分析。例如，通过数据挖掘可以识别出数据中的多个组，然后决策支持系统可以利用这些组获得更准确的预测结果。数据收集、数据准备、结果解释和报告都不是数据挖掘步骤的一部分，而是整个KDD过程附加的步骤。

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】 “以提取出从前未知的且有趣的模式”改为“以发掘从前未知的且新奇的模式”

−

如数据记录组（'''聚类分析 Cluster Analysis'''）、异常记录（'''异常检测 Anomaly Detection'''）和依赖关系（'''关联规则挖掘 Association Rule Mining'''、'''序列模式挖掘 Sequential Pattern Mining'''）。这通常涉及到使用数据库技术，如空间索引。这些模式可以被看作是输入数据的一种规律总结，可以用于进一步的分析，或者，例如，在机器学习和预测分析中。例如，通过数据挖掘可以出识别数据中的多个组，然后这些组可以通过使用决策支持系统来获得更准确的预测结果。数据收集、数据准备、结果解释和报告都不是数据挖掘步骤的一部分，而是属于整个KDD过程的附加步骤。

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“ 例如，数据挖掘步骤可以识别数据中的多个组”改为“例如数据挖掘的过程中可以把数据分成多个组”

The difference between [[data analysis]] and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.<ref>Olson, D. L. (2007). Data mining in business services. ''Service Business'', ''1''(3), 181-193. {{doi|10.1007/s11628-006-0014-7}}</ref>

第34行：第50行：

The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.

−

'''数据分析 Data Analysis'''和数据挖掘的区别在于，数据分析用于测试数据集上的模型和假设，例如，分析营销活动的有效性，而不考虑数据量的多少；相反，数据挖掘使用机器学习和统计模型来发现“大量”数据中的秘密或隐藏模式。

+

'''数据分析 Data Analysis'''和数据挖掘的区别在于，数据分析用于测试数据集上的模型和假设，例如，分析营销活动的有效性，而不考虑数据量的多少；相反，数据挖掘使用机器学习和统计模型来发现“大量”数据中的秘密或隐藏的模式。

第42行：第58行：

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

−

相关术语'''“数据疏浚” Data Dredging'''、“数据钓鱼”和“数据窥探”是指使用数据挖掘方法对较大的人口数据集中的一部分进行抽样，这些数据集太小（或可能太小），无法对所发现的任何模式的有效性作出可靠的统计推断。但是，这些方法可以用于提出新的假设，以针对更大的数据群体进行测试。

+

相关术语'''“数据疏浚” Data Dredging'''、“数据钓鱼”和“数据窥探”是指使用数据挖掘的方法对较大的人口数据集中的一部分进行抽样，这些数据集可能太小，无法对所发现的任何模式的有效性作出可靠的统计推断。但是，这些方法可以用于提出新的假设，以针对更大的数据群体进行测试。

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“使用数据挖掘方法对较大的人口数据集中的一部分进行抽样”中的“较大的人口数据集”改为“较大规模的数据集”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“无法对所发现的任何模式的有效性作出可靠的统计推断”改为“无法可靠统计推断发现模式的有效性”

+

==起源 Etymology==

第54行：第73行：

在20世纪60年代，统计学家和经济学家们曾经使用“数据钓鱼”或“数据疏浚”等术语来指代他们认为在没有先验假设的情况下进行数据分析的糟糕做法。经济学家迈克尔·洛弗尔 Michael Lovell 在1983年《经济研究评论》（Review of Economic Studies）上发表的一篇文章中，也以类似的批判方式使用了“数据挖掘”这个术语。Lovell 指出，这种做法“伪装成各种别名，从“实验”(正面)到“钓鱼”或“窥探”(负面)”。

−

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“这种做法“伪装成各种别名，从“实验”(正面)到“钓鱼”或“窥探”(负面)。”改为“这种做法有很多别名，比如正面说法“实验”，负面说法“钓鱼”、“窥探”等。”

The term ''data mining'' appeared around 1990 in the database community, generally with positive connotations. For a short time in 1980s, a phrase "database mining"™, was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation;<ref name="Mena">{{cite book |last=Mena |first=Jesús |year=2011 |title=Machine Learning Forensics for Law Enforcement, Security, and Intelligence |location=Boca Raton, FL |publisher=CRC Press (Taylor & Francis Group) |isbn=978-1-4398-6069-4 }}</ref> researchers consequently turned to ''data mining''. Other terms used include ''data archaeology'', ''information harvesting'', ''information discovery'', ''knowledge extraction'', etc. [[Gregory I. Piatetsky-Shapiro|Gregory Piatetsky-Shapiro]] coined the term "knowledge discovery in databases" for the first workshop on the same topic [http://www.kdnuggets.com/meetings/kdd89/ (KDD-1989)] and this term became more popular in [[Artificial intelligence|AI]] and [[machine learning]] community. However, the term data mining became more popular in the business and press communities.<ref>{{cite web |last1=Piatetsky-Shapiro |first1=Gregory |authorlink1=Gregory Piatetsky-Shapiro |last2=Parker |first2=Gary |url=http://www.kdnuggets.com/data_mining_course/x1-intro-to-data-mining-notes.html |title=Lesson: Data Mining, and Knowledge Discovery: An Introduction |publisher=KD Nuggets |year=2011 |work=Introduction to Data Mining |accessdate=30 August 2012 }}</ref> Currently, the terms ''data mining'' and ''knowledge discovery'' are used interchangeably.

第60行：第79行：

The term data mining appeared around 1990 in the database community, generally with positive connotations. For a short time in 1980s, a phrase "database mining"™, was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation; researchers consequently turned to data mining. Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the same topic (KDD-1989) and this term became more popular in AI and machine learning community. However, the term data mining became more popular in the business and press communities. Currently, the terms data mining and knowledge discovery are used interchangeably.

−

数据挖掘这个术语在1990年左右出现在数据库领域，通常有着积极的内涵。在20世纪80年代的一段短暂时间里，人们曾使用过“数据库挖掘”这种表达，但由于这个词被总部位于圣地亚哥的 HNC 公司注册为商标，因此研究人员转向了数据挖掘。曾用过的其他术语包括数据考古学、信息收集、信息发现、知识提取等。格雷戈里·皮亚特斯基·夏皮罗 Gregory Piatetsky-Shapiro 在关于这个主题的第一个研讨会[ http://www.kdnuggets.com/meetings/kdd89/ (KDD-1989)] 上首次创造了“数据库中的知识发现 Knowledge Discovery in Databases，KDD”这个术语。此后，这个术语在人工智能和机器学习社区中变得更加流行。然而，数据挖掘这个术语在商业和出版界变得越来越流行。目前，数据挖掘和知识发现 knowledge discovery这两个术语可以互换使用。

+

数据挖掘这个术语在1990年左右出现在数据库领域，通常有着积极的内涵。在20世纪80年代的一段短暂的时间里，人们曾使用过“数据库挖掘”这种表达，但由于这个词被总部位于圣地亚哥的 HNC 公司注册为商标，因此研究人员转向了数据挖掘。曾用过的其他术语包括数据考古学、信息收集、信息发现、知识提取等。格雷戈里·皮亚特斯基·夏皮罗 Gregory Piatetsky-Shapiro 在关于这个主题的第一个研讨会 [http://www.kdnuggets.com/meetings/kdd89/ (KDD-1989)] 上首次提出了“数据库中的知识发现 Knowledge Discovery in Databases，KDD”这个术语。此后，这个术语在人工智能和机器学习领域中变得更加流行。然而，数据挖掘这个术语在商业和出版界变得越来越流行。目前，数据挖掘和知识发现 knowledge discovery这两个术语可以互换使用。

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“但由于这个词被总部位于圣地亚哥的 HNC 公司注册为商标”中的“总部位于圣地亚哥的HNC公司”改为“圣地亚哥的HNC公司”

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“这个术语在人工智能和机器学习社区中变得更加流行”中的“社区”改为“群体”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“数据挖掘这个术语在1990年左右出现在数据库领域，通常有着积极的内涵。”一句改为“数据挖掘这个术语在1990年左右在数据库领域出现，通常有着积极的含义”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“因此研究人员转向了数据挖掘”改为“因此研究人员改用了数据挖掘这个词”

In the academic community, the major forums for research started in 1995 when the First International Conference on Data Mining and Knowledge Discovery ([[KDD-95]]) was started in Montreal under [[AAAI]] sponsorship. It was co-chaired by [[Usama Fayyad]] and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the journal by Kluwer called [[Data Mining and Knowledge Discovery]] as its founding editor-in-chief. Later he started the [[SIGKDD]] Newsletter SIGKDD Explorations.<ref name=SIGKDD-explorations>{{cite journal|last1=Fayyad|first1=Usama|title=First Editorial by Editor-in-Chief|journal=SIGKDD Explorations|date=15 June 1999|volume=13|issue=1|pages=102|doi=10.1145/2207243.2207269|url=http://www.kdd.org/explorations/view/june-1999-volume-1-issue-1|accessdate=27 December 2010|ref=SIGKDD-explorations}}</ref> The KDD International conference became the primary highest quality conference in data mining with an acceptance rate of research paper submissions below 18%. The journal ''Data Mining and Knowledge Discovery'' is the primary research journal of the field.

第71行：第94行：

In the academic community, the major forums for research started in 1995 when the First International Conference on Data Mining and Knowledge Discovery (KDD-95) was started in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the journal by Kluwer called Data Mining and Knowledge Discovery as its founding editor-in-chief. Later he started the SIGKDD Newsletter SIGKDD Explorations.The KDD International conference became the primary highest quality conference in data mining with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field.

−

在学术界，主要的研究论坛始于1995年，当时，在AAAI的赞助下，第一届数据挖掘和知识发现国际会议（KDD-95）在蒙特利尔召开。会议由乌萨马·法耶兹 Usama Fayyad和拉玛萨米·乌图鲁萨米 Ramasamy Uthurusamy共同主持。一年后，1996年Usama Fayyad创办了杂志《数据挖掘与知识发现》（datamining and Knowledge Discovery），担任创始主编。后来他创办了SIGKDD时事通讯探索。那个KDD国际会议也成为了数据挖掘领域质量最高的主要会议，其研究论文提交的接受率低于18%，而《数据挖掘与知识发现》也成为了该领域的主要研究期刊。

+

学术界主要的研究论坛始于1995年，当时，在AAAI的赞助下，第一届数据挖掘和知识发现国际会议（KDD-95）在蒙特利尔召开。会议由乌萨马·法耶兹 Usama Fayyad和拉玛萨米·乌图鲁萨米 Ramasamy Uthurusamy共同主持。一年后，即1996年，Usama Fayyad创办了Kluwer出版的期刊《数据挖掘与知识发现》（Data Mining and Knowledge Discovery），并担任创始主编。后来他创办了SIGKDD的通讯刊物《SIGKDD Explorations》。KDD国际会议成为了数据挖掘领域质量最高的主要会议，其研究论文的录用率低于18%，而《数据挖掘与知识发现》是该领域的主要研究期刊。

+

==背景 Background==

第79行：第104行：

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, specially in the field of machine learning, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.

−

从数据中手动提取模式的方法已经持续了好几个世纪了。早期识别数据模式的方法包括17世纪的'''贝叶斯定理 Bayes' Theorem'''和19世纪的'''回归分析 Regression Analysis'''。计算机技术的扩散、其普遍性和日益强大的能力极大地提高了数据的收集、存储和操作能力。随着数据集的规模和复杂性的增长，手动分析数据的方法越来越多地被更强的间接、自动化的数据处理所取代，这都得益于计算机科学其他领域取得的新的进步，特别是机器学习领域的'''神经网络 Neural Networks'''、'''聚类分析 Cluster Analysis'''、'''遗传算法 Genetic Algorithms'''（1950年代），'''决策树 Decision Tree'''和'''决策规则 Decision Rules'''（1960年代）以及'''支持向量机 Support Vector Machines'''（1990年代）等。数据挖掘就是应用这些方法来发现大型数据集中的隐藏模式的过程。它利用数据在数据库中存储和索引的方式，更有效地执行实际的学习和发现算法，从而弥补了从应用统计学和人工智能(通常提供数学背景)到数据库管理之间的差距，使这些方法能够应用于更大的数据集。

+

从数据中手动提取模式的方法已经持续了好几个世纪了。早期识别数据模式的方法包括18世纪的'''贝叶斯定理 Bayes' Theorem'''和19世纪的'''回归分析 Regression Analysis'''。计算机技术的扩散、其普遍性和日益强大的能力极大地提高了数据的收集、存储和操作能力。随着数据集的规模和复杂性的增长，手动分析数据的方法越来越多地被更有力的间接、自动化的数据处理所取代，这都得益于计算机科学其他领域取得的新的进步，特别是机器学习领域的'''神经网络 Neural Networks'''、'''聚类分析 Cluster Analysis'''、'''遗传算法 Genetic Algorithms'''（1950年代），'''决策树 Decision Tree'''和'''决策规则 Decision Rules'''（1960年代）以及'''支持向量机 Support Vector Machines'''（1990年代）等。数据挖掘就是应用这些方法来发现大型数据集中的隐藏模式的过程。它利用数据在数据库中存储和索引的方式，更有效地执行实际的学习和发现算法，从而弥补了从应用统计学和人工智能(通常提供数学背景)到数据库管理之间的差距，使这些方法能够应用于更大的数据集。

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】“手动分析数据的方法越来越多地被更强的间接、自动化的数据处理所取代”中的“手动分析数据”改为“直接、手动的分析数据”

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“计算机技术的扩散、其普遍性和日益强大的能力”改为“计算机技术的广泛使用和其能力的日益提高”

==发展过程 Process==

第89行：第115行：

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

−

数据库中的知识发现 Knowledge Discovery in Databases ，KDD过程通常定义为以下几个阶段:

+

知识发现 Knowledge Discovery in Databases ，KDD过程通常定义为以下几个阶段:

第129行：第155行：

It exists, however, in many variations on this theme, such as the Cross-industry standard process for data mining (CRISP-DM) which defines six phases:

−

然而，它存在于这个主题的许多变体中，例如在'''数据挖掘的跨行业标准流程 Cross-industry standard process for data mining，CRISP-DM'''中它定义了以下六个阶段：

+

不过，这一流程存在许多变体，例如'''数据挖掘的跨行业标准流程 Cross-industry standard process for data mining，CRISP-DM'''定义了以下六个阶段：

第182行：第208行：

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.

−

在这些调查中，唯一的其他数据挖掘标准是SEMMA。然而，使用CRISP-DM的人数是其3-4倍。一些研究小组已经发表了关于数据挖掘过程模型的研究，例如阿泽维多 Azevedo和桑托斯Santos曾在2008年对CRISP-DM和SEMMA这两套数据挖掘流程标准进行了比较。

+

在这些调查中，唯一使用的其他数据挖掘标准是SEMMA。然而，使用CRISP-DM的人数是其3-4倍。一些研究小组已经发表了关于数据挖掘过程模型的研究，例如阿泽维多 Azevedo和桑托斯Santos曾在2008年对CRISP-DM和SEMMA这两套数据挖掘流程标准进行了比较。

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】开头添加“2002、2004、2007、2014年的调查显示，CRISP-DM标准是数据挖掘者最常用的标准”

第214行：第240行：

* [[Association rule learning]] (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

−

'''关联规则学习 Association rule learning'''（依赖关系建模）：探变量之间的关系。例如，超市可能会收集顾客购买习惯的数据。通过使用关联规则学习，超市可以确定哪些产品经常被一起购买，并将这些信息用于营销策略改进。这种研究有时被称为“市场篮子分析”。

+

'''关联规则学习 Association rule learning'''（依赖关系建模）：探寻变量之间的关系。例如，超市可能会收集顾客购买习惯的数据。通过使用关联规则学习，超市可以确定哪些产品经常被一起购买，并将这些信息用于营销策略改进。这种研究有时被称为“市场篮子分析”。

* [[Cluster analysis|Clustering]] – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
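The supermarket example in the association rule learning item above can be sketched in a few lines. This is a minimal illustration, not a production algorithm such as Apriori: the transactions, the function name `pair_rules`, and the thresholds are all assumptions made up for the example. It counts how often pairs of items occur together (support) and how often the second item appears given the first (confidence).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is the content of one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_rules(baskets, min_support=0.5, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) tuples
    for item pairs that pass both thresholds."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data only the bread/butter pair is frequent enough to yield rules; real market basket analysis would also consider larger item sets and measures such as lift.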

第248行：第274行：

Data mining can unintentionally be misused, and can then produce results that appear to be significant; but which do not actually predict future behavior and cannot be reproduced on a new sample of data and bear little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process and thus a train/test split—when applicable at all—may not be sufficient to prevent this from happening.

−

数据挖掘可能会在无意中被误用，然后产生看似重要的结果; 但这些结果实际上并不能用来预测未来的行为，也不能在新的数据样本上进行复现，而且用处不大。这通常是由于调查了太多的假设，而没有进行适当的'''统计假设检验 Statistical Hypothesis Testing'''。在机器学习中，这种问题可以被简称为'''过拟合 Overfitting'''，但相同的问题可能会在过程的不同阶段出现，因此，在完全适用的情况下，合理进行训练/测试分割这一种方法可能不足以防止这种情况的发生。

+

数据挖掘可能会在无意中被误用，从而产生看似显著的结果；但这些结果实际上并不能用来预测未来的行为，也不能在新的数据样本上复现，因而用处不大。这通常是由于检验了太多的假设，而没有进行适当的'''统计假设检验 Statistical Hypothesis Testing'''。在机器学习中，这个问题的一种简单形式被称为'''过拟合 Overfitting'''，但相同的问题可能会在过程的不同阶段出现，因此训练/测试分割（即便在适用的情况下）也可能不足以防止这种情况的发生。
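The effect described above, investigating too many hypotheses without proper testing, can be illustrated with pure noise. The sketch below is an assumption-laden toy: the function name `dredge`, the sample sizes, and the seed are invented for the example. It generates random labels and random binary features, picks the feature (and polarity) that best "predicts" the labels on one half of the rows, and then scores the same rule on the other half.

```python
import random


def dredge(n_rows=40, n_features=200, seed=0):
    """Select the best of many noise 'hypotheses' on the training half,
    then measure the same rule on the held-out half."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    features = [
        [rng.randint(0, 1) for _ in range(n_rows)] for _ in range(n_features)
    ]
    half = n_rows // 2

    def accuracy(feature, flip, lo, hi):
        # flip=1 inverts the feature, so each feature gives two hypotheses.
        hits = sum((feature[i] ^ flip) == labels[i] for i in range(lo, hi))
        return hits / (hi - lo)

    # "Investigate" every (feature, polarity) hypothesis on the training half.
    best_feature, best_flip = max(
        ((f, flip) for f in range(n_features) for flip in (0, 1)),
        key=lambda h: accuracy(features[h[0]], h[1], 0, half),
    )
    train_acc = accuracy(features[best_feature], best_flip, 0, half)
    test_acc = accuracy(features[best_feature], best_flip, half, n_rows)
    return train_acc, test_acc


train_acc, test_acc = dredge()
print(f"best rule on training half: {train_acc:.2f}")
print(f"same rule on held-out half: {test_acc:.2f}")
```

Because the winning rule is chosen from several hundred candidates, its training accuracy is at least 0.5 by construction and will usually look much better than chance, while nothing in the data supports it on the held-out half. A correction for multiple comparisons would be the statistical remedy; the sketch only shows the symptom.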

{{Missing information|section|non-classification tasks in data mining. It only covers [[machine learning]]|date=September 2011}}

第257行：第283行：

从数据中发现知识的最后一步是验证数据挖掘算法产生的模式是否存在于更广泛的数据集中。数据挖掘算法发现的并非所有模式都是有效的，因为对于数据挖掘算法来说，在训练集中发现一般数据集中没有的模式是很常见的，这叫做'''过拟合 Overfitting'''。为了克服这个问题，评估使用一组测试数据，而数据挖掘算法并没有在这些测试数据上进行训练。然后将学习到的模式应用到这个'''测试集 Test Set'''中，并将结果输出与期望的输出进行比较。例如，试图区分“垃圾邮件”和“合法”邮件的数据挖掘算法将根据一组电子邮件'''训练集 Training Set'''样本进行训练。训练完毕后，学到的模式将应用于未经训练的那部分电子邮件测试集数据上。然后，可以从这些模式正确分类的电子邮件数量来衡量这些模式的准确性。有几种统计方法可以用来评估算法，如'''ROC 曲线 ROC curves'''。

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“为了克服这个问题，评估使用一组测试数据，而数据挖掘算法并没有在这些测试数据上进行训练”改为“为了解决这个问题，评估时会使用一组没有用在训练数据挖掘算法中用到的测试数据”
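The spam example above can be made concrete with a deliberately naive learner. Everything in this sketch is hypothetical: the emails, the rule "a word seen only in spam marks a message as spam", and the function names. It is meant only to show the evaluation step, in which patterns learned on the training set are scored on emails the algorithm never saw.

```python
# Hypothetical labelled emails: (text, is_spam).
train = [
    ("win cash prize now", True),
    ("cheap prize offer", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
test = [
    ("claim your prize", True),
    ("monday project meeting", False),
    ("cash offer inside", True),
    ("win the monday quiz", False),
]


def learn_spam_words(examples):
    """Learn the 'pattern': words seen in spam but never in legitimate mail."""
    spam_words, ham_words = set(), set()
    for text, is_spam in examples:
        (spam_words if is_spam else ham_words).update(text.split())
    return spam_words - ham_words


def accuracy(spam_words, examples):
    """Fraction of emails whose predicted label matches the true label."""
    correct = sum(
        any(word in spam_words for word in text.split()) == is_spam
        for text, is_spam in examples
    )
    return correct / len(examples)


spam_words = learn_spam_words(train)
train_accuracy = accuracy(spam_words, train)  # 1.0 on this toy data
test_accuracy = accuracy(spam_words, test)    # 0.75 on this toy data
print(train_accuracy, test_accuracy)
```

The rule is perfect on the training emails but misclassifies the legitimate message containing "win", which is the gap between training and test performance that the evaluation step is designed to expose.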

If the learned patterns do not meet the desired standards, subsequently it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

第307行：第335行：

为数据挖掘过程定义了一些标准，例如1999年欧洲跨行业数据挖掘标准流程（CRISP-DM 1.0）和2004年Java数据挖掘标准（JDM 1.0）。这些标准的后续版本（CRISP-DM 2.0和 JDM 2.0）的开发活跃于2006年，但此后一直停滞不前。JDM 2.0没有达成最终草案就被撤销了。

−

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“为数据挖掘过程定义了一些标准”改为“人们曾努力为数据挖掘过程定义标准

+

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“为数据挖掘过程定义了一些标准”改为“人们曾努力为数据挖掘过程定义标准”

For exchanging the extracted models – in particular for use in [[predictive analytics]] – the key standard is the [[Predictive Model Markup Language]] (PMML), which is an [[XML]]-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) [[subspace clustering]] have been proposed independently of the DMG.<ref>{{Cite book | last1 = Günnemann | first1 = Stephan | last2 = Kremer | first2 = Hardy | last3 = Seidl | first3 = Thomas | doi = 10.1145/2023598.2023605 | chapter = An extension of the PMML standard to subspace clustering models | title = Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11 | pages = 48 | year = 2011 | isbn = 978-1-4503-0837-3 | pmid = | pmc = }}</ref>

第313行：第341行：

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.

−

为了交换所提取的模型，特别是在预测分析中使用，关键的标准是预测模型标记语言 PMML，这是一种基于 XML 的语言，由数据挖掘集团 DMG 开发，并支持作为交换格式的许多数据挖掘应用程序。顾名思义，它只涵盖预测模型，这是一项对业务应用程序非常重要的特殊数据挖掘任务。然而，覆盖子空间聚类的扩展已经独立于 DMG 被提出。

+

为了交换所提取的模型，特别是在预测分析中使用，关键的标准是预测模型标记语言 PMML，这是一种基于 XML 的语言，由数据挖掘集团 DMG 开发，并被许多数据挖掘应用程序支持作为交换格式。顾名思义，它只涵盖预测模型，这是一项特殊的在商业应用中非常重要的数据挖掘任务。然而，覆盖子空间聚类的扩展已经独立于 DMG 被提出。

−

==显著用途 Notable uses==

+

==主要用途 Notable uses==

第328行：第356行：

数据挖掘在任何有数字数据可用的地方都可以被使用。数据挖掘的著名例子可以在商业、医学、科学和监控领域找到。

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】 “数据挖掘的著名例子可以在商业、医学、科学和监控领域找到。”改为“在商业、医学、科学和监管领域都有数据挖掘的主要应用”

==隐私问题和道德规范 Privacy concerns and ethics==

第335行：第365行：

While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to peoples' behavior (ethical and otherwise).

−

虽然“数据挖掘”这个术语本身可能没有伦理含义，但它通常与人们行为（伦理和其他）相关的信息挖掘有关。

+

虽然“数据挖掘”这个术语本身可能没有伦理含义，但它通常与人们伦理和其他行为相关的信息挖掘有关。

第343行：第373行：

The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.

−

在某些情况下，数据挖掘的使用方式可能会引发隐私、合法性和伦理问题。特别是，为国家安全或执法目的而进行的政府或商业数据集的数据挖掘，如在全面信息意识项目或在 ADVISE 中，引起了隐私问题。

+

在某些情况下，数据挖掘的使用方式可能会引发隐私、合法性和伦理问题。特别是，出于国家安全或执法目的而进行的政府或商业数据集的数据挖掘，如在全面信息意识项目或在 ADVISE 中，引起了隐私问题。

第350行：第380行：

Data mining requires data preparation which uncovers information or patterns which compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent). This is not data mining per se, but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.

−

数据挖掘需要进行数据准备，以发现损害机密性和隐私义务的信息或模式。实现这一点的一种常见方式是通过'''数据聚合 Data Aggregation'''。数据聚合包括以一种便于分析的方式将数据（可能来自不同的来源）组合在一起（但这也可能使私人、个人级别的数据的识别变得可推断或以其他方式显而易见）。但这并不是数据挖掘本身，而是在分析之前以及为分析目的准备数据的结果。当数据被编译后，数据挖掘者或任何有权访问新编译的数据集的人能够识别特定的个人，特别是当数据最初是匿名的时，对个人隐私的威胁就开始发挥作用了。

+

数据挖掘需要进行数据准备，以发现损害机密性和隐私义务的信息或模式。实现这一点的一种常见方式是通过'''数据聚合 Data Aggregation'''。数据聚合包括以一种便于分析的方式将数据（可能来自不同的来源）组合在一起（但这也可能使私人、个人级别的数据识别变得可推断或以其他方式显而易见）。但这并不是数据挖掘导致的，而是在分析之前以及为分析目的准备数据的结果。当数据被编译后，数据挖掘者或任何有权访问新编译的数据集的人能够识别特定的个人，特别是当数据最初是匿名的时，对个人隐私的威胁就开始发挥作用了。

−

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“对个人隐私的威胁就开始发挥作用了”改为“就会产生对个人隐私的威胁”

+

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“对个人隐私的威胁就开始发挥作用了”改为“就会对个人隐私产生威胁”
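The data aggregation risk described above can be sketched with two small tables. All records, field names, and the helper `reidentify` are hypothetical. The "anonymized" table has no names, but it shares quasi-identifiers (ZIP code, birth year, sex) with a public list that does, so combining the two can single out an individual.

```python
# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02138", "birth_year": 1954, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1987, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1987, "sex": "M", "diagnosis": "flu"},
]
# Hypothetical public list that includes names.
public = [
    {"name": "Alice", "zip": "02138", "birth_year": 1954, "sex": "F"},
    {"name": "Bob", "zip": "02139", "birth_year": 1987, "sex": "M"},
    {"name": "Carol", "zip": "02140", "birth_year": 1990, "sex": "F"},
]
KEYS = ("zip", "birth_year", "sex")


def reidentify(anon_rows, named_rows, keys=KEYS):
    """Return {name: anonymized_row} for people whose key combination matches
    exactly one anonymized row and exactly one named row."""
    def index(rows):
        groups = {}
        for row in rows:
            groups.setdefault(tuple(row[k] for k in keys), []).append(row)
        return groups

    anon_groups, named_groups = index(anon_rows), index(named_rows)
    return {
        named[0]["name"]: anon_groups[key][0]
        for key, named in named_groups.items()
        if len(named) == 1 and len(anon_groups.get(key, [])) == 1
    }


matches = reidentify(medical, public)
for name, row in matches.items():
    print(name, "->", row["diagnosis"])
```

Only Alice is re-identified here, because her combination of attributes is unique in both tables. Bob is protected by sharing his combination with a second medical record, which is the intuition behind grouping-based anonymization approaches such as k-anonymity.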

It is recommended{{whom|date=August 2019}} to be aware of the following '''before''' data are collected:<ref name="NASCIO" />

第385行：第415行：

数据也可以被修改成匿名的，这样个人就不容易被修改了确定。但是，甚至“匿名化”的数据集也可能包含足够的信息用来识别个人，就像记者能够根据一组无意中搜索历史找到几个个人一样美国在线发布。

−

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“这样个人就不容易被修改了确定”改为“这样个人就不会轻易的被识别”

+

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“这样个人就不容易被修改了确定”改为“这样个人就不会轻易地被识别”

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“就像记者能够根据一组无意中搜索历史找到几个个人一样美国在线发布”改为“就像记者能够依据‘美国在线’无意中发布的用户历史记录找到一些个人”

第406行：第436行：

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of the consumers. However, the U.S.-E.U. Safe Harbor Principles, developed between 1998 and 2000, currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States have failed.

−

欧洲有相当强的隐私法，正在努力进一步加强消费者的权利。然而，1998年至2000年期间制定的《美国-欧盟安全港原则》（U.S.-E.U.Safe Harbor Principles）目前有效地使欧洲用户受到美国公司的隐私剥削。由于爱德华·斯诺登 Edward Snowden披露了全球监控信息后，关于撤销这一协议的讨论越来越多，尤其是数据将完全暴露给国家安全局，与美国达成协议的尝试也失败了。

+

欧洲有相当严密的隐私法，正在努力进一步加强消费者的权利。然而，1998年至2000年期间制定的《美国-欧盟安全港原则》（U.S.-E.U.Safe Harbor Principles）目前有效地使欧洲用户受到美国公司的隐私剥削。在爱德华·斯诺登 Edward Snowden披露全球监控信息之后，关于撤销这一协议的讨论越来越多，尤其是因为这些数据将完全暴露给美国国家安全局，而与美国达成协议的尝试也已失败。

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】"目前有效地使欧洲用户受到美国公司的隐私剥削"一句改为"在当下让欧洲用户的隐私泄露给美国公司以利用”

===美国的情况 Situation in the United States===

第414行：第445行：

In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals." This underscores the necessity for data anonymity in data aggregation and mining practices.

−

在美国，美国国会通过了《健康保险便携性和责任法案》（HIPAA）等监管措施，解决了隐私问题。HIPAA要求个人就其提供的信息及其当前和未来的预期用途给予“知情同意”。根据《生物技术商业周刊》的一篇文章，“在实践中，HIPAA可能不会比研究领域长期存在的法规提供更大的保护。”。更重要的是，该规则通过知情同意进行保护的目标是接近普通个人的不可理解程度。”这突出了数据聚合和挖掘实践中数据匿名的必要性。

+

在美国，美国国会通过了《健康保险便携性和责任法案》（HIPAA）等监管措施来应对隐私问题。HIPAA要求个人就其提供的信息及其当前和未来的预期用途给予“知情同意”。根据《生物技术商业周刊》的一篇文章，“AAHC表示：‘实际上，HIPAA可能不会比研究领域长期存在的法规提供更好的保护。’更重要的是，该规则通过知情同意提供保护的目标，已接近普通人难以理解的程度。”这突出了数据聚合和挖掘实践中数据匿名的必要性。

第435行：第466行：

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, it may be that there is no copyright but database rights may exist so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, this led to the UK government to amend its copyright law in 2014 to allow content mining as a limitation and exception. The UK was the second country in the world to do so after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.

−

根据欧洲版权法和数据库法，未经版权所有人许可而对版权作品进行挖掘（如通过网络挖掘）是不合法的。在欧洲，如果数据库是纯数据，可能没有版权，但数据库权利可能存在，因此数据挖掘受数据库指令保护的知识产权所有者的权利约束。根据《哈格里夫斯评论》（Hargreaves review）的建议，这导致英国政府在2014年修订了版权法，允许将内容挖掘作为一种限制和例外。英国是继日本之后世界上第二个这样做的国家，日本在2009年引入了数据挖掘的例外。然而，由于信息社会指令（2001年）的限制，英国是例外情况但是只允许非商业目的的内容挖掘。英国版权法也不允许合同条款和条件推翻这一规定。

+

根据欧洲版权法和数据库法，未经版权所有人许可而对版权作品进行挖掘（如通过网络挖掘）是不合法的。在欧洲，如果数据库是纯数据，可能没有版权，但数据库权利可能存在，因此数据挖掘受数据库指令保护的知识产权所有者的权利约束。《哈格里夫斯评论》（Hargreaves review）指出，这使得英国政府在2014年修订了版权法，允许将内容挖掘作为一种限制和例外。英国是继日本之后世界上第二个这样做的国家，日本在2009年把数据挖掘作为一个特例。然而，由于信息社会指令（2001年）的限制，英国是例外情况只允许非商业目的的内容挖掘。英国版权法也不允许合同条款和条件推翻这一规定。

--[[用户:Zengsihang|Zengsihang]]（[[用户讨论:Zengsihang|讨论]]）【审校】将“英国是例外情况但是只允许给商业目的的内容挖掘”改为“英国对于内容挖掘的例外只允许非商业目的的内容挖掘”

第445行：第476行：

2013年，欧盟委员会以“欧洲许可证”为题，推动了利益相关者对文本和数据挖掘的讨论。但他们将重点放在解决这一法律问题上，如许可证而不是限制和例外，导致大学、研究人员、图书馆、民间社会团体和开放获取出版商的代表于2013年5月离开了利益相关者对话。

+

--[[用户:Thingamabob|Thingamabob]]（[[用户讨论:Thingamabob|讨论]]）【审校】“如许可证而不是限制和例外，导致大学、研究人员、图书馆、民间社会团体和开放获取出版商的代表于2013年5月离开了利益相关者对话。”改为“比如如何许可它而不是如何限制它或者把它作为一个例外，这使得大学、研究人员、图书馆、民间社会团体和开放获取出版商的代表等利益相关者于2013年5月结束了讨论。”

===美国 Situation in the United States===

Thingamabob

143

个编辑
