Changes

Data mining (view source)

Revision as of 22:41, 13 September 2020 (Sunday)

3,083 bytes added, 22:41, 13 September 2020 (Sunday)

No edit summary

Line 9: Line 9:

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is the analysis step of the "knowledge discovery in databases" process or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

−

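'''Data mining''' is the process of discovering patterns in large data sets, using methods that combine machine learning, statistics, and database systems. Data mining is the analysis step of the "discovering knowledge in databases (knowledge discovery in databases, KDD)" process. Beyond the raw analysis step, it also covers database and data management aspects, including data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.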

+

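'''Data mining''' is the process of discovering patterns in large data sets, using methods that combine machine learning, statistics, and database systems. Data mining is the analysis step of the "knowledge discovery in databases KDD" process. Beyond the raw analysis step, it also covers database and data management aspects, including data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.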

−

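--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) ~~[Review] In the sentence “数据挖掘是指“在数据库中知识发现KDD”的过程的分析步骤” ("data mining is the analysis step of the 'knowledge discovery in databases KDD' process"), change “在数据库中知识发现KDD” to “在数据库中发现知识~~(knowledge discovery in databases,KDD)”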

+

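--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] In the sentence “数据挖掘是指“数据库中的知识发现KDD”的过程的分析步骤” ("data mining is the analysis step of the 'knowledge discovery in databases KDD' process"), change “在数据库中知识发现KDD” to “数据库中的知识发现(knowledge discovery in databases,KDD)”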

Line 18: Line 18:

The term "data mining" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book ''Data mining: Practical machine learning tools and techniques with Java'' (which covers mostly machine learning material) was originally to be named just ''Practical machine learning'', and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

−

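The term "data mining" is not really apt, because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself. It is also a buzzword, frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of a '''computer decision support system (DSS)''', including artificial intelligence (e.g., machine learning) and business intelligence. The book ''Data mining: Practical machine learning tools and techniques with Java'' (which covers mostly machine learning material) was originally to be named ''Practical machine learning'', and the term data mining was added only for marketing reasons. Often, more general terms such as (large-scale) data analysis, or the actual methods such as artificial intelligence and machine learning, are the more appropriate expressions.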

+

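The term "data mining" is not really apt, because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself. It is also a buzzword, frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of a '''computer decision support system (DSS)''', including artificial intelligence (e.g., machine learning) and business intelligence. The book ''Data mining: Practical machine learning tools and techniques with Java'' (which covers mostly machine learning material) was originally to be named ''Practical machine learning'', and the term data mining was added only for marketing reasons. Often the more general terms (large-scale) data analysis and analytics, or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate.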

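--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change the sentence “经常更一般的术语例如（大规模）数据分析和分析——或当提到实际的方法时使用人工智能和机器学习这样的词语更加合适” to “经常来说，更一般的术语如（大规模）数据分析，或实际的方法如人工智能和机器学习，是更合适的表达方式” ("Often, more general terms such as (large-scale) data analysis, or the actual methods such as artificial intelligence and machine learning, are the more appropriate expressions")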

Line 44: Line 44:

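The related terms '''data dredging''', "data fishing", and "data snooping" refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used to generate new hypotheses to test against larger data populations.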

+

--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] In “使用数据挖掘方法对较大的人口数据集中的一部分进行抽样”, change “较大的人口数据集” ("larger data set of the human population") to “较大规模的数据集” ("larger-scale data set")

==Etymology==

Line 61: Line 62:

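The term "data mining" appeared around 1990 in the database community, generally with positive connotations. For a short time in the 1980s the phrase "database mining" was used, but because it was trademarked by HNC, a San Diego-based company, researchers turned to "data mining". Other terms that have been used include data archaeology, information harvesting, information discovery, and knowledge extraction. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" (KDD) for the first workshop on the topic [http://www.kdnuggets.com/meetings/kdd89/ (KDD-1989)]. The term then became more popular in the AI and machine learning communities, while "data mining" became more popular in the business and press communities. Currently, the terms data mining and knowledge discovery are used interchangeably.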

+

--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] In “但由于这个词被总部位于圣地亚哥的 HNC 公司注册为商标”, change “总部位于圣地亚哥的HNC公司” ("HNC, a company headquartered in San Diego") to “圣地亚哥的HNC公司” ("the San Diego company HNC")

+

--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] In “这个术语在人工智能和机器学习社区中变得更加流行”, change “社区” ("community", in the sense of a neighborhood) to “群体” ("group")

In the academic community, the major forums for research began in 1995, when the First International Conference on Data Mining and Knowledge Discovery ([[KDD-95]]) was held in Montreal under [[AAAI]] sponsorship. It was co-chaired by [[Usama Fayyad]] and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the Kluwer journal [[Data Mining and Knowledge Discovery]] as its founding editor-in-chief. Later he started the [[SIGKDD]] Newsletter SIGKDD Explorations.<ref name=SIGKDD-explorations>{{cite journal|last1=Fayyad|first1=Usama|title=First Editorial by Editor-in-Chief|journal=SIGKDD Explorations|date=15 June 1999|volume=13|issue=1|pages=102|doi=10.1145/2207243.2207269|url=http://www.kdd.org/explorations/view/june-1999-volume-1-issue-1|accessdate=27 December 2010|ref=SIGKDD-explorations}}</ref> The KDD International conference became the primary, highest-quality conference in data mining, with an acceptance rate for research paper submissions below 18%. The journal ''Data Mining and Knowledge Discovery'' is the primary research journal of the field.

Line 76: Line 80:

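The manual extraction of patterns from data has gone on for centuries. Early methods of identifying patterns in data include '''Bayes' theorem''' (18th century) and '''regression analysis''' (19th century). The proliferation, ubiquity, and increasing power of computer technology have dramatically increased the ability to collect, store, and manipulate data. As data sets have grown in size and complexity, manual data analysis has increasingly been replaced by more powerful indirect, automated data processing, aided by advances in other areas of computer science, especially in machine learning, such as '''neural networks''', '''cluster analysis''', and '''genetic algorithms''' (1950s), '''decision trees''' and '''decision rules''' (1960s), and '''support vector machines''' (1990s). Data mining is the process of applying these methods to uncover hidden patterns in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to larger data sets.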

+

--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] In “手动分析数据的方法越来越多地被更强的间接、自动化的数据处理所取代”, change “手动分析数据” ("manual data analysis") to “直接、手动的分析数据” ("direct, hands-on data analysis")

==Process==

Line 177: Line 183:

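In these polls, the only other data mining standard named was SEMMA. However, 3 to 4 times as many people reported using CRISP-DM. Several research teams have published reviews of data mining process models; for example, Azevedo and Santos compared the CRISP-DM and SEMMA process standards in 2008.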

+

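--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Add at the beginning: “2002、2004、2007、2014年的调查显示，CRISP-DM标准是数据挖掘者最常用的标准” ("Polls conducted in 2002, 2004, 2007, and 2014 show that CRISP-DM is the standard most commonly used by data miners")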

===Pre-processing===

Line 201: Line 209:

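'''Anomaly detection''' (outlier/change/deviation detection): the identification of unusual data records, which may be interesting or may be data errors that require further investigation.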

+

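--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “发现可能是有趣的或需要进一步调查的数据错误” ("finding data errors that may be interesting or need further investigation") to “这可能是有趣的信息或需要进一步调查的数据错误” ("these may be interesting information or data errors that need further investigation")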
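The anomaly detection task described above can be illustrated with a minimal sketch. It assumes a simple z-score rule, which is only one of many possible detection methods; the function name `zscore_outliers`, the threshold of 2.0, and the sample readings are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose |z-score| exceeds the threshold.

    Assumption: a z-score rule is used purely for illustration; real
    systems may prefer robust or model-based detectors.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious record.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
outliers = zscore_outliers(readings)
print(outliers)
```

A flagged record is only a candidate: whether it is an interesting event or a data error still has to be decided by further investigation.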

* [[Association rule learning]] (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
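As a rough illustration of the market basket idea, the sketch below counts how often pairs of products appear together and derives support and confidence for each pair. It is a brute-force pair count under assumed data, not a full association rule algorithm such as Apriori; `pair_rules`, the 0.4 support cutoff, and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) tuples
    for every item pair that meets the minimum support."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = count(a and b) / count(a)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical supermarket transactions.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket that contains milk also contains butter, so the rule milk -> butter has confidence 1.0 even though its support is only 0.4.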

Line 213: Line 223:

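'''Classification''': the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".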

+

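--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “是将已知结构归纳为新数据的任务” ("the task of generalizing known structure into new data") to “是归纳已知结构并应用于新数据的任务” ("the task of generalizing known structure and applying it to new data")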
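The e-mail example can be sketched with a very small classifier. The code below assumes a multinomial naive Bayes model with add-one smoothing, which is one common choice and not necessarily what any particular e-mail program uses; the function names and the four training messages are hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """Learn per-label word counts from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total)  # prior
        denom = sum(counts.values()) + len(vocab)
        for word in text.lower().split():
            # add-one smoothing keeps unseen words from zeroing the score
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled messages (the "known structure").
examples = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("please review the project report", "legitimate"),
]
word_counts, label_counts = train(examples)
print(classify("claim your free prize", word_counts, label_counts))
print(classify("project meeting on monday", word_counts, label_counts))
```

The structure learned from the labeled messages is then applied to messages the model has not seen, which is the generalization step the definition refers to.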

* [[Regression analysis|Regression]] – attempts to find a function that models the data with the least error, that is, for estimating the relationships among data or datasets.
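One concrete reading of "a function that models the data with the least error" is an ordinary least-squares line. The sketch below assumes a single predictor and a straight-line model, which is the simplest case; `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements that lie close to y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Here "least error" means the smallest sum of squared vertical distances between the points and the line; other error measures would give a different fit.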

Line 221: Line 233:

'''Automatic summarization''': providing a more compact, concise representation of the data set, including visualization and report generation.

+

--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “自动文摘 Automatic summarizatio” to “总结 Summarization”

===Results validation===

Line 293: Line 307:

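Some standards have been defined for the data mining process, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development of successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.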

−

+

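--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “为数据挖掘过程定义了一些标准” ("some standards have been defined for the data mining process") to “人们曾努力为数据挖掘过程定义标准” ("there have been efforts to define standards for the data mining process")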

For exchanging the extracted models – in particular for use in [[predictive analytics]] – the key standard is the [[Predictive Model Markup Language]] (PMML), which is an [[XML]]-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) [[subspace clustering]] have been proposed independently of the DMG.<ref>{{Cite book | last1 = Günnemann | first1 = Stephan | last2 = Kremer | first2 = Hardy | last3 = Seidl | first3 = Thomas | doi = 10.1145/2023598.2023605 | chapter = An extension of the PMML standard to subspace clustering models | title = Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11 | pages = 48 | year = 2011 | isbn = 978-1-4503-0837-3 | pmid = | pmc = }}</ref>

Line 337: Line 351:

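Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through '''data aggregation'''. Data aggregation involves combining data (possibly from various sources) in a way that facilitates analysis, but that may also make private, individual-level data deducible or otherwise apparent. This is not data mining per se, but a result of preparing data before, and for the purposes of, the analysis. The threat to an individual's privacy comes into play when the compiled data allow the data miner, or anyone with access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.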

−

+

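--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “对个人隐私的威胁就开始发挥作用了” ("the threat to personal privacy comes into play") to “就会产生对个人隐私的威胁” ("a threat to personal privacy arises")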
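How aggregation can make anonymous records identifiable may be easier to see in a small sketch. It assumes that an "anonymized" data set and a public list share a few quasi-identifying fields; the field names (`zip`, `birth_year`, `sex`), the function `link_records`, and every record below are hypothetical and invented for illustration.

```python
def link_records(anonymized, public, keys=("zip", "birth_year", "sex")):
    """Map each anonymized record index to a name when exactly one
    public record shares all of the quasi-identifier fields."""
    index = {}
    for person in public:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person["name"])

    matches = {}
    for i, record in enumerate(anonymized):
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches[i] = names[0]
    return matches


# Hypothetical data: names were removed from the first set only.
anonymized = [
    {"zip": "02138", "birth_year": 1960, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
]
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1960, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1985, "sex": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_year": 1985, "sex": "M"},
]
matches = link_records(anonymized, public)
print(matches)
```

Neither data set reveals a named diagnosis on its own; the exposure arises only once they are combined, and only for records whose combination of fields is unique.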

It is recommended{{whom|date=August 2019}} to be aware of the following '''before''' data are collected:<ref name="NASCIO" />

Line 371: Line 385:

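Data may also be modified so as to become anonymous, so that individuals cannot readily be identified. However, even "anonymized" data sets can contain enough information to identify individuals, as happened when journalists were able to find several individuals based on a set of search histories that AOL inadvertently released.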

+

--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “这样个人就不容易被修改了确定” (garbled) to “这样个人就不会轻易的被识别” ("so that individuals are not easily identified")

+

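--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “就像记者能够根据一组无意中搜索历史找到几个个人一样美国在线发布” (garbled) to “就像记者能够依据‘美国在线’无意中发布的用户历史记录找到一些个人” ("as when journalists were able to find some individuals from user histories inadvertently released by AOL")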

The inadvertent revelation of [[personally identifiable information]] leading to the provider violates Fair Information Practices. This indiscretion can cause financial, emotional, or bodily harm to the indicated individual. In one instance of [[privacy violation]], the patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies who in turn provided the data to pharmaceutical companies.<ref>{{Cite journal|title = Big data׳s impact on privacy, security and consumer welfare|journal = Telecommunications Policy|pages = 1134–1145|volume = 38|issue = 11|doi = 10.1016/j.telpol.2014.10.002|first = Nir|last = Kshetri|year = 2014|url = http://libres.uncg.edu/ir/uncg/f/N_Kshetri_Big_2014.pdf}}</ref>

Line 392: Line 409:

===Situation in the United States===

−

In the United States, privacy concerns have been addressed by the [[US Congress]] via the passage of regulatory controls such as the [[Health Insurance Portability and Accountability Act]] (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in ''Biotech Business Week'', "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals."<ref>Biotech Business Week Editors (June 30, 2008); ''BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical Research'', Biotech Business Week, retrieved 17 November 2009 from LexisNexis Academic</ref> This underscores the necessity for data anonymity in data aggregation and mining practices.

Line 422: Line 437:

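Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, there may be no copyright, but database rights may exist, so data mining becomes subject to the rights of intellectual property owners that are protected by the Database Directive. On the recommendation of the Hargreaves review, this led the UK government to amend its copyright law in 2014 to allow content mining as a limitation and exception. The UK was the second country in the world to do so after Japan, which introduced an exception for data mining in 2009. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.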

+

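--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “英国是例外情况但是只允许给商业目的的内容挖掘” to “英国对于内容挖掘的例外只允许非商业目的的内容挖掘” ("the UK exception for content mining only allows content mining for non-commercial purposes")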

The [[European Commission]] facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.<ref>{{cite web|title=Licences for Europe - Structured Stakeholder Dialogue 2013|url=http://ec.europa.eu/licences-for-europe-dialogue/en/content/about-site|website=European Commission|accessdate=14 November 2014}}</ref> The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and [[open access]] publishers to leave the stakeholder dialogue in May 2013.<ref>{{cite web|title=Text and Data Mining:Its importance and the need for change in Europe|url=http://libereurope.eu/news/text-and-data-mining-its-importance-and-the-need-for-change-in-europe/|website=Association of European Research Libraries|accessdate=14 November 2014}}</ref>

Line 439: Line 455:

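US copyright law, and in particular its provision for fair use, upholds the legality of content mining in the United States and in other fair use countries such as Israel, Taiwan, and South Korea. As content mining is transformative, that is, it does not supplant the original work, it is viewed as lawful under fair use. For example, as part of the Google Book settlement, the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed, one being text and data mining.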

+

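--[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Change “台湾和韩国采矿内容的合法性” (which renders "mining" as physical ore mining) to “台湾和韩国内容挖掘的合法性” ("the legality of content mining in Taiwan and South Korea")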

==Software==

