Data mining

From 集智百科 - complex systems | artificial intelligence | complexity science | complex networks | self-organization
Revision as of 12:18, 17 September 2020 (Thursday), by 不是海绵宝宝

This entry was translated and compiled by 许菁 and reviewed by Zengsihang.

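Data mining is the process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.[1][2][3][4] Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process. Aside from the analysis step itself, it also involves database and data management aspects, including data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.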



The term "data mining" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself.[5] It is also a buzzword[6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java[7] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[8] Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.


The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, although they do belong to the overall KDD process as additional steps.

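As one illustration of the "unusual records" task mentioned above, the following minimal sketch flags outliers with a modified z-score based on the median absolute deviation. The function name `find_outliers`, the sample `readings`, and the conventional constants 0.6745 and 3.5 are assumptions chosen for the example; they are not taken from this article, and real anomaly detection methods vary widely.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    Hypothetical helper for illustration only: the score is
    0.6745 * |x - median| / MAD, a common robust rule of thumb.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical; nothing can be flagged.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Made-up sensor readings with one obviously unusual record.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0]
outliers = find_outliers(readings)
print(outliers)
```

A median-based score is used here rather than mean and standard deviation because a single extreme record would otherwise inflate the very statistics used to detect it.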

The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.[9]



The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.



Etymology

In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" was used in a similarly critical way by economist Michael Lovell in an article published in the Review of Economic Studies in 1983.[10][11] Lovell indicates that the practice "masquerades under a variety of aliases, ranging from 'experimentation' (positive) to 'fishing' or 'snooping' (negative)".


The term data mining appeared around 1990 in the database community, generally with positive connotations. For a short time in the 1980s the phrase "database mining"™ was used, but because it had been trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation,[12] researchers turned to data mining instead. Other terms used include data archaeology, information harvesting, information discovery, and knowledge extraction. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the topic (KDD-1989, http://www.kdnuggets.com/meetings/kdd89/), and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities.[13] Currently, the terms data mining and knowledge discovery are used interchangeably.


In the academic community, the major forums for research started in 1995, when the First International Conference on Data Mining and Knowledge Discovery (KDD-95) was held in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the Kluwer journal Data Mining and Knowledge Discovery as its founding editor-in-chief. Later he started the SIGKDD newsletter, SIGKDD Explorations.[14] The KDD international conference became the primary conference in data mining, with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field.




Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, especially in the field of machine learning, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[15] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.


Process

The knowledge discovery in databases (KDD) process is commonly defined with the stages:



  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data mining
  5. Interpretation/evaluation.[16]


Many variations on this theme exist, however, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases:



  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment


or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.



Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners.[17] The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[18] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[19]


Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.

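The cleaning step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout, the `clean` helper, and the plausibility range used to decide what counts as "noise" are all hypothetical, and real pipelines often impute or correct values rather than dropping rows.

```python
def clean(records, required_fields, valid_range):
    """Drop records with missing fields or out-of-range (noisy) values.

    Hypothetical helper: assumes every field named in valid_range is
    also listed in required_fields, so it is present and not None.
    """
    cleaned = []
    for rec in records:
        if any(rec.get(f) is None for f in required_fields):
            continue  # observation with missing data
        if any(not (lo <= rec[f] <= hi) for f, (lo, hi) in valid_range.items()):
            continue  # observation treated as noise
        cleaned.append(rec)
    return cleaned


# Made-up target data set: one missing age, one implausible age.
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": 61000},
    {"id": 4, "age": 412, "income": 39000},
]
target = clean(raw, ["age", "income"], {"age": (0, 120)})
print([rec["id"] for rec in target])
```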

Data mining

Data mining involves six common classes of tasks:[16]



  • Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
  • Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
  • Clustering – The task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
  • Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
  • Regression – Attempts to find a function that models the data with the least error, that is, for estimating the relationships among data or datasets.
  • Summarization – Providing a more compact representation of the data set, including visualization and report generation.
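To make the market basket example from the list above concrete, here is a minimal sketch that counts item pairs and derives simple rules with their support and confidence. The `pair_rules` function, the baskets, and the `min_support` value are invented for illustration; this is a brute-force pair count, not a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) tuples.

    Hypothetical helper: only considers rules between two single items.
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = count(a and b) / count(a)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Made-up supermarket transactions.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket containing butter also contains bread, so the rule "butter -> bread" has confidence 1.0, while "bread -> butter" is weaker because bread also appears without butter.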

Results validation

Figure (Spurious correlations - spelling bee spiders.svg): An example of data produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the winning word of a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.

Data mining can unintentionally be misused, and can then produce results that appear to be significant but that do not actually predict future behavior, cannot be reproduced on a new sample of data, and are of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split (when applicable at all) may not be sufficient to prevent this from happening.[20]

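The effect of investigating too many hypotheses can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a description of any particular study: it compares one random "target" series against 1000 equally random, unrelated candidate series and reports the strongest correlation found. The series length, the number of candidates, and the seed are arbitrary choices.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Ten observations of pure noise play the role of the quantity of interest.
target = [rng.gauss(0, 1) for _ in range(10)]

# 1000 unrelated noise series play the role of the hypotheses examined.
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

best = max(abs(pearson(c, target)) for c in candidates)
print(f"strongest |r| among 1000 unrelated series: {best:.2f}")
```

Because every series is noise, any strong correlation reported here is spurious by construction; with this many candidates and so few observations, a seemingly impressive match is almost guaranteed.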


The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.

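The spam example above can be sketched as a toy experiment. Everything in it is hypothetical: the e-mails are invented, and the "learning algorithm" is a deliberately naive rule (flag any message containing a word seen only in spam during training) chosen because it overfits visibly. It is not a realistic spam filter.

```python
def train(emails):
    """Learn the set of words seen in spam but never in legitimate mail."""
    spam_words, legit_words = set(), set()
    for text, label in emails:
        words = text.lower().split()
        (spam_words if label == "spam" else legit_words).update(words)
    return spam_words - legit_words


def classify(model, text):
    """Label a message as spam if it contains any learned spam-only word."""
    return "spam" if model & set(text.lower().split()) else "legitimate"


def accuracy(model, emails):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(classify(model, text) == label for text, label in emails)
    return correct / len(emails)


training_set = [
    ("win a free prize now", "spam"),
    ("free money click here", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on friday", "legitimate"),
]
# Held-out messages that the model never saw during training.
test_set = [
    ("claim your free prize", "spam"),
    ("agenda for the friday meeting", "legitimate"),
    ("cheap loans available", "spam"),
    ("click to see monday photos", "legitimate"),
]

model = train(training_set)
train_acc = accuracy(model, training_set)
test_acc = accuracy(model, test_set)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two figures is the point of the exercise: the learned word list fits the training messages perfectly but generalizes poorly, which only becomes visible on data the algorithm was not trained on.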

If the learned patterns do not meet the desired standards, then it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.


Research

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[21][22] Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings,[23] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[24]



Computer science conferences on data mining include:



  • CIKM Conference – ACM Conference on Information and Knowledge Management
  • European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
  • KDD Conference – ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Data mining topics are also present at many data management/database conferences such as the ICDE Conference, SIGMOD Conference and International Conference on Very Large Data Bases.


Standards

There have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.


For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.[25]


Notable uses



Data mining is used wherever there is digital data available today. Notable examples of data mining can be found throughout business, medicine, science, and surveillance.

Data mining is used wherever there is digital data available today. Notable examples of data mining can be found throughout business, medicine, science, and surveillance.

数据挖掘在任何有数字数据可用的地方都可以被使用。数据挖掘的著名例子可以在商业、医学、科学和监控领域找到。

 --Thingamabob讨论) 【审校】 “数据挖掘的著名例子可以在商业、医学、科学和监控领域找到。”改为“在商业、医学、科学和监管领域都有数据挖掘的主要应用”

Privacy concerns and ethics

While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).[26]


The ways in which data mining can be used can, in some cases and contexts, raise questions regarding privacy, legality, and ethics.[27] In particular, data mining of government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[28][29]


Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data (possibly from various sources) in a way that facilitates analysis, but that also might make private, individual-level data deducible or otherwise apparent.[30] This is not data mining per se, but a result of preparing the data before, and for the purposes of, the analysis. The threat to an individual's privacy arises when the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.[31][32][33]
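The aggregation risk described above can be sketched in a few lines of Python. The two data sets, their field names and the people in them are entirely hypothetical. The sketch joins a record set with names removed to a public list on shared attributes and reports every record that matches exactly one named person.

```python
from collections import defaultdict

# Hypothetical records with names removed but other attributes kept.
medical = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public list that carries names alongside the same attributes.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_year": 1980, "sex": "M"},
]

# Attributes shared by both sources and used as the join key.
SHARED_FIELDS = ("zip", "birth_year", "sex")


def join_key(record):
    """Build the tuple of shared attribute values for one record."""
    return tuple(record[field] for field in SHARED_FIELDS)


def link_records(anonymous_records, named_records):
    """Return (name, diagnosis) pairs where the join key matches one person only."""
    names_by_key = defaultdict(list)
    for person in named_records:
        names_by_key[join_key(person)].append(person["name"])

    linked = []
    for record in anonymous_records:
        candidates = names_by_key.get(join_key(record), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            linked.append((candidates[0], record["diagnosis"]))
    return linked


linked = link_records(medical, public)
print(linked)  # [('Alice Example', 'asthma')]
```

Neither source reveals a named diagnosis on its own; the exposure appears only after the two are combined. The two records that share a key with two named people stay ambiguous.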

It is recommended to be aware of the following before data are collected:[30]

  • The purpose of the data collection and any (known) data mining projects;
  • How the data will be used;
  • Who will be able to mine the data and use the data and their derivatives;
  • The status of security surrounding access to the data;
  • How collected data can be updated.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[30] However, even "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[34]
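One rough way to see why an "anonymized" data set may still identify people is to count how many records share each combination of the remaining attributes. The Python sketch below uses hypothetical records and field names. The check is in the spirit of k-anonymity, a concept the text above does not name and which is introduced here only for illustration.

```python
from collections import Counter

# Hypothetical records with names already removed.
records = [
    {"age_band": "30-39", "region": "North", "city": "A"},
    {"age_band": "30-39", "region": "North", "city": "B"},
    {"age_band": "40-49", "region": "South", "city": "C"},
    {"age_band": "40-49", "region": "South", "city": "C"},
]


def smallest_group(rows, fields):
    """Size of the smallest group of rows sharing the same values for `fields`."""
    counts = Counter(tuple(row[field] for field in fields) for row in rows)
    return min(counts.values())


# With coarse attributes, every record is shared by at least two people.
k_coarse = smallest_group(records, ("age_band", "region"))

# Adding one more attribute makes some records unique, hence identifiable.
k_fine = smallest_group(records, ("age_band", "region", "city"))

print(k_coarse, k_fine)  # 2 1
```

A smallest group of size 1 means at least one person is singled out by the released attributes alone, even though no name appears in the data.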

The inadvertent revelation of personally identifiable information that can be traced back to the individual who provided it violates Fair Information Practices. This indiscretion can cause financial, emotional, or bodily harm to the individual concerned. In one instance of privacy violation, customers of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies, which in turn provided the data to pharmaceutical companies.[35]


Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.-E.U. Safe Harbor Principles, developed between 1998 and 2000, currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion of revoking this agreement, in particular because the data will be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States have failed.[36]

Situation in the United States

In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals."[37] This underscores the necessity for data anonymity in data aggregation and mining practices.


U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.

Copyright law

Situation in Europe

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, there may be no copyright, but database rights may exist, so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, this led the UK government to amend its copyright law in 2014 to allow content mining as a limitation and exception.[38] The UK was the second country in the world to do so after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.

The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.[39] The focus on licensing as the solution to this legal issue, rather than on limitations and exceptions, led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.[40]

Situation in the United States

US copyright law, and in particular its provision for fair use, upholds the legality of content mining in the United States and in other fair use countries such as Israel, Taiwan and South Korea. As content mining is transformative, that is, it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement, the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed, one being text and data mining.[41]

Software

Free open-source data mining software and applications

The following applications are available under free/open-source licenses. Public access to application source code is also available.


  • Carrot2: Text and search results clustering framework.
  • Chemicalize.org: A chemical structure miner and web search engine.
  • ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
  • KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
  • Massive Online Analysis (MOA): A tool for real-time big data stream mining with concept drift, written in the Java programming language.
  • MEPX: A cross-platform tool for regression and classification problems based on a Genetic Programming variant.
  • ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
  • mlpack: A collection of ready-to-use machine learning algorithms written in the C++ language.
  • NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
  • Orange: A component-based data mining and machine learning software suite written in the Python language.
  • R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
  • scikit-learn: An open-source machine learning library for the Python programming language.
  • UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.
  • Weka: A suite of machine learning software applications written in the Java programming language.


Proprietary data-mining software and applications

The following applications are available under proprietary licenses.


  • Angoss KnowledgeSTUDIO: A data mining tool.
  • LIONsolver: An integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
  • Megaputer Intelligence: Data and text mining software called PolyAnalyst.
  • NetOwl: A suite of multilingual text and entity analytics products that enable data mining.
  • PSeven: A platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining, provided by DATADVANCE.
  • Qlucore Omics Explorer: Data mining software.
  • RapidMiner: An environment for machine learning and data mining experiments.
  • SPSS Modeler: Data mining software provided by IBM.
  • STATISTICA Data Miner: Data mining software provided by StatSoft.
  • Tanagra: Visualisation-oriented data mining software, also for teaching.


References

  1. "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2014-01-27.
  2. Clifton, Christopher (2010). "Encyclopædia Britannica: Definition of Data Mining". Retrieved 2010-12-09.
  3. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction". Archived from the original on 2009-11-10. Retrieved 2012-08-07.
  4. Han, Jiawei; Kamber, Micheline; Pei, Jian (June 9, 2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann. ISBN 978-0-12-381479-1.
  5. Han, Jiawei; Kamber, Micheline (2001). Data mining: concepts and techniques. Morgan Kaufmann. p. 5. ISBN 978-1-55860-489-6. "Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long" 
  6. OKAIRP 2005 Fall Conference, Arizona State University. Archived at the Internet Archive, 2014-02-01.
  7. Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0. 
  8. Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA Experiences with a Java open-source project". Journal of Machine Learning Research. 11: 2533–2541. the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons.
  9. Olson, D. L. (2007). Data mining in business services. Service Business, 1(3), 181-193. doi:10.1007/s11628-006-0014-7
  10. Lovell, Michael C. (1983). "Data Mining". The Review of Economics and Statistics. 65 (1): 1–12. doi:10.2307/1924403. JSTOR 1924403.
  11. Charemza, Wojciech W.; Deadman, Derek F. (1992). "Data Mining". New Directions in Econometric Practice. Aldershot: Edward Elgar. pp. 14–31. ISBN 1-85278-461-X. 
  12. Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, and Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4. 
  13. Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.
  14. Fayyad, Usama (15 June 1999). "First Editorial by Editor-in-Chief". SIGKDD Explorations. 13 (1): 102. doi:10.1145/2207243.2207269. Retrieved 27 December 2010.
  15. Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 978-0-471-22852-3. OCLC 50055336. https://archive.org/details/dataminingconcep0000kant. 
  16. Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases" (PDF). Retrieved 17 December 2008.
  17. Gregory Piatetsky-Shapiro (2002) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2004) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2007) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2014) KDnuggets Methodology Poll
  18. Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. Volume 21 Issue 1, March 2006, pp 1–24, Cambridge University Press, New York, NY, USA
  19. Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. Archived at the Internet Archive, 2013-01-09. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.
  20. Hawkins, Douglas M (2004). "The problem of overfitting". Journal of Chemical Information and Computer Sciences. 44 (1): 1–12. doi:10.1021/ci0342472. PMID 14741005.
  21. "Microsoft Academic Search: Top conferences in data mining". Microsoft Academic Search.
  22. "Google Scholar: Top publications - Data Mining & Analysis". Google Scholar.
  23. Proceedings (archived at the Internet Archive, 2010-04-30), International Conferences on Knowledge Discovery and Data Mining, ACM, New York.
  24. SIGKDD Explorations, ACM, New York.
  25. Günnemann, Stephan; Kremer, Hardy; Seidl, Thomas (2011). "An extension of the PMML standard to subspace clustering models". Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11. pp. 48. doi:10.1145/2023598.2023605. ISBN 978-1-4503-0837-3. 
  26. Seltzer, William (2005). "The Promise and Pitfalls of Data Mining: Ethical Issues" (PDF). ASA Section on Government Statistics. American Statistical Association.
  27. Pitts, Chip (15 March 2007). "The End of Illegal Domestic Spying? Don't Count on It". Washington Spectator. Archived from the original on 2007-11-28.
  28. Taipale, Kim A. (15 December 2003). "Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data". Columbia Science and Technology Law Review. 5 (2). OCLC 45263753. SSRN 546782.
  29. Resig, John. "A Framework for Mining Instant Messaging Services" (PDF). Retrieved 16 March 2018.
  30. Think Before You Dig: Privacy Implications of Data Mining & Aggregation (archived at the Internet Archive, 2008-12-17), NASCIO Research Brief, September 2004
  31. Ohm, Paul. "Don't Build a Database of Ruin". Harvard Business Review.
  32. Darwin Bond-Graham, Iron Cagebook - The Logical End of Facebook's Patents, Counterpunch.org, 2013.12.03
  33. Darwin Bond-Graham, Inside the Tech industry's Startup Conference, Counterpunch.org, 2013.09.11
  34. AOL search data identified individuals, SecurityFocus, August 2006
  35. Kshetri, Nir (2014). "Big data's impact on privacy, security and consumer welfare" (PDF). Telecommunications Policy. 38 (11): 1134–1145. doi:10.1016/j.telpol.2014.10.002.
  36. Weiss, Martin A.; Archick, Kristin (19 May 2016). "U.S.-E.U. Data Privacy: From Safe Harbor to Privacy Shield" (PDF). Washington, D.C. Congressional Research Service. p. 6. R44257. Retrieved 9 April 2020. On October 6, 2015, the CJEU ... issued a decision that invalidated Safe Harbor (effective immediately), as currently implemented.
  37. Biotech Business Week Editors (June 30, 2008); BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical Research, Biotech Business Week, retrieved 17 November 2009 from LexisNexis Academic
  38. UK Researchers Given Data Mining Right Under New UK Copyright Laws. Archived at the Internet Archive, June 9, 2014. Out-Law.com. Retrieved 14 November 2014
  39. "Licences for Europe - Structured Stakeholder Dialogue 2013". European Commission. Retrieved 14 November 2014.
  40. "Text and Data Mining:Its importance and the need for change in Europe". Association of European Research Libraries. Retrieved 14 November 2014.
  41. "Judge grants summary judgment in favor of Google Books — a fair use victory". Lexology.com. Antonelli Law Ltd. Retrieved 14 November 2014.


This page was moved from wikipedia:en:Data mining. Its edit history can be viewed at 数据挖掘/edithistory