Data mining


Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1] Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.[1][2][3][4] Data mining is the analysis step of the "knowledge discovery in databases" process or KDD.[5] Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[1]

The term "data mining" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself.[6] It is also a buzzword[7] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java[8] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[9] Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which a decision support system can then use to obtain more accurate prediction results. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall KDD process as additional steps.

The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.[10]

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Etymology

In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" was used in a similarly critical way by economist Michael Lovell in an article published in the Review of Economic Studies in 1983.[11][12] Lovell indicates that the practice "masquerades under a variety of aliases, ranging from 'experimentation' (positive) to 'fishing' or 'snooping' (negative)".

The term data mining appeared around 1990 in the database community, generally with positive connotations. For a short time in the 1980s the phrase "database mining"™ was used, but because it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation,[13] researchers turned to data mining instead. Other terms used include data archaeology, information harvesting, information discovery, and knowledge extraction. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the topic (KDD-1989, http://www.kdnuggets.com/meetings/kdd89/), and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities.[14] Currently, the terms data mining and knowledge discovery are used interchangeably.


In the academic community, the major forums for research started in 1995, when the First International Conference on Data Mining and Knowledge Discovery (KDD-95) was held in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the Kluwer journal Data Mining and Knowledge Discovery as its founding editor-in-chief. Later he started the SIGKDD newsletter, SIGKDD Explorations.[15] The KDD International conference became the premier conference in data mining, with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field.


Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, especially in the field of machine learning, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[16] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.

Process

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data mining
  5. Interpretation/evaluation.[5]


Many variations on this theme exist, however, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment


or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners.[17] The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[18] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[19]

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
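
The cleaning step described above can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed method: the record layout, the field names (`age`, `income`) and the plausibility ranges used to flag noisy values are all assumptions invented for the example.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},   # missing age
    {"age": 29, "income": None},        # missing income
    {"age": 430, "income": 51000.0},    # implausible age, treated as noise
    {"age": 41, "income": 61000.0},
]

# Assumed plausibility range per field (illustrative values only).
VALID_RANGES = {"age": (0, 120), "income": (0.0, 1e7)}


def is_clean(record):
    """Return True if every field is present and inside its assumed range."""
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            return False
    return True


def clean(data):
    """Drop observations with missing values or out-of-range (noisy) values."""
    return [r for r in data if is_clean(r)]


cleaned = clean(records)
print(len(records), "->", len(cleaned), "records after cleaning")
```

Dropping whole observations is the simplest policy and matches the description in the text; in practice one might instead impute missing values, which this sketch does not attempt.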

Data mining

Data mining involves six common classes of tasks:[5]

  • Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.

  • Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

  • Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

  • Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

  • Regression – Attempts to find a function that models the data with the least error, that is, to estimate the relationships among data or data sets.

  • Summarization – Providing a more compact representation of the data set, including visualization and report generation.
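
As a concrete illustration of the market basket example in the list above, the following sketch counts how often pairs of items are bought together and reports rules whose support and confidence exceed chosen thresholds. The baskets, the thresholds and the function name `pair_rules` are hypothetical, and the code considers only item pairs; it is not a full association rule miner such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def pair_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) over single-item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n              # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Support measures how common the item pair is overall, while confidence measures how often the right-hand item appears given the left-hand one; both thresholds are tuning choices rather than fixed values.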

Results validation

Figure (Spurious correlations - spelling bee spiders.svg): An example of data produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the winning word in a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.

Data mining can unintentionally be misused, producing results that appear to be significant but that do not actually predict future behavior, cannot be reproduced on a new sample of data, and are therefore of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split (when applicable at all) may not be sufficient to prevent this from happening.[20]
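
The effect of investigating too many hypotheses can be shown with a small simulation. The sketch below compares one random target series against many equally random, unrelated candidate series and reports the strongest correlation found. The sample size, the number of candidates and the seed are arbitrary assumptions; the point is only that a search over many hypotheses tends to turn up an impressive-looking correlation by chance.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)      # fixed seed so the run is repeatable
n_points = 10               # observations per series (assumed)
n_hypotheses = 1000         # number of unrelated candidate series (assumed)

# The target and every candidate are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_hypotheses)
]

correlations = [abs(pearson(c, target)) for c in candidates]
best = max(correlations)
typical = statistics.median(correlations)

print(f"median |r| over all candidates: {typical:.2f}")
print(f"strongest |r| found by searching: {best:.2f}")
```

Reporting only the strongest candidate, without correcting for the number of series examined, is the kind of misuse the paragraph above describes.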

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.
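
The spam example above can be sketched as a hold-out evaluation. The e-mails, the word-count scoring rule and the split point are all invented for illustration; the scoring rule is a deliberately naive stand-in for a real classifier, and only the train/test procedure is the point.

```python
from collections import Counter

# Hypothetical labelled e-mails: (text, label).
emails = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday", "legitimate"),
    ("project agenda attached", "legitimate"),
    # Held-out messages the model never sees during training.
    ("free cash prize", "spam"),
    ("monday project meeting", "legitimate"),
    ("win free pills", "spam"),
    ("agenda for lunch", "legitimate"),
    ("discount offer inside", "spam"),
]


def train(examples):
    """Count how often each word occurs in each class."""
    counts = {"spam": Counter(), "legitimate": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def predict(model, text):
    """Label a message 'spam' if its words were seen more often in spam."""
    score = sum(model["spam"][w] - model["legitimate"][w] for w in text.split())
    return "spam" if score > 0 else "legitimate"


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


train_set, test_set = emails[:6], emails[6:]
model = train(train_set)
train_accuracy = accuracy(model, train_set)
test_accuracy = accuracy(model, test_set)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")
```

In this toy run the model classifies every training message correctly but misses a held-out spam message made up entirely of unseen words, so the training accuracy overstates how well the learned patterns generalize.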

If the learned patterns do not meet the desired standards, it is then necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, the final step is to interpret the learned patterns and turn them into knowledge.

Research

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[21][22] Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings,[23] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[24]

Computer science conferences on data mining include:

  • CIKM Conference – ACM Conference on Information and Knowledge Management

  • European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

  • KDD Conference – ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Data mining topics are also present at many data management/database conferences, such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases.

Standards

There have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.[25]

Notable uses

Data mining is used wherever there is digital data available today. Notable examples of data mining can be found throughout business, medicine, science, and surveillance.

Privacy concerns and ethics

While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).[26]

The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics.[27] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[28][29]



Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data (possibly from various sources) in a way that facilitates analysis, but that also might make identification of private, individual-level data deducible or otherwise apparent.[30] This is not data mining per se, but a result of the preparation of data before, and for the purposes of, the analysis. The threat to an individual's privacy comes into play when the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.[31][32][33]
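
The linkage risk described above can be sketched in Python. In this minimal, hypothetical example, a data set without names is joined to an identified data set on a few shared attributes. Every record, name, and field below is invented for illustration, and `link_records` is an assumed helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical "anonymous" records: no names, but some descriptive attributes.
medical = [
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10002", "birth_year": 1980, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical identified records from a second source.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "10001", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "zip": "10003", "birth_year": 1990, "sex": "F"},
]

# Attributes present in both sources (assumed for this sketch).
KEYS = ("zip", "birth_year", "sex")


def link_records(anonymous, identified, keys=KEYS):
    """Return (name, record) pairs where exactly one identified person
    shares the given attribute values with an anonymous record."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for row in anonymous:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match singles out one individual
            matches.append((names[0], row))
    return matches


for name, row in link_records(medical, public):
    print(name, "->", row["diagnosis"])
# prints:
# Alice Example -> asthma
# Bob Example -> diabetes
```

Neither source reveals a named diagnosis on its own; the exposure appears only once the two are combined, which is why the risk is attributed to data preparation rather than to the analysis itself.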



It is recommended to be aware of the following before data are collected:[30]


  • The purpose of the data collection and any (known) data mining projects;
  • How the data will be used;
  • Who will be able to mine the data and use the data and their derivatives;
  • The status of security surrounding access to the data;
  • How collected data can be updated.


Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[30] However, even "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[34]
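
One rough way to see how readily individuals could be singled out is to count how many records share each combination of attributes, before and after the values are coarsened. The Python sketch below assumes hypothetical records and a hypothetical `generalize` step. It illustrates the idea only and does not guarantee anonymity; other fields, or other data sets, may still allow identification.

```python
from collections import Counter

# Hypothetical records; every value is invented for illustration.
records = [
    {"zip": "10001", "age": 34},
    {"zip": "10002", "age": 36},
    {"zip": "10003", "age": 52},
    {"zip": "10004", "age": 58},
]


def generalize(record):
    """Coarsen a record: keep a 3-digit ZIP prefix and a 10-year age band."""
    decade = record["age"] // 10 * 10
    return {"zip": record["zip"][:3] + "**", "age_band": f"{decade}s"}


def smallest_group(rows, keys):
    """Size of the smallest group of rows sharing the same values for keys.

    A result of 1 means at least one row is unique on those attributes.
    """
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())


print(smallest_group(records, ("zip", "age")))  # prints 1
coarse = [generalize(r) for r in records]
print(smallest_group(coarse, ("zip", "age_band")))  # prints 2
```

Coarsening trades analytical detail for larger groups. How much detail to give up is a judgment call that depends on what other data an attacker could combine with the release.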



The inadvertent revelation of personally identifiable information leading to the provider violates Fair Information Practices. This indiscretion can cause financial, emotional, or bodily harm to the indicated individual. In one instance of privacy violation, the patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies who in turn provided the data to pharmaceutical companies.[35]


Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.-E.U. Safe Harbor Principles, developed between 1998 and 2000, currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion of revoking this agreement, in particular because the data will be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States have failed.[36]



Situation in the United States

In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals."[37] This underscores the necessity for data anonymity in data aggregation and mining practices.



U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.



Copyright law

Situation in Europe

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, there may be no copyright, but database rights may exist, so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, the UK government amended its copyright law in 2014 to allow content mining as a limitation and exception.[38] The UK was the second country in the world to do so after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.



The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.[39] The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.[40]



Situation in the United States

US copyright law, and in particular its provision for fair use, upholds the legality of content mining in America and in other fair use countries such as Israel, Taiwan and South Korea. As content mining is transformative, that is, it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement, the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed, one being text and data mining.[41]


Software

Free open-source data mining software and applications

The following applications are available under free/open-source licenses. Public access to application source code is also available.



  • Carrot2: Text and search results clustering framework.
  • KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
  • MEPX: A cross-platform tool for regression and classification problems based on a Genetic Programming variant.
  • ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
  • mlpack: A collection of ready-to-use machine learning algorithms written in the C++ language.
  • scikit-learn: An open-source machine learning library for the Python programming language.
  • UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM.
  • Weka: A suite of machine learning software applications written in the Java programming language.


Proprietary data-mining software and applications

The following applications are available under proprietary licenses.



  • Angoss KnowledgeSTUDIO: data mining tool
  • LIONsolver: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
  • Megaputer Intelligence: its data and text mining software is called PolyAnalyst.
  • NetOwl: suite of multilingual text and entity analytics products that enable data mining.
  • PSeven: platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining provided by DATADVANCE.
  • Qlucore Omics Explorer: data mining software.
  • Tanagra: Visualisation-oriented data mining software, also for teaching.


References

  1. "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2014-01-27.
  2. Clifton, Christopher (2010). "Encyclopædia Britannica: Definition of Data Mining". Retrieved 2010-12-09.
  3. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction". Archived from the original on 2009-11-10. Retrieved 2012-08-07.
  4. Han, Jiawei; Kamber, Micheline; Pei, Jian (June 9, 2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann. ISBN 978-0-12-381479-1. 
  5. Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases" (PDF). Retrieved 17 December 2008.
  6. Han, Jiawei; Kamber, Micheline (2001). Data mining: concepts and techniques. Morgan Kaufmann. p. 5. ISBN 978-1-55860-489-6. "Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long" 
  7. OKAIRP 2005 Fall Conference, Arizona State University. Archived at the Internet Archive, 2014-02-01.
  8. Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0. 
  9. Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA Experiences with a Java open-source project". Journal of Machine Learning Research. 11: 2533–2541. the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons.
  10. Olson, D. L. (2007). Data mining in business services. Service Business, 1(3), 181-193. doi:10.1007/s11628-006-0014-7
  11. Lovell, Michael C. (1983). "Data Mining". The Review of Economics and Statistics. 65 (1): 1–12. doi:10.2307/1924403. JSTOR 1924403.
  12. Charemza, Wojciech W.; Deadman, Derek F. (1992). "Data Mining". New Directions in Econometric Practice. Aldershot: Edward Elgar. pp. 14–31. ISBN 1-85278-461-X. 
  13. Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, and Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4. 
  14. Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.
  15. Fayyad, Usama (15 June 1999). "First Editorial by Editor-in-Chief". SIGKDD Explorations. 13 (1): 102. doi:10.1145/2207243.2207269. Retrieved 27 December 2010.
  16. Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 978-0-471-22852-3. OCLC 50055336. https://archive.org/details/dataminingconcep0000kant. 
  17. Gregory Piatetsky-Shapiro (2002) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2004) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2007) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2014) KDnuggets Methodology Poll
  18. Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. Volume 21 Issue 1, March 2006, pp 1–24, Cambridge University Press, New York, NY, USA
  19. Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. Archived at the Internet Archive, 2013-01-09. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.
  20. Hawkins, Douglas M (2004). "The problem of overfitting". Journal of Chemical Information and Computer Sciences. 44 (1): 1–12. doi:10.1021/ci0342472. PMID 14741005.
  21. "Microsoft Academic Search: Top conferences in data mining". Microsoft Academic Search.
  22. "Google Scholar: Top publications - Data Mining & Analysis". Google Scholar.
  23. Proceedings, International Conferences on Knowledge Discovery and Data Mining, ACM, New York. Archived at the Internet Archive, 2010-04-30.
  24. SIGKDD Explorations, ACM, New York.
  25. Günnemann, Stephan; Kremer, Hardy; Seidl, Thomas (2011). "An extension of the PMML standard to subspace clustering models". Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11. pp. 48. doi:10.1145/2023598.2023605. ISBN 978-1-4503-0837-3. 
  26. Seltzer, William (2005). "The Promise and Pitfalls of Data Mining: Ethical Issues" (PDF). ASA Section on Government Statistics. American Statistical Association.
  27. Pitts, Chip (15 March 2007). "The End of Illegal Domestic Spying? Don't Count on It". Washington Spectator. Archived from the original on 2007-11-28.
  28. Taipale, Kim A. (15 December 2003). "Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data". Columbia Science and Technology Law Review. 5 (2). OCLC 45263753. SSRN 546782.
  29. Resig, John. "A Framework for Mining Instant Messaging Services" (PDF). Retrieved 16 March 2018.
  30. Think Before You Dig: Privacy Implications of Data Mining & Aggregation, NASCIO Research Brief, September 2004. Archived at the Internet Archive, 2008-12-17.
  31. Ohm, Paul. "Don't Build a Database of Ruin". Harvard Business Review.
  32. Darwin Bond-Graham, Iron Cagebook - The Logical End of Facebook's Patents, Counterpunch.org, 2013.12.03
  33. Darwin Bond-Graham, Inside the Tech industry's Startup Conference, Counterpunch.org, 2013.09.11
  34. AOL search data identified individuals, SecurityFocus, August 2006
  35. Kshetri, Nir (2014). "Big data's impact on privacy, security and consumer welfare" (PDF). Telecommunications Policy. 38 (11): 1134–1145. doi:10.1016/j.telpol.2014.10.002.
  36. Weiss, Martin A.; Archick, Kristin (19 May 2016). "U.S.-E.U. Data Privacy: From Safe Harbor to Privacy Shield" (PDF). Washington, D.C. Congressional Research Service. p. 6. R44257. Retrieved 9 April 2020. On October 6, 2015, the CJEU ... issued a decision that invalidated Safe Harbor (effective immediately), as currently implemented.
  37. Biotech Business Week Editors (June 30, 2008); BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical Research, Biotech Business Week, retrieved 17 November 2009 from LexisNexis Academic
  38. UK Researchers Given Data Mining Right Under New UK Copyright Laws. Out-Law.com. Archived at the Internet Archive, June 9, 2014. Retrieved 14 November 2014.
  39. "Licences for Europe - Structured Stakeholder Dialogue 2013". European Commission. Retrieved 14 November 2014.
  40. "Text and Data Mining:Its importance and the need for change in Europe". Association of European Research Libraries. Retrieved 14 November 2014.
  41. "Judge grants summary judgment in favor of Google Books — a fair use victory". Lexology.com. Antonelli Law Ltd. Retrieved 14 November 2014.



This page was moved from wikipedia:en:Data mining. Its edit history can be viewed at 数据挖掘/edithistory