Data mining
Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.[1] Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use.[1][2][3][4] Data mining is the analysis step of the "knowledge discovery in databases" process or KDD.[5] Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[1]
The term "data mining" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself.[6] It is also a buzzword[7] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java[8] (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons.[9] Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.
The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall KDD process as additional steps.
The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.[10]
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
Etymology
In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" was used in a similarly critical way by economist Michael Lovell in an article published in the Review of Economic Studies in 1983.[11][12] Lovell indicates that the practice "masquerades under a variety of aliases, ranging from 'experimentation' (positive) to 'fishing' or 'snooping' (negative)".
The term data mining appeared around 1990 in the database community, generally with positive connotations. For a short time in the 1980s, the phrase "database mining"™ was used, but because it had been trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation,[13] researchers turned to data mining instead. Other terms used include data archaeology, information harvesting, information discovery, and knowledge extraction. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the same topic (KDD-1989, http://www.kdnuggets.com/meetings/kdd89/), and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities.[14] Currently, the terms data mining and knowledge discovery are used interchangeably.
In the academic community, the major forums for research started in 1995, when the First International Conference on Data Mining and Knowledge Discovery (KDD-95) was held in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the Kluwer journal Data Mining and Knowledge Discovery as its founding editor-in-chief. Later he started the SIGKDD newsletter, SIGKDD Explorations.[15] The KDD International conference became the leading conference in data mining, with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field.
Background
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, especially in the field of machine learning, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[16] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.
Process
The knowledge discovery in databases (KDD) process is commonly defined with the stages:
- Selection
- Pre-processing
- Transformation
- Data mining
- Interpretation/evaluation.[5]
There are, however, many variations on this theme, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.
Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners.[17] The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[18] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[19]
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
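As a minimal sketch of the cleaning step described above, the following Python function first drops observations with a missing value and then drops values that lie far from the median. The function name `clean`, the `max_dev` threshold, the sample `readings`, and the choice of the median absolute deviation as a noise criterion are all illustrative assumptions; real projects choose cleaning rules to suit the data at hand.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and NaN as missing data."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean(records, field, max_dev=5.0):
    """Drop records whose `field` is missing or far from the median.

    `max_dev` is a hypothetical threshold: a value is treated as noise when
    it lies more than `max_dev` median absolute deviations from the median.
    """
    # Step 1: remove observations with missing data.
    present = [r for r in records if not is_missing(r.get(field))]
    if not present:
        return []

    values = [r[field] for r in present]
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return present  # no spread, so nothing can be flagged as noise

    # Step 2: remove observations flagged as noise.
    return [r for r in present if abs(r[field] - med) / mad <= max_dev]


# Hypothetical sensor readings with two missing values and one noisy value.
readings = [
    {"id": 1, "temp": 20.1},
    {"id": 2, "temp": None},
    {"id": 3, "temp": 19.8},
    {"id": 4, "temp": 20.4},
    {"id": 5, "temp": 250.0},
    {"id": 6, "temp": float("nan")},
    {"id": 7, "temp": 20.0},
]

cleaned = clean(readings, "temp")
print([r["id"] for r in cleaned])
```

Dropping records is only one option; depending on the application, missing values may instead be imputed, and apparent noise may turn out to be exactly the anomaly a later step is looking for.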
Data mining
Data mining involves six common classes of tasks:[5]
- Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
- Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
- Clustering – The task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
- Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
- Regression – Attempts to find a function that models the data with the least error, that is, for estimating the relationships among data or datasets.
- Summarization – Providing a more compact representation of the data set, including visualization and report generation.
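The market basket example from the list above can be sketched in a few lines of Python. The sketch counts how often pairs of items occur together and reports rules by their support (the fraction of baskets containing both items) and confidence (the fraction of baskets with the first item that also contain the second). The function name `pair_rules`, the thresholds, and the baskets are hypothetical, and restricting the search to pairs is a simplification; practical association rule miners also handle larger item sets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore repeated items in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue                         # pair is too rare to be of interest
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical supermarket transactions.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["beer", "chips"],
    ["bread", "butter", "jam"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets the rule "butter -> bread" has confidence 1.0 because every basket with butter also contains bread, while "bread -> milk" is rejected because only half of the bread baskets contain milk.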
Results validation
An example of data produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the winning word of a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.
Data mining can unintentionally be misused, producing results that appear to be significant but that do not actually predict future behavior, cannot be reproduced on a new sample of data, and are therefore of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split (when applicable at all) may not be sufficient to prevent this from happening.[20]
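The effect of investigating too many hypotheses can be illustrated with a small simulation. In the sketch below, every series is pure random noise, yet the best of many candidate series correlates strongly with the target simply because so many candidates were tried. The function names, the number of candidates, the sample size, and the seed are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def dredge(n_candidates=1000, n_points=10, seed=0):
    """Search unrelated noise series for the one that best 'explains' a target."""
    rng = random.Random(seed)

    def noise(k):
        return [rng.gauss(0, 1) for _ in range(k)]

    target = noise(n_points)
    candidates = [noise(n_points) for _ in range(n_candidates)]
    scores = [abs(pearson(c, target)) for c in candidates]
    best_in_sample = max(scores)

    # All series are independent noise, so new observations of the "winning"
    # variable and of the target are simply fresh independent draws.
    fresh = abs(pearson(noise(n_points), noise(n_points)))
    return best_in_sample, fresh


in_sample, fresh = dredge()
print(f"best |r| among candidates: {in_sample:.2f}")
print(f"|r| on a fresh sample:     {fresh:.2f}")
```

The in-sample winner looks impressive only because it was selected after the fact; a correlation measured on fresh data carries no such selection effect, which is why reproduction on a new sample is the relevant check.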
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.
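The spam example above can be sketched as follows. A deliberately naive word-scoring classifier is trained on a handful of messages and then evaluated both on its training set and on a held-out test set. The classifier, the function names, and all messages are hypothetical; the point is only that accuracy measured on the training data can overstate how well the learned patterns carry over to unseen data.

```python
from collections import Counter


def train(examples):
    """Score each word: +1 per spam message containing it, -1 per legitimate one."""
    scores = Counter()
    for text, label in examples:
        for word in set(text.lower().split()):
            scores[word] += 1 if label == "spam" else -1
    return scores


def classify(scores, text):
    """Label a message as spam when its known words sum to a positive score."""
    total = sum(scores[word] for word in set(text.lower().split()))
    return "spam" if total > 0 else "legitimate"


def accuracy(scores, examples):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(classify(scores, text) == label for text, label in examples)
    return correct / len(examples)


# Hypothetical training and test messages.
training_set = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday", "legitimate"),
]
test_set = [
    ("free prize inside", "spam"),
    ("agenda for lunch", "legitimate"),
    ("money offer now", "spam"),
    ("project update now", "legitimate"),
]

model = train(training_set)
print("training accuracy:", accuracy(model, training_set))  # 1.0
print("test accuracy:    ", accuracy(model, test_set))      # 0.75
```

The model classifies every training message correctly but mislabels "project update now" in the test set, because the word "now" happened to appear only in spam during training. That gap between training and test accuracy is the overfitting the held-out test set is meant to expose.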
If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
Research
The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[21][22] Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings,[23] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[24]
Computer science conferences on data mining include:
Data mining topics are also present at many data management/database conferences, such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases.
Standards
There have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.
For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.[25]
Notable uses
Data mining is used wherever there is digital data available today. Notable examples of data mining can be found throughout business, medicine, science, and surveillance.
Privacy concerns and ethics
While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).[26]
The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics.[27] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[28][29]
Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data (possibly from various sources) in a way that facilitates analysis, but that might also make private, individual-level data deducible or otherwise apparent.[30] This is not data mining per se, but a result of the preparation of data before, and for the purposes of, the analysis. The threat to an individual's privacy comes into play when the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.[31][32][33]
It is recommended to be aware of the following before data are collected:[30]
- The purpose of the data collection and any (known) data mining projects;
- How the data will be used;
- Who will be able to mine the data and use the data and their derivatives;
- The status of security surrounding access to the data;
- How collected data can be updated.
Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[30] However, even "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[34]
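One way to reason about whether modified data are "anonymous enough" is k-anonymity: every combination of quasi-identifier values should be shared by at least k records. The sketch below is an illustration under stated assumptions, not a method described in the cited sources. The field names, the sample rows, and the generalization rule (truncate the ZIP code, bucket the birth year by decade) are all hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values (0 for an empty data set)."""
    groups = Counter(
        tuple(row[field] for field in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


def generalize(row):
    """Coarsen a row: keep three ZIP digits, round birth year down to a decade."""
    return {
        **row,
        "zip": row["zip"][:3] + "**",
        "birth_year": row["birth_year"] // 10 * 10,
    }


# Hypothetical, made-up rows.
rows = [
    {"zip": "02138", "birth_year": 1961, "query": "a"},
    {"zip": "02139", "birth_year": 1964, "query": "b"},
    {"zip": "02141", "birth_year": 1968, "query": "c"},
]
QUASI_IDENTIFIERS = ("zip", "birth_year")

k_before = k_anonymity(rows, QUASI_IDENTIFIERS)
k_after = k_anonymity([generalize(r) for r in rows], QUASI_IDENTIFIERS)
print(k_before, k_after)
```

A check like this covers only the listed fields. It says nothing about free-text content such as search queries, which is the kind of information that allowed identification in the AOL release.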
The inadvertent revelation of personally identifiable information leading to the provider violates Fair Information Practices. This indiscretion can cause financial, emotional, or bodily harm to the indicated individual. In one instance of privacy violation, patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies, which in turn provided the data to pharmaceutical companies.[35]
Situation in Europe
Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.-E.U. Safe Harbor Principles, developed between 1998 and 2000, effectively exposed European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosures, discussion of revoking this agreement increased, in particular because the data would be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States failed. On October 6, 2015, the Court of Justice of the European Union (CJEU) invalidated Safe Harbor.[36]
Situation in the United States
In the United States, the US Congress has addressed privacy concerns by passing regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach[ing] a level of incomprehensibility to average individuals."[37] This underscores the necessity for data anonymity in data aggregation and mining practices.
U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.
Copyright law
Situation in Europe
Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, it may be that there is no copyright, but database rights may exist, so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, the UK government amended its copyright law in 2014 to allow content mining as a limitation and exception.[38] The UK was the second country in the world to do so after Japan, which introduced an exception for data mining in 2009. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.
The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.[39] The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.[40]
Situation in the United States
Under US copyright law, and in particular its provision for fair use, content mining is legal in America, as it is in other fair use countries such as Israel, Taiwan and South Korea. As content mining is transformative, that is, it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed, one being text and data mining.[41]
Software
Free open-source data mining software and applications
The following applications are available under free/open-source licenses. Public access to application source code is also available.
- Carrot2: Text and search results clustering framework.
- Chemicalize.org: A chemical structure miner and web search engine.
- ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
- GATE: a natural language processing and language engineering tool.
- KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
- Massive Online Analysis (MOA): A tool for real-time big data stream mining with concept drift, written in the Java programming language.
- MEPX: A cross-platform tool for regression and classification problems based on a Genetic Programming variant.
- ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
- NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
- OpenNN: Open neural networks library.
- Orange: A component-based data mining and machine learning software suite written in the Python language.
- R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
- scikit-learn: An open-source machine learning library for the Python programming language.
- Torch: An open-source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.
- UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM.
Proprietary data-mining software and applications
The following applications are available under proprietary licenses.
- Angoss KnowledgeSTUDIO: data mining tool
- LIONsolver: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
- Megaputer Intelligence: data and text mining software called PolyAnalyst.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- NetOwl: suite of multilingual text and entity analytics products that enable data mining.
- Oracle Data Mining: data mining software by Oracle Corporation.
- PSeven: platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining provided by DATADVANCE.
- Qlucore Omics Explorer: data mining software.
- RapidMiner: An environment for machine learning and data mining experiments.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- SPSS Modeler: data mining software provided by IBM.
- STATISTICA Data Miner: data mining software provided by StatSoft.
- Tanagra: Visualisation-oriented data mining software, also for teaching.
- Vertica: data mining software provided by Hewlett-Packard.
See also
- Methods
- Application domains
- Application examples
- Related topics
For more information about extracting information out of data (as opposed to analyzing data), see:
- Other resources
References
- [1] "Data Mining Curriculum". ACM SIGKDD. 2006-04-30. Retrieved 2014-01-27.
- [2] Clifton, Christopher (2010). "Encyclopædia Britannica: Definition of Data Mining". Retrieved 2010-12-09.
- [3] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction". Archived from the original on 2009-11-10. Retrieved 2012-08-07.
- [4] Han, Jiawei; Kamber, Micheline; Pei, Jian (June 9, 2011). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann. ISBN 978-0-12-381479-1.
- [5] Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases" (PDF). Retrieved 17 December 2008.
- [6] Han, Jiawei; Kamber, Micheline (2001). Data mining: concepts and techniques. Morgan Kaufmann. p. 5. ISBN 978-1-55860-489-6. "Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long"
- [7] OKAIRP 2005 Fall Conference, Arizona State University. Archived 2014-02-01 at the Internet Archive.
- [8] Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0.
- [9] Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA Experiences with a Java open-source project". Journal of Machine Learning Research. 11: 2533–2541.
  "the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons."
- [10] Olson, D. L. (2007). Data mining in business services. Service Business, 1(3), 181-193. doi:10.1007/s11628-006-0014-7
- [11] Lovell, Michael C. (1983). "Data Mining". The Review of Economics and Statistics. 65 (1): 1–12. doi:10.2307/1924403. JSTOR 1924403.
- [12] Charemza, Wojciech W.; Deadman, Derek F. (1992). "Data Mining". New Directions in Econometric Practice. Aldershot: Edward Elgar. pp. 14–31. ISBN 1-85278-461-X.
- [13] Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, and Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4.
- [14] Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.
- [15] Fayyad, Usama (15 June 1999). "First Editorial by Editor-in-Chief". SIGKDD Explorations. 13 (1): 102. doi:10.1145/2207243.2207269. Retrieved 27 December 2010.
- [16] Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 978-0-471-22852-3. OCLC 50055336. https://archive.org/details/dataminingconcep0000kant.
- [17] Gregory Piatetsky-Shapiro (2002) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2004) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2007) KDnuggets Methodology Poll, Gregory Piatetsky-Shapiro (2014) KDnuggets Methodology Poll
- [18] Lukasz Kurgan and Petr Musilek (2006); A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. Volume 21 Issue 1, March 2006, pp 1–24, Cambridge University Press, New York, NY, USA
- [19] Azevedo, A. and Santos, M. F. KDD, SEMMA and CRISP-DM: a parallel overview. Archived 2013-01-09 at the Internet Archive. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.
- [20] Hawkins, Douglas M (2004). "The problem of overfitting". Journal of Chemical Information and Computer Sciences. 44 (1): 1–12. doi:10.1021/ci0342472. PMID 14741005.
- [21] "Microsoft Academic Search: Top conferences in data mining". Microsoft Academic Search.
- [22] "Google Scholar: Top publications - Data Mining & Analysis". Google Scholar.
- [23] Proceedings. Archived 2010-04-30 at the Internet Archive. International Conferences on Knowledge Discovery and Data Mining, ACM, New York.
- [24] SIGKDD Explorations, ACM, New York.
- [25] Günnemann, Stephan; Kremer, Hardy; Seidl, Thomas (2011). "An extension of the PMML standard to subspace clustering models". Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11. p. 48. doi:10.1145/2023598.2023605. ISBN 978-1-4503-0837-3.
- [26] Seltzer, William (2005). "The Promise and Pitfalls of Data Mining: Ethical Issues" (PDF). ASA Section on Government Statistics. American Statistical Association.
- [27] Pitts, Chip (15 March 2007). "The End of Illegal Domestic Spying? Don't Count on It". Washington Spectator. Archived from the original on 2007-11-28.
- [28] Taipale, Kim A. (15 December 2003). "Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data". Columbia Science and Technology Law Review. 5 (2). OCLC 45263753. SSRN 546782.
- [29] Resig, John. "A Framework for Mining Instant Messaging Services" (PDF). Retrieved 16 March 2018.
- [30] Think Before You Dig: Privacy Implications of Data Mining & Aggregation. Archived 2008-12-17 at the Internet Archive. NASCIO Research Brief, September 2004.
- [31] Ohm, Paul. "Don't Build a Database of Ruin". Harvard Business Review.
- [32] Darwin Bond-Graham, Iron Cagebook - The Logical End of Facebook's Patents, Counterpunch.org, 2013.12.03
- [33] Darwin Bond-Graham, Inside the Tech industry's Startup Conference, Counterpunch.org, 2013.09.11
- [34] AOL search data identified individuals, SecurityFocus, August 2006
- [35] Kshetri, Nir (2014). "Big data's impact on privacy, security and consumer welfare" (PDF). Telecommunications Policy. 38 (11): 1134–1145. doi:10.1016/j.telpol.2014.10.002.
- [36] Weiss, Martin A.; Archick, Kristin (19 May 2016). "U.S.-E.U. Data Privacy: From Safe Harbor to Privacy Shield" (PDF). Washington, D.C. Congressional Research Service. p. 6. R44257. Retrieved 9 April 2020.
  "On October 6, 2015, the CJEU ... issued a decision that invalidated Safe Harbor (effective immediately), as currently implemented."
- [37] Biotech Business Week Editors (June 30, 2008); BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical Research, Biotech Business Week, retrieved 17 November 2009 from LexisNexis Academic
- [38] UK Researchers Given Data Mining Right Under New UK Copyright Laws. Archived June 9, 2014 at the Internet Archive. Out-Law.com. Retrieved 14 November 2014.
- [39] "Licences for Europe - Structured Stakeholder Dialogue 2013". European Commission. Retrieved 14 November 2014.
- [40] "Text and Data Mining: Its importance and the need for change in Europe". Association of European Research Libraries. Retrieved 14 November 2014.
- [41] "Judge grants summary judgment in favor of Google Books – a fair use victory". Lexology.com. Antonelli Law Ltd. Retrieved 14 November 2014.
Further reading
- Cabena, Peter; Hadjnian, Pablo; Stadler, Rolf; Verhees, Jaap; Zanasi, Alessandro (1997); Discovering Data Mining: From Concept to Implementation, Prentice Hall
- Chen, M.S.; Han, J.; Yu, P.S. (1996); "Data mining: an overview from a database perspective", IEEE Transactions on Knowledge and Data Engineering 8 (6), 866–883
- Feldman, Ronen; Sanger, James (2007); The Text Mining Handbook, Cambridge University Press
- Guo, Yike; and Grossman, Robert (editors) (1999); High Performance Data Mining: Scaling Algorithms, Applications and Systems, Kluwer Academic Publishers
- Han, Jiawei; Kamber, Micheline; and Pei, Jian (2006); Data Mining: Concepts and Techniques, Morgan Kaufmann
- Hastie, Trevor; Tibshirani, Robert; and Friedman, Jerome (2001); The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer
- Liu, Bing (2007, 2011); Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Springer
- Murphy, Chris (16 May 2011). "Is Data Mining Free Speech?". InformationWeek: 12.
- Nisbet, Robert; Elder, John; Miner, Gary (2009); Handbook of Statistical Analysis & Data Mining Applications, Academic Press/Elsevier
- Poncelet, Pascal; Masseglia, Florent; and Teisseire, Maguelonne (editors) (October 2007); "Data Mining Patterns: New Methods and Applications", Information Science Reference
- Tan, Pang-Ning; Steinbach, Michael; and Kumar, Vipin (2005); Introduction to Data Mining
- Theodoridis, Sergios; and Koutroumbas, Konstantinos (2009); Pattern Recognition, 4th Edition, Academic Press
- Weiss, Sholom M.; and Indurkhya, Nitin (1998); Predictive Data Mining, Morgan Kaufmann
- Witten, Ian H.; Frank, Eibe; Hall, Mark A. (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3 ed.). Elsevier. ISBN 978-0-12-374856-0. (See also Free Weka software)
- Ye, Nong (2003); The Handbook of Data Mining, Mahwah, NJ: Lawrence Erlbaum
External links
Category:Formal sciences
This page was moved from wikipedia:en:Data mining. Its edit history can be viewed at 数据挖掘/edithistory