Changes

Removed 28,630 bytes, 23:19, 22 September 2020 (Tuesday)
No edit summary
Line 40: Line 40:
==Etymology==
    
In the 1960s, statisticians and economists used terms such as "data fishing" or "data dredging" to refer to what they considered the bad practice of analyzing data without an a priori hypothesis. The economist Michael Lovell used the term "data mining" in a similarly critical way in an article published in ''The Review of Economics and Statistics'' in 1983.<ref>{{Cite journal|last=Lovell|first=Michael C.|date=1983|title=Data Mining|journal=The Review of Economics and Statistics|volume=65|issue=1|pages=1–12|doi=10.2307/1924403|jstor=1924403}}</ref><ref>{{cite book |first=Wojciech W. |last=Charemza |first2=Derek F. |last2=Deadman |title=New Directions in Econometric Practice |location=Aldershot |publisher=Edward Elgar |year=1992 |chapter=Data Mining |pages=14–31 |isbn=1-85278-461-X }}</ref> Lovell noted that the practice goes by many aliases, ranging from the positive "experimentation" to the negative "fishing" or "snooping".
Line 60: Line 60:
==Background==
Line 104: Line 104:
   --[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Add at the beginning: "Surveys conducted in 2002, 2004, 2007 and 2014 show that CRISP-DM is the standard most commonly used by data miners."
===Pre-processing===
    
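Before data mining algorithms are applied, the target data set must first be put together. Because data mining can only find patterns that are actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source of data is a data mart or data warehouse. Pre-processing multivariate data sets before data mining is essential. The target set is then cleaned. Data cleaning removes observations that contain noise and observations with missing data.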
Line 110: Line 110:
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a '''<font color="#ff8000">data mart</font>''' or '''<font color="#ff8000">data warehouse</font>'''. Pre-processing is essential to analyze '''<font color="#ff8000">multivariate</font>''' data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing '''<font color="#ff8000">noise</font>''' and those with '''<font color="#ff8000">missing data</font>'''.
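The cleaning step described above can be illustrated with a minimal sketch. The records, the field names (`age`, `income`) and the plausibility ranges in `VALID_RANGES` are all hypothetical and chosen only for illustration; real projects would define noise according to their own domain.

```python
# Hypothetical observations; None marks a missing value.
observations = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},   # missing data
    {"age": 29, "income": -1.0},        # noise: income cannot be negative
    {"age": 41, "income": 61000.0},
    {"age": 230, "income": 58000.0},    # noise: implausible age
]

# Assumed plausibility ranges; real limits depend on the domain.
VALID_RANGES = {"age": (0, 120), "income": (0.0, 1_000_000.0)}


def is_clean(row):
    """Return True if every field is present and inside its valid range."""
    for field, (low, high) in VALID_RANGES.items():
        value = row.get(field)
        if value is None or not (low <= value <= high):
            return False
    return True


def clean(rows):
    """Drop observations that contain missing or noisy values."""
    return [row for row in rows if is_clean(row)]


cleaned = clean(observations)
print(len(observations), "->", len(cleaned))
```

Dropping whole observations is the approach the paragraph describes; other strategies, such as imputing missing values, are possible but are not shown here.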
===Data mining===
    
Data mining involves six common classes of tasks:<ref name="Fayyad">{{cite web |last1=Fayyad |first1=Usama |authorlink1=Usama Fayyad |last2=Piatetsky-Shapiro |first2=Gregory|authorlink2=Gregory Piatetsky-Shapiro |last3=Smyth |first3=Padhraic |title=From Data Mining to Knowledge Discovery in Databases |year=1996 |url=http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf |accessdate = 17 December 2008 }}</ref>
Line 119: Line 119:
* '''<font color="#ff8000">Association rule learning</font>''' (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information to improve its marketing strategy. This is sometimes referred to as "market basket analysis".
   −
* [[Cluster analysis|Clustering]] – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
+
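*'''<font color="#ff8000">Clustering</font>''': the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.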
   −
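'''<font color="#ff8000">Clustering</font>''': the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.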
+
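*'''<font color="#ff8000">Classification</font>''': the task of generalizing known structure and applying it to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".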
 
  −
* [[Statistical classification|Classification]] – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
  −
 
  −
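'''<font color="#ff8000">Classification</font>''': the task of generalizing known structure into new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".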
      
   --[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Changed "the task of generalizing known structure into new data" to "the task of generalizing known structure and applying it to new data"
   −
* [[Regression analysis|Regression]] – attempts to find a function that models the data with the least error that is, for estimating the relationships among data or datasets.
+
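*'''<font color="#ff8000">Regression</font>''': attempts to find a function that models the data with the least error, that is, a function for estimating the relationships among data or datasets.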
 
  −
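'''<font color="#ff8000">Regression</font>''': attempts to find a function that models the data with the least error, that is, a function for estimating the relationships among data or datasets.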
  −
 
  −
* [[Automatic summarization|Summarization]] – providing a more compact representation of the data set, including visualization and report generation.
     −
'''<font color="#ff8000">Automatic summarizatio</font>''': provides a more compact, concise representation of the data set, including visualization and report generation.
+
*'''<font color="#ff8000">Summarization</font>''': provides a more compact, concise representation of the data set, including visualization and report generation.
    
   --[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Changed the term "Automatic summarizatio" to "Summarization"
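As an illustration of the supermarket example given under association rule learning, the following sketch counts how often pairs of products appear together. The baskets are invented, and the `support` and `confidence` helpers are simplified textbook measures rather than a full rule-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (one set of products per basket).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so that each pair has one canonical (a, b) ordering.
    pair_counts.update(combinations(sorted(basket), 2))


def support(pair):
    """Fraction of baskets that contain both items of the pair."""
    return pair_counts[pair] / len(baskets)


def confidence(antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


def frequent_pairs(min_support):
    """Pairs bought together in at least min_support of all baskets."""
    return sorted(p for p in pair_counts if support(p) >= min_support)


print(frequent_pairs(0.5))
print(confidence("butter", "milk"), confidence("milk", "butter"))
```

In this toy data every basket with butter also contains milk, while only three of the four baskets with milk contain butter, which is why confidence depends on the direction of the rule.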
===Results validation===
 
  −
[[File:Spurious correlations - spelling bee spiders.svg|thumb|An example of data produced by [[data dredging]] through a bot operated by statistician Tyler Vigen, apparently showing a close link between the best word winning a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.]]
  −
 
  −
An example of data produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the best word winning a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.
      
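An example of data produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the best word winning a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously just a coincidence.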
   −
Data mining can unintentionally be misused, and can then produce results that appear to be significant; but which do not actually predict future behavior and cannot be [[Reproducibility|reproduced]] on a new sample of data and bear little use. Often this results from investigating too many hypotheses and not performing proper [[statistical hypothesis testing]]. A simple version of this problem in [[machine learning]] is known as [[overfitting]], but the same problem can arise at different phases of the process and thus a train/test split—when applicable at all—may not be sufficient to prevent this from happening.<ref name=hawkins>{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | url = | journal = Journal of Chemical Information and Computer Sciences | volume = 44 | issue = 1| pages = 1–12 | doi=10.1021/ci0342472| pmid = 14741005 }}</ref>
+
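Data mining can unintentionally be misused and can then produce results that appear to be significant but that do not actually predict future behavior, cannot be reproduced on a new sample of data, and are of little use. Often this results from investigating too many hypotheses and not performing proper '''<font color="#ff8000">statistical hypothesis testing</font>'''. A simple version of this problem in machine learning is known as '''<font color="#ff8000">overfitting</font>''', but the same problem can arise at different phases of the process, and thus a train/test split, when applicable at all, may not be sufficient to prevent this from happening.<ref name=hawkins>{{cite journal | last1 = Hawkins | first1 = Douglas M | year = 2004 | title = The problem of overfitting | url = | journal = Journal of Chemical Information and Computer Sciences | volume = 44 | issue = 1| pages = 1–12 | doi=10.1021/ci0342472| pmid = 14741005 }}</ref>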
 
  −
Data mining can unintentionally be misused, and can then produce results that appear to be significant; but which do not actually predict future behavior and cannot be reproduced on a new sample of data and bear little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process and thus a train/test split—when applicable at all—may not be sufficient to prevent this from happening.
  −
 
  −
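Data mining can unintentionally be misused and can then produce results that appear to be significant but that do not actually predict future behavior, cannot be reproduced on a new sample of data, and are of little use. Often this results from investigating too many hypotheses and not performing proper '''<font color="#ff8000">statistical hypothesis testing</font>'''. A simple version of this problem in machine learning is known as '''<font color="#ff8000">overfitting</font>''', but the same problem can arise at different phases of the process, and thus a train/test split, when applicable at all, may not be sufficient to prevent this from happening.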
      
{{Missing information|section|non-classification tasks in data mining. It only covers [[machine learning]]|date=September 2011}}
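The effect of investigating too many hypotheses can be reproduced with purely random data. The sketch below is an illustrative simulation, not taken from the cited sources: it correlates one random series with many unrelated random series and compares the first candidate with the best one found by searching. The function name `dredge` and the parameter values are assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_candidates, n_points, seed=0):
    """Correlate one random target with many unrelated random series.

    Returns the absolute correlation of the first candidate and of the
    best candidate found by searching through all of them.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    scores = []
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        scores.append(abs(pearson(target, candidate)))
    return scores[0], max(scores)


first, best = dredge(n_candidates=1000, n_points=10)
print(f"single pre-chosen hypothesis: |r| = {first:.2f}")
print(f"best of 1000 hypotheses:      |r| = {best:.2f}")
```

Because none of the series are related, any strong correlation reported for the best candidate is a product of the search itself, which is the situation the spelling-bee chart illustrates.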
   −
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called [[overfitting]]. To overcome this, the evaluation uses a [[test set]] of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a [[training set]] of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had ''not'' been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as [[Receiver operating characteristic|ROC curves]].
+
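The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid: it is common for data mining algorithms to find patterns in the training set that are not present in the general data set, which is called '''<font color="#ff8000">overfitting</font>'''. To solve this problem, the evaluation uses a set of test data that was not used to train the data mining algorithm. The learned patterns are applied to this '''<font color="#ff8000">test set</font>''', and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a '''<font color="#ff8000">training set</font>''' of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which the algorithm had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as '''<font color="#ff8000">ROC curves</font>'''.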
 
  −
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.
  −
 
  −
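The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid: it is common for data mining algorithms to find patterns in the training set that are not present in the general data set, which is called '''<font color="#ff8000">overfitting</font>'''. To overcome this, the evaluation uses a set of test data on which the data mining algorithm was not trained. The learned patterns are applied to this '''<font color="#ff8000">test set</font>''', and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" e-mails would be trained on a '''<font color="#ff8000">training set</font>''' of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which the algorithm had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as '''<font color="#ff8000">ROC curves</font>'''.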
      
   --[[用户:Thingamabob|Thingamabob]] ([[用户讨论:Thingamabob|talk]]) [Review] Changed "To overcome this, the evaluation uses a set of test data on which the data mining algorithm was not trained" to "To solve this problem, the evaluation uses a set of test data that was not used to train the data mining algorithm"
  −
If the learned patterns do not meet the desired standards, subsequently it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
  −
  −
If the learned patterns do not meet the desired standards, subsequently it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.
      
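If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, the final step is to interpret the learned patterns and turn them into knowledge.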
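The spam example in this section can be sketched as follows. The data, the single "spam score" feature and the threshold classifier are hypothetical simplifications; the area under the ROC curve is computed here with the pairwise-ranking formulation rather than by plotting the curve.

```python
# Hypothetical (spam_score, is_spam) pairs; the score could be, for example,
# the fraction of words in a message that appear on a spam word list.
training_set = [
    (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
    (0.5, 0), (0.6, 1), (0.7, 1), (0.9, 1),
]
test_set = [(0.15, 0), (0.35, 0), (0.45, 1), (0.55, 0), (0.8, 1), (0.95, 1)]


def accuracy(threshold, data):
    """Fraction of e-mails classified correctly by 'score >= threshold'."""
    hits = sum((score >= threshold) == bool(label) for score, label in data)
    return hits / len(data)


def fit_threshold(data):
    """'Train' by picking the observed score that maximizes training accuracy."""
    candidates = sorted(score for score, _ in data)
    return max(candidates, key=lambda t: accuracy(t, data))


def roc_auc(data):
    """Area under the ROC curve: chance that a spam outscores a legitimate mail."""
    pos = [s for s, label in data if label == 1]
    neg = [s for s, label in data if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


threshold = fit_threshold(training_set)
print("threshold:", threshold)
print("training accuracy:", accuracy(threshold, training_set))
print("test accuracy:", accuracy(threshold, test_set))
print("test AUC:", roc_auc(test_set))
```

The threshold is chosen using only the training set, and the test set is consulted only afterwards, so the test accuracy is an estimate of how the learned pattern behaves on e-mails it has not seen.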
==Research==
 
  −
The premier professional body in the field is the [[Association for Computing Machinery]]'s (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining ([[SIGKDD]]).<ref>{{cite web|url=http://academic.research.microsoft.com/?SearchDomain=2&SubDomain=7&entitytype=2|title=Microsoft Academic Search: Top conferences in data mining | publisher=[[Microsoft Academic Search]]}}</ref><ref>{{cite web|url=https://scholar.google.de/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis|title=Google Scholar: Top publications - Data Mining & Analysis|publisher=[[Google Scholar]]}}</ref> Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings,<ref>[http://www.kdd.org/conferences.php Proceedings] {{Webarchive|url=https://web.archive.org/web/20100430120252/http://www.kdd.org/conferences.php |date=2010-04-30 }}, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.</ref> and since 1999 it has published a biannual [[academic journal]] titled "SIGKDD Explorations".<ref>[http://www.kdd.org/explorations/about.php SIGKDD Explorations], ACM, New York.</ref>
  −
 
  −
The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD). Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings, and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".
  −
 
  −
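The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD). Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings, and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".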
      +
The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).<ref>{{cite web|url=http://academic.research.microsoft.com/?SearchDomain=2&SubDomain=7&entitytype=2|title=Microsoft Academic Search: Top conferences in data mining | publisher=[[Microsoft Academic Search]]}}</ref><ref>{{cite web|url=https://scholar.google.de/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis|title=Google Scholar: Top publications - Data Mining & Analysis|publisher=[[Google Scholar]]}}</ref> Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings,<ref>[http://www.kdd.org/conferences.php Proceedings] {{Webarchive|url=https://web.archive.org/web/20100430120252/http://www.kdd.org/conferences.php |date=2010-04-30 }}, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.</ref> and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".<ref>[http://www.kdd.org/explorations/about.php SIGKDD Explorations], ACM, New York.</ref>
   −
Computer science conferences on data mining include:
  −
  −
Computer science conferences on data mining include:
      
Computer science conferences on data mining include:
    +
*CIKM Conference: ACM '''<font color="#ff8000">Conference on Information and Knowledge Management</font>'''
    +
*'''<font color="#ff8000">European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases</font>'''
   −
* [[CIKM Conference]] – ACM [[Conference on Information and Knowledge Management]]
+
*KDD Conference: ACM SIGKDD '''<font color="#ff8000">Conference on Knowledge Discovery and Data Mining</font>'''
 
  −
CIKM Conference: ACM '''<font color="#ff8000">Conference on Information and Knowledge Management</font>'''.
  −
 
  −
* [[European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases]]
  −
 
  −
'''<font color="#ff8000">European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases</font>'''
  −
 
  −
* [[KDD Conference]] – ACM SIGKDD [[Conference on Knowledge Discovery and Data Mining]]
  −
 
  −
KDD Conference: ACM SIGKDD '''<font color="#ff8000">Conference on Knowledge Discovery and Data Mining</font>'''
  −
 
  −
Data mining topics are also present on many [[List of computer science conferences#Data Management|data management/database conferences]] such as the ICDE Conference, [[SIGMOD|SIGMOD Conference]] and [[International Conference on Very Large Data Bases]]
  −
 
  −
Data mining topics are also present on many data management/database conferences such as the ICDE Conference, SIGMOD Conference and International Conference on Very Large Data Bases.
      
Data mining topics are also present at many data management/database conferences, such as the ICDE Conference, the '''<font color="#ff8000">SIGMOD Conference</font>''' and the '''<font color="#ff8000">International Conference on Very Large Data Bases</font>'''.
==Standards==
 
  −
There have been some efforts to define standards for the data mining process, for example, the 1999 European [[Cross Industry Standard Process for Data Mining]] (CRISP-DM 1.0) and the 2004 [[Java Data Mining]] standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.
  −
 
  −
There have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.
     −
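Some standards have been defined for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.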
+
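There have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.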
    
   --[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Changed "Some standards have been defined for the data mining process" to "There have been some efforts to define standards for the data mining process"
   −
For exchanging the extracted models – in particular for use in [[predictive analytics]]&nbsp;– the key standard is the [[Predictive Model Markup Language]] (PMML), which is an [[XML]]-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) [[subspace clustering]] have been proposed independently of the DMG.<ref>{{Cite book | last1 = Günnemann | first1 = Stephan | last2 = Kremer | first2 = Hardy | last3 = Seidl | first3 = Thomas | doi = 10.1145/2023598.2023605 | chapter = An extension of the PMML standard to subspace clustering models | title = Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11 | pages = 48 | year = 2011 | isbn = 978-1-4503-0837-3 | pmid =  | pmc = }}</ref>
+
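For exchanging the extracted models, in particular for use in predictive analytics, the key standard is the Predictive Model Markup Language (PMML), an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover, for example, subspace clustering have been proposed independently of the DMG.<ref>{{Cite book | last1 = Günnemann | first1 = Stephan | last2 = Kremer | first2 = Hardy | last3 = Seidl | first3 = Thomas | doi = 10.1145/2023598.2023605 | chapter = An extension of the PMML standard to subspace clustering models | title = Proceedings of the 2011 workshop on Predictive markup language modeling - PMML '11 | pages = 48 | year = 2011 | isbn = 978-1-4503-0837-3 | pmid =  | pmc = }}</ref>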
 
  −
For exchanging the extracted models – in particular for use in predictive analytics&nbsp;– the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.
     −
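For exchanging the extracted models, in particular for use in predictive analytics, the key standard is the Predictive Model Markup Language (PMML), an XML-based language developed by the Data Mining Group (DMG) and supported as an exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover, for example, subspace clustering have been proposed independently of the DMG.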
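To give a rough idea of what exchanging a model as XML looks like, the sketch below reads a small hand-written PMML-style document and evaluates the linear model it describes. The document is a simplified illustration that has not been validated against the DMG schema; the version number, the namespace and the model itself are assumptions, and real applications would rely on a dedicated PMML library.

```python
import xml.etree.ElementTree as ET

# A hand-written, simplified PMML-style document for y = 1.5 + 2.0 * x.
# It follows the general layout of a PMML regression model but has not
# been validated against the official DMG schema.
PMML_TEXT = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="hypothetical linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Namespace prefix mapping used in the ElementTree path expressions below.
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def load_linear_model(text):
    """Read the intercept and coefficients from the regression table."""
    root = ET.fromstring(text.strip())
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, inputs):
    """Evaluate intercept + sum(coefficient * input) for the loaded model."""
    intercept, coefficients = model
    return intercept + sum(c * inputs[name] for name, c in coefficients.items())


model = load_linear_model(PMML_TEXT)
print(predict(model, {"x": 3.0}))
```

The point of the format is that the application that trained the model and the application that scores new data only need to agree on the XML document, not on each other's internals.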
==Notable uses==
      
{{Main|Examples of data mining}}
Line 226: Line 174:
{{Category see also|Applied data mining}}
   −
 
+
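Data mining can be used wherever digital data is available. Notable applications of data mining can be found throughout business, medicine, science, and surveillance.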
 
  −
Data mining is used wherever there is digital data available today. Notable [[examples of data mining]] can be found throughout business, medicine, science, and surveillance.
  −
 
  −
Data mining is used wherever there is digital data available today. Notable examples of data mining can be found throughout business, medicine, science, and surveillance.
  −
 
  −
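Data mining can be used wherever digital data is available. Notable examples of data mining can be found in business, medicine, science, and surveillance.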
      
   --[[用户:Thingamabob|Thingamabob]] ([[用户讨论:Thingamabob|talk]]) [Review] Changed "Notable examples of data mining can be found in business, medicine, science, and surveillance." to "There are major applications of data mining in business, medicine, science, and regulation"
==Privacy concerns and ethics==
   −
While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to peoples' behavior (ethical and otherwise).<ref>{{cite journal |author=Seltzer, William |title=The Promise and Pitfalls of Data Mining: Ethical Issues |url=https://ww2.amstat.org/committees/ethics/linksdir/Jsm2005Seltzer.pdf|publisher = American Statistical Association|journal = ASA Section on Government Statistics|date = 2005 }}</ref>
+
While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).<ref>{{cite journal |author=Seltzer, William |title=The Promise and Pitfalls of Data Mining: Ethical Issues |url=https://ww2.amstat.org/committees/ethics/linksdir/Jsm2005Seltzer.pdf|publisher = American Statistical Association|journal = ASA Section on Government Statistics|date = 2005 }}</ref>
   −
While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to peoples' behavior (ethical and otherwise).
     −
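While the term "data mining" itself may have no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).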
+
The ways in which data mining can be used can, in some cases, raise questions regarding privacy, legality, and ethics.<ref>{{cite journal |author=Pitts, Chip |title=The End of Illegal Domestic Spying? Don't Count on It |url=http://www.washingtonspectator.com/articles/20070315surveillance_1.cfm |journal=Washington Spectator |date=15 March 2007 |url-status=dead |archiveurl=https://web.archive.org/web/20071128015201/http://www.washingtonspectator.com/articles/20070315surveillance_1.cfm |archivedate=2007-11-28 }}</ref> In particular, data mining of government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.<ref>{{cite journal |author=Taipale, Kim A. |title=Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data |url=http://www.stlr.org/cite.cgi?volume=5&article=2 |journal=Columbia Science and Technology Law Review |volume=5 |issue=2 |date=15 December 2003 |ssrn=546782 |oclc=45263753 }}</ref><ref>{{cite web|last1=Resig|first1=John|title=A Framework for Mining Instant Messaging Services|url=https://johnresig.com/files/research/SIAMPaper.pdf|accessdate=16 March 2018}}</ref>
      −
 
+
Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through '''<font color="#ff8000">data aggregation</font>'''.<ref name="NASCIO">[http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf ''Think Before You Dig: Privacy Implications of Data Mining & Aggregation''] {{webarchive|url=https://web.archive.org/web/20081217063043/http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf |date=2008-12-17 }}, NASCIO Research Brief, September 2004</ref> Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent). This is not data mining per se, but a result of the preparation of data before, and for the purposes of, the analysis. When the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous, a threat to individual privacy arises.<ref>{{cite magazine |first=Paul |last=Ohm |title=Don't Build a Database of Ruin |magazine=Harvard Business Review |url=http://blogs.hbr.org/cs/2012/08/dont_build_a_database_of_ruin.html}}</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/12/03/iron-cagebook/ Iron Cagebook - The Logical End of Facebook's Patents], [[Counterpunch.org]], 2013.12.03</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/09/11/inside-the-tech-industrys-startup-conference/ Inside the Tech industry's Startup Conference], [[Counterpunch.org]], 2013.09.11</ref>
The ways in which data mining can be used can in some cases and contexts raise questions regarding [[privacy]], legality, and ethics.<ref>{{cite journal |author=Pitts, Chip |title=The End of Illegal Domestic Spying? Don't Count on It |url=http://www.washingtonspectator.com/articles/20070315surveillance_1.cfm |journal=Washington Spectator |date=15 March 2007 |url-status=dead |archiveurl=https://web.archive.org/web/20071128015201/http://www.washingtonspectator.com/articles/20070315surveillance_1.cfm |archivedate=2007-11-28 }}</ref> In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the [[Total Information Awareness]] Program or in [[ADVISE]], has raised privacy concerns.<ref>{{cite journal |author=Taipale, Kim A. |title=Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data |url=http://www.stlr.org/cite.cgi?volume=5&article=2 |journal=Columbia Science and Technology Law Review |volume=5 |issue=2 |date=15 December 2003 |ssrn=546782 |oclc=45263753 }}</ref><ref>{{cite web|last1=Resig|first1=John|title=A Framework for Mining Instant Messaging Services|url=https://johnresig.com/files/research/SIAMPaper.pdf|accessdate=16 March 2018}}</ref>
  −
 
  −
The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.
  −
 
  −
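The ways in which data mining can be used can, in some cases, raise questions regarding privacy, legality, and ethics. In particular, data mining of government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.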
  −
 
  −
 
  −
Data mining requires data preparation which uncovers information or patterns which compromise confidentiality and privacy obligations. A common way for this to occur is through [[aggregate function|data aggregation]]. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent).<ref name="NASCIO">[http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf ''Think Before You Dig: Privacy Implications of Data Mining & Aggregation''] {{webarchive|url=https://web.archive.org/web/20081217063043/http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf |date=2008-12-17 }}, NASCIO Research Brief, September 2004</ref> This is not data mining ''per se'', but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.<ref>{{cite magazine |first=Paul |last=Ohm |title=Don't Build a Database of Ruin |magazine=Harvard Business Review |url=http://blogs.hbr.org/cs/2012/08/dont_build_a_database_of_ruin.html}}</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/12/03/iron-cagebook/ Iron Cagebook - The Logical End of Facebook's Patents], [[Counterpunch.org]], 2013.12.03</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/09/11/inside-the-tech-industrys-startup-conference/ Inside the Tech industry's Startup Conference], [[Counterpunch.org]], 2013.09.11</ref>
  −
 
  −
Data mining requires data preparation which uncovers information or patterns which compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent). This is not data mining per se, but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.
  −
 
  −
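Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through '''<font color="#ff8000">data aggregation</font>'''. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent). This is not data mining per se, but a result of the preparation of data before, and for the purposes of, the analysis. When the data, once compiled, enable the data miner, or anyone who has access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous, the threat to individual privacy comes into play.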
   
   --[[用户:Zengsihang|Zengsihang]] ([[用户讨论:Zengsihang|talk]]) [Review] Changed "the threat to individual privacy comes into play" to "a threat to individual privacy arises"
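How aggregation can expose individuals may be easier to see with a toy example. In the sketch below, a data set without names is combined with a public register through shared attributes. All records, names and the choice of `QUASI_IDENTIFIERS` are invented for illustration.

```python
from collections import defaultdict

# Attributes assumed to be present in both data sets.
QUASI_IDENTIFIERS = ("zip_code", "birth_year", "sex")

# Hypothetical "anonymous" data set: names removed, sensitive field kept.
health_records = [
    {"zip_code": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "asthma"},
    {"zip_code": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip_code": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public data set that does contain names.
public_register = [
    {"name": "Alice Example", "zip_code": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "Bob Example", "zip_code": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "Carol Example", "zip_code": "02140", "birth_year": 1972, "sex": "F"},
]


def key(row):
    """Tuple of quasi-identifier values used to join the two data sets."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(anonymous_rows, named_rows):
    """Return {name: diagnosis} for people matching exactly one record."""
    by_key = defaultdict(list)
    for row in anonymous_rows:
        by_key[key(row)].append(row["diagnosis"])
    reidentified = {}
    for person in named_rows:
        matches = by_key.get(key(person), [])
        if len(matches) == 1:  # exactly one candidate record: identity exposed
            reidentified[person["name"]] = matches[0]
    return reidentified


print(link(health_records, public_register))
```

Neither data set reveals a named person's diagnosis on its own; the exposure appears only after the two are combined, which is why the risk is attributed to data preparation rather than to the mining step.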
  −
It is recommended{{whom|date=August 2019}} to be aware of the following '''before''' data are collected:<ref name="NASCIO" />
  −
  −
It is recommended to be aware of the following before data are collected:
      
It is recommended to be aware of the following before data are collected:
   −
* The purpose of the data collection and any (known) data mining projects;
+
*The purpose of the data collection and any (known) data mining projects;
 
  −
The purpose of the data collection and any (known) data mining projects;
  −
 
  −
* How the data will be used;
  −
 
  −
How the data will be used;
  −
 
  −
* Who will be able to mine the data and use the data and their derivatives;
  −
 
  −
谁将能够挖掘数据并使用这些数据及其衍生工具;
  −
 
  −
* The status of security surrounding access to the data;
     −
数据访问的安全状态;
+
* How the data will be used;
   −
* How collected data can be updated.
+
* Who will be able to mine the data and use the data and their derivatives;
   −
如何更新收集的数据。
+
* The status of security surrounding access to the data;
   −
Data may also be modified so as to ''become'' anonymous, so that individuals may not readily be identified.<ref name="NASCIO" /> However, even "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.<ref>[http://www.securityfocus.com/brief/277 ''AOL search data identified individuals''], SecurityFocus, August 2006</ref>
+
* How collected data can be updated.
   −
Data may also be modified so as to become anonymous, so that individuals may not readily be identified. However, even "anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.
+
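Data may also be modified so as to ''become'' anonymous, so that individuals may not readily be identified. However, even "anonymized" data sets can contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.<ref>[http://www.securityfocus.com/brief/277 ''AOL search data identified individuals''], SecurityFocus, August 2006</ref>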
   −
数据也可以被修改成匿名的,这样个人就不容易被修改了确定。但是,甚至“匿名化”的数据集也可能包含足够的信息用来识别个人,就像记者能够根据一组无意中搜索历史找到几个个人一样美国在线发布。
      
   --[[用户:Zengsihang|Zengsihang]]([[用户讨论:Zengsihang|讨论]]) 【审校】将“这样个人就不容易被修改了确定”改为“这样个人就不会轻易地被识别”
 
   --[[用户:Zengsihang|Zengsihang]]([[用户讨论:Zengsihang|讨论]]) 【审校】将“就像记者能够根据一组无意中搜索历史找到几个个人一样美国在线发布”改为“就像记者能够依据‘美国在线’无意中发布的用户历史记录找到一些个人”
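One hedged way to reason about how "anonymous" a tabular data set is: count how many records share each combination of identifying attributes. The smallest such group size is often called ''k''-anonymity, a notion that the text above does not name and that is used here only as an illustrative assumption. The field names and the <code>generalize</code> coarsening rule below are hypothetical. Such a check does not cover free-text data like the search histories in the AOL case.

```python
from collections import Counter


def k_anonymity(rows, fields):
    """Smallest number of rows sharing one combination of `fields` (0 if empty)."""
    groups = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(groups.values()) if groups else 0


def generalize(row):
    """Hypothetical coarsening: keep a 3-digit ZIP prefix and the decade of birth."""
    return {
        **row,
        "zip": row["zip"][:3] + "**",
        "birth_year": row["birth_year"] // 10 * 10,
    }


# Invented example records; no real data is used.
rows = [
    {"zip": "02138", "birth_year": 1954, "sex": "F"},
    {"zip": "02139", "birth_year": 1957, "sex": "F"},
    {"zip": "02139", "birth_year": 1987, "sex": "M"},
    {"zip": "02134", "birth_year": 1981, "sex": "M"},
]
fields = ("zip", "birth_year", "sex")

k_before = k_anonymity(rows, fields)                           # every row is unique
k_after = k_anonymity([generalize(r) for r in rows], fields)   # rows now share groups
print(k_before, k_after)
```

A larger value means each record hides among more look-alikes, but it is not a guarantee: attributes outside the chosen fields can still allow identification.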
 
   −
The inadvertent revelation of [[personally identifiable information]] leading to the provider violates Fair Information Practices.  This indiscretion can cause financial, emotional, or bodily harm to the indicated individual.  In one instance of [[privacy violation]], the patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies who in turn provided the data to pharmaceutical companies.<ref>{{Cite journal|title = Big data׳s impact on privacy, security and consumer welfare|journal = Telecommunications Policy|pages = 1134–1145|volume = 38|issue = 11|doi = 10.1016/j.telpol.2014.10.002|first = Nir|last = Kshetri|year = 2014|url = http://libres.uncg.edu/ir/uncg/f/N_Kshetri_Big_2014.pdf}}</ref>
  −
  −
The inadvertent revelation of personally identifiable information leading to the provider violates Fair Information Practices.  This indiscretion can cause financial, emotional, or bodily harm to the indicated individual.  In one instance of privacy violation, the patrons of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies who in turn provided the data to pharmaceutical companies.
      +
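The inadvertent revelation of personally identifiable information that leads back to the provider violates Fair Information Practices. Such a disclosure can cause financial, emotional, or bodily harm to the individual identified. In one instance of privacy violation, customers of Walgreens filed a lawsuit against the company in 2011 for selling prescription information to data mining companies, which in turn provided the data to pharmaceutical companies.<ref>{{Cite journal|title = Big data׳s impact on privacy, security and consumer welfare|journal = Telecommunications Policy|pages = 1134–1145|volume = 38|issue = 11|doi = 10.1016/j.telpol.2014.10.002|first = Nir|last = Kshetri|year = 2014|url = http://libres.uncg.edu/ir/uncg/f/N_Kshetri_Big_2014.pdf}}</ref>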
   −
无意中泄露个人身份信息导致提供者违反了公平信息惯例。这种轻率的行为会对指定的个人造成经济、情感或身体伤害。在一起侵犯隐私的案例中,沃尔格林 Walgreens的赞助人在2011年对该公司提起诉讼,指控该公司向数据挖掘公司出售处方信息,而数据挖掘公司又将这些数据提供给制药公司。
         +
===Situation in Europe===
   −
===欧洲的情况 Situation in Europe===
+
Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.-E.U. Safe Harbor Principles, developed between 1998 and 2000, effectively exposed European users to privacy exploitation by U.S. companies. Following Edward Snowden's global surveillance disclosure, there was increased discussion of revoking this agreement, in particular because the data would be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States had failed.<ref>{{cite web |url=https://crsreports.congress.gov/product/pdf/R/R44257/7 |title=U.S.-E.U. Data Privacy: From Safe Harbor to Privacy Shield |last1=Weiss |first1=Martin A. |last2=Archick |first2=Kristin |date=19 May 2016 |department= |website= |series= |agency=Congressional Research Service |location=Washington, D.C. |page=6 |pages= |format=PDF |id=R44257 |access-date=9 April 2020 |quote=On October 6, 2015, the [[Court of Justice of the European Union|CJEU]]&nbsp;... issued a decision that invalidated Safe Harbor (effective immediately), as currently implemented. }}</ref>
 
  −
 
  −
 
  −
[[European Union|Europe]] has rather strong privacy laws, and efforts are underway to further strengthen the rights of the consumers. However, the [[International Safe Harbor Privacy Principles|U.S.-E.U. Safe Harbor Principles]], developed between 1998 and 2000, currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of [[Edward Snowden]]'s [[global surveillance disclosure]], there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the [[National Security Agency]], and attempts to reach an agreement with the United States have failed.<ref>{{cite web |url=https://crsreports.congress.gov/product/pdf/R/R44257/7 |title=U.S.-E.U. Data Privacy: From Safe Harbor to Privacy Shield |last1=Weiss |first1=Martin A. |last2=Archick |first2=Kristin |date=19 May 2016 |department= |website= |series= |agency=Congressional Research Service |location=Washington, D.C. |page=6 |pages= |format=PDF |id=R44257 |access-date=9 April 2020 |quote=On October 6, 2015, the [[Court of Justice of the European Union|CJEU]]&nbsp;... issued a decision that invalidated Safe Harbor (effective immediately), as currently implemented. }}</ref>
  −
 
  −
Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of the consumers. However, the U.S.-E.U. Safe Harbor Principles, developed between 1998 and 2000, currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement with the United States have failed.
  −
 
  −
欧洲有相当严密的隐私法,正在努力进一步加强消费者的权利。然而,1998年至2000年期间制定的《美国-欧盟安全港原则》(U.S.-E.U.Safe Harbor Principles)目前有效地使欧洲用户受到美国公司的隐私剥削。由于爱德华·斯诺登 Edward Snowden披露了全球监控信息后,关于撤销这一协议的讨论越来越多,讨论的话题主要关于把数据完全暴露给国家安全局,与美国达成协议的尝试失败这些事上。
      
   --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]]) 【审校】"目前有效地使欧洲用户受到美国公司的隐私剥削"一句改为"在当下让欧洲用户的隐私泄露给美国公司以利用”
 
===美国的情况 Situation in the United States===
+
===Situation in the United States===
 
  −
In the United States, privacy concerns have been addressed by the [[US Congress]] via the passage of regulatory controls such as the [[Health Insurance Portability and Accountability Act]] (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in ''Biotech Business Week'', "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals."<ref>Biotech Business Week Editors (June 30, 2008); ''BIOMEDICINE; HIPAA Privacy Rule Impedes Biomedical Research'', Biotech Business Week, retrieved 17 November 2009 from LexisNexis Academic</ref> This underscores the necessity for data anonymity in data aggregation and mining practices.
  −
 
  −
In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals." This underscores the necessity for data anonymity in data aggregation and mining practices.
      
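In the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding information they provide and its intended present and future uses. According to an article in ''Biotech Business Week'', "'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals." This underscores the necessity for data anonymity in data aggregation and mining practices.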
  −
  −
U.S. information privacy legislation such as HIPAA and the [[Family Educational Rights and Privacy Act]] (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.
  −
  −
U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.
      
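U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.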
   −
== 数据挖掘与著作权法 Copyright law==
+
==Copyright law==
 
        −
===欧洲 Situation in Europe===
      +
===Situation in Europe===
    +
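Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, there may be no copyright, but database rights may exist, so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, the UK government amended its copyright law in 2014 to allow content mining as a limitation and exception.<ref>[http://www.out-law.com/en/articles/2014/june/researchers-given-data-mining-right-under-new-uk-copyright-laws/ UK Researchers Given Data Mining Right Under New UK Copyright Laws.] {{webarchive |url=https://web.archive.org/web/20140609020315/http://www.out-law.com/en/articles/2014/june/researchers-given-data-mining-right-under-new-uk-copyright-laws/ |date=June 9, 2014 }} ''Out-Law.com.''  Retrieved 14 November 2014</ref> The UK was the second country in the world to do so after Japan, which introduced an exception for data mining in 2009. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.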
   −
Under [[Copyright law of the European Union|European copyright]] and [[Database Directive|database law]]s, the mining of in-copyright works (such as by [[web mining]]) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, it may be that there is no copyright{{snd}} but database rights may exist so data mining becomes subject to [[intellectual property]] owners' rights that are protected by the [[Database Directive]]. On the recommendation of the [[Hargreaves review]], the UK government amended its copyright law in 2014 to allow content mining as a [[Limitations and exceptions to copyright|limitation and exception]].<ref>[http://www.out-law.com/en/articles/2014/june/researchers-given-data-mining-right-under-new-uk-copyright-laws/ UK Researchers Given Data Mining Right Under New UK Copyright Laws.] {{webarchive |url=https://web.archive.org/web/20140609020315/http://www.out-law.com/en/articles/2014/june/researchers-given-data-mining-right-under-new-uk-copyright-laws/ |date=June 9, 2014 }} ''Out-Law.com.''  Retrieved 14 November 2014</ref> The UK was the second country in the world to do so after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the [[Information Society Directive]] (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.
+
  --[[用户:Zengsihang|Zengsihang]]([[用户讨论:Zengsihang|讨论]]) 【审校】将“英国是例外情况但是只允许给商业目的的内容挖掘”改为“英国对于内容挖掘的例外只允许非商业目的的内容挖掘”
   −
Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, it may be that there is no copyright but database rights may exist so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, the UK government amended its copyright law in 2014 to allow content mining as a limitation and exception. The UK was the second country in the world to do so after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions.
+
The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.<ref>{{cite web|title=Licences for Europe - Structured Stakeholder Dialogue 2013|url=http://ec.europa.eu/licences-for-europe-dialogue/en/content/about-site|website=European Commission|accessdate=14 November 2014}}</ref> The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.<ref>{{cite web|title=Text and Data Mining:Its importance and the need for change in Europe|url=http://libereurope.eu/news/text-and-data-mining-its-importance-and-the-need-for-change-in-europe/|website=Association of European Research Libraries|accessdate=14 November 2014}}</ref>
 
  −
根据欧洲版权法和数据库法,未经版权所有人许可而对版权作品进行挖掘(如通过网络挖掘)是不合法的。在欧洲,如果数据库是纯数据,可能没有版权,但数据库权利可能存在,因此数据挖掘受数据库指令保护的知识产权所有者的权利约束。《哈格里夫斯评论》(Hargreaves review)指出,这使得英国政府在2014年修订了版权法,允许将内容挖掘作为一种限制和例外。英国是继日本之后世界上第二个这样做的国家,日本在2009年把数据挖掘作为一个特例。然而,由于信息社会指令(2001年)的限制,英国是例外情况只允许非商业目的的内容挖掘。英国版权法也不允许合同条款和条件推翻这一规定。
  −
 
  −
  --[[用户:Zengsihang|Zengsihang]]([[用户讨论:Zengsihang|讨论]]) 【审校】将“英国是例外情况但是只允许给商业目的的内容挖掘”改为“英国对于内容挖掘的例外只允许非商业目的的内容挖掘”
  −
 
  −
The [[European Commission]] facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.<ref>{{cite web|title=Licences for Europe - Structured Stakeholder Dialogue 2013|url=http://ec.europa.eu/licences-for-europe-dialogue/en/content/about-site|website=European Commission|accessdate=14 November 2014}}</ref> The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and [[open access]] publishers to leave the stakeholder dialogue in May 2013.<ref>{{cite web|title=Text and Data Mining:Its importance and the need for change in Europe|url=http://libereurope.eu/news/text-and-data-mining-its-importance-and-the-need-for-change-in-europe/|website=Association of European Research Libraries|accessdate=14 November 2014}}</ref>
  −
 
  −
The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe. The focus on licensing, rather than limitations and exceptions, as the solution to this legal issue led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.
  −
 
  −
2013年,欧盟委员会以“欧洲许可证”为题,推动了利益相关者对文本和数据挖掘的讨论。但他们将重点放在解决这一法律问题上,如许可证而不是限制和例外,导致大学、研究人员、图书馆、民间社会团体和开放获取出版商的代表于2013年5月离开了利益相关者对话。
      
   --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]]) 【审校】"如许可证而不是限制和例外,导致大学、研究人员、图书馆、民间社会团体和开放获取出版商的代表于2013年5月离开了利益相关者对话。
 
"改为“比如如何许可它而不是如何限制它或者把它作为一个例外,这使得大学、研究人员、图书馆、民间社会团体和开放获取出版商的代表等利益相关者于2013年5月结束了讨论。”
   −
===美国 Situation in the United States===
+
===Situation in the United States===
   −
 
+
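US copyright law, and in particular its provision for fair use, upholds the legality of content mining in America and in other fair use countries such as Israel, Taiwan and South Korea. As content mining is transformative, that is, it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement, the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed, one being text and data mining.<ref>{{cite web|title=Judge grants summary judgment in favor of Google Books — a fair use victory|url=http://www.lexology.com/library/detail.aspx?g=a18c5b92-5a20-4d1d-a098-a3095046a88e|website=Lexology.com|publisher=Antonelli Law Ltd|accessdate=14 November 2014}}</ref>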
 
  −
[[Copyright law of the United States|US copyright law]], and in particular its provision for [[fair use]], upholds the legality of content mining in America, and other fair use countries such as Israel, Taiwan and South Korea. As content mining is transformative, that is it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the [[Google Book Search Settlement Agreement|Google Book settlement]] the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed - one being text and data mining.<ref>{{cite web|title=Judge grants summary judgment in favor of Google Books — a fair use victory|url=http://www.lexology.com/library/detail.aspx?g=a18c5b92-5a20-4d1d-a098-a3095046a88e|website=Lexology.com|publisher=Antonelli Law Ltd|accessdate=14 November 2014}}</ref>
  −
 
  −
US copyright law, and in particular its provision for fair use, upholds the legality of content mining in America, and other fair use countries such as Israel, Taiwan and South Korea. As content mining is transformative, that is it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed - one being text and data mining.
  −
 
  −
美国版权法,特别是其中关于合理使用的条款,支持在美国和其他合理使用国家,如以色列,台湾和韩国采矿内容的合法性。由于内容挖掘是变革性的,也就是说,它不会取代原来的工作,它被视为合法的合理使用。例如,作为谷歌图书和解协议的一部分,此案的主审法官裁定,谷歌版权图书数字化项目是合法的,部分原因在于数字化项目所展示的变革性用途——其中之一就是文本和数据挖掘。
      
   --[[用户:Zengsihang|Zengsihang]]([[用户讨论:Zengsihang|讨论]]) 【审校】将“台湾和韩国采矿内容的合法性”改为“台湾和韩国内容挖掘的合法性”
 
   −
==软件 Software==
+
==Software==
    
{{Category see also|Data mining and machine learning software}}
 
  −
===开源的数据挖掘软件 Free open-source data mining software and applications===
+
===Free open-source data mining software and applications===
 
  −
The following applications are available under free/open-source licenses. Public access to application source code is also available.
  −
 
  −
The following applications are available under free/open-source licenses. Public access to application source code is also available.
      
The following applications are available under free/open-source licenses. Public access to application source code is also available.
  −
* [[Carrot2]]: Text and search results clustering framework.  文本和搜索结果聚类框架。
+
* [[Carrot2]]: Text and search results clustering framework.
   −
* [[Chemicalize.org]]: A chemical structure miner and web search engine.  化学结构挖掘与网络搜索引擎。
+
* [[Chemicalize.org]]: A chemical structure miner and web search engine.
   −
* [[ELKI]]: A university research project with advanced [[cluster analysis]] and [[anomaly detection|outlier detection]] methods written in the [[Java (programming language)|Java]] language.  一个大学研究项目,用Java语言编写高级聚类分析和离群点检测方法。
+
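* [[ELKI]]: A university research project with advanced [[cluster analysis]] and [[anomaly detection|outlier detection]] methods written in the [[Java (programming language)|Java]] language.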
   −
* [[General Architecture for Text Engineering|GATE]]: a [[natural language processing]] and language engineering tool.  一个自然语言处理和语言工程工具。
+
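* [[General Architecture for Text Engineering|GATE]]: A [[natural language processing]] and language engineering tool.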
   −
* [[KNIME]]: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.  Konstanz Information Miner,一个用户友好的综合数据分析框架。
+
* [[KNIME]]: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
   −
* [[MOA (Massive Online Analysis)|Massive Online Analysis (MOA)]]: a real-time big data stream mining with concept drift tool in the [[Java (programming language)|Java]] programming language.  利用Java语言中的概念漂移工具进行实时大数据流挖掘。
+
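* [[MOA (Massive Online Analysis)|Massive Online Analysis (MOA)]]: A real-time big data stream mining tool with concept drift support, written in the [[Java (programming language)|Java]] programming language.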
   −
* [[Multi expression programming|MEPX]] - cross-platform tool for regression and classification problems based on a Genetic Programming variant.  基于遗传编程变量的回归和分类问题的跨平台工具。
+
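* [[Multi expression programming|MEPX]]: A cross-platform tool for regression and classification problems based on a Genetic Programming variant.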
   −
* ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results. 一种软件包,使用户能够与用任何编程语言编写的第三方机器学习包集成,跨多个计算节点并行执行分类分析,并生成分类结果的HTML报告。
+
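* ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.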
   −
* [[mlpack]]: a collection of ready-to-use machine learning algorithms written in the [[C++]] language. 一个用C++语言编写的机器学习算法的集合。
+
* [[mlpack]]: A collection of ready-to-use machine learning algorithms written in the [[C++]] language.
   −
* [[NLTK]] ([[Natural Language Toolkit]]): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the [[Python (programming language)|Python]] language. 一套用于Python语言的符号和统计自然语言处理(NLP)的库和程序。
+
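* [[NLTK]] ([[Natural Language Toolkit]]): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the [[Python (programming language)|Python]] language.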
   −
* [[OpenNN]]: Open [[neural networks]] library.  开源的神经网络库。
+
* [[OpenNN]]: Open [[neural networks]] library.
   −
* [[Orange (software)|Orange]]: A component-based data mining and [[machine learning]] software suite written in the [[Python (programming language)|Python]] language. 一个用Python语言编写的基于组件的数据挖掘和机器学习软件套件。
+
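* [[Orange (software)|Orange]]: A component-based data mining and [[machine learning]] software suite written in the [[Python (programming language)|Python]] language.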
   −
* [[R (programming language)|R]]: A [[programming language]] and software environment for [[statistics|statistical]] computing, data mining, and graphics. It is part of the [[GNU Project]]. 一种用于统计计算、数据挖掘和图形的编程语言和软件环境。它是GNU项目的一部分。
+
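* [[R (programming language)|R]]: A [[programming language]] and software environment for [[statistics|statistical]] computing, data mining, and graphics. It is part of the [[GNU Project]].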
   −
* [[scikit-learn]] is an open-source machine learning library for the Python programming language 是Python编程语言的一个开源机器学习库
+
* [[scikit-learn]]: An open-source machine learning library for the Python programming language.
 +
 +
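* [[Torch (machine learning)|Torch]]: An [[open source model|open-source]] [[deep learning]] library for the [[Lua (programming language)|Lua]] programming language and [[scientific computing]] framework with wide support for [[machine learning]] algorithms.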
   −
* [[Torch (machine learning)|Torch]]: An [[open source model|open-source]] [[deep learning]] library for the [[Lua (programming language)|Lua]] programming language and [[scientific computing]] framework with wide support for [[machine learning]] algorithms. 一个开源的深度学习lib库语言和科学计算框架,广泛支持机器学习算法。
+
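* [[UIMA]]: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.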
   −
* [[UIMA]]: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM. UIMA(非结构化信息管理体系结构)是一个用于分析非结构化内容(如文本、音频和视频)的组件框架,最初由IBM开发。
+
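* [[Weka (machine learning)|Weka]]: A suite of machine learning software applications written in the [[Java (programming language)|Java]] programming language.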
   −
* [[Weka (machine learning)|Weka]]: A suite of machine learning software applications written in the [[Java (programming language)|Java]] programming language.  用Java编程语言编写的一套机器学习软件应用程序。
     −
 
+
===Proprietary data-mining software and applications===
 
  −
===需要专有许可的数据挖掘软件和应用程序 Proprietary data-mining software and applications===
  −
 
  −
The following applications are available under proprietary licenses.
  −
 
  −
The following applications are available under proprietary licenses.
      
The following applications are available under proprietary licenses.
  −
* [[Angoss]] KnowledgeSTUDIO: data mining tool  数据挖掘工具。
+
* [[Angoss]] KnowledgeSTUDIO: Data mining tool.
   −
* [[LIONsolver]]: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.  用于数据挖掘、商业智能和建模的集成软件应用程序,实现学习和智能优化(LION)方法。
+
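* [[LIONsolver]]: An integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.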
   −
* Megaputer Intelligence: data and text mining software is called PolyAnalyst.  数据和文本挖掘软件PolyAnalyst。
+
* Megaputer Intelligence: Data and text mining software called PolyAnalyst.
   −
* [[Microsoft Analysis Services]]: data mining software provided by [[Microsoft]].  微软提供的数据挖掘软件
+
* [[Microsoft Analysis Services]]: Data mining software provided by [[Microsoft]].
   −
* [[NetOwl]]: suite of multilingual text and entity analytics products that enable data mining.  支持数据挖掘的多语言文本和实体分析产品套件。
+
* [[NetOwl]]: A suite of multilingual text and entity analytics products that enable data mining.
   −
* [[Oracle Data Mining]]: data mining software by [[Oracle Corporation]].  Oracle公司的数据挖掘软件
+
* [[Oracle Data Mining]]: Data mining software by [[Oracle Corporation]].
   −
* [[PSeven]]: platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining provided by [[DATADVANCE]].  DATADVANCE为工程仿真分析、多学科优化和数据挖掘提供自动化平台。
+
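* [[PSeven]]: A platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining provided by [[DATADVANCE]].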
   −
* [[Qlucore]] Omics Explorer: data mining software.  数据挖掘软件。
+
* [[Qlucore]] Omics Explorer: Data mining software.
   −
* [[RapidMiner]]: An environment for [[machine learning]] and data mining experiments. <!-- Latest version is NOT opensource -->  一个用于机器学习和数据挖掘实验的环境。
+
* [[RapidMiner]]: An environment for [[machine learning]] and data mining experiments.
   −
* [[SAS (software)#Components|SAS Enterprise Miner]]: data mining software provided by the [[SAS Institute]].  SAS机构提供的数据挖掘软件。
+
* [[SAS (software)#Components|SAS Enterprise Miner]]: Data mining software provided by the [[SAS Institute]].
   −
* [[SPSS Modeler]]: data mining software provided by [[IBM]].  IBM提供的数据挖掘软件。
+
* [[SPSS Modeler]]: Data mining software provided by [[IBM]].
   −
* [[STATISTICA]] Data Miner: data mining software provided by [[StatSoft]].  StatSoft提供的数据挖掘软件。
+
* [[STATISTICA]] Data Miner: Data mining software provided by [[StatSoft]].
   −
* [[Tanagra (machine learning)|Tanagra]]: Visualisation-oriented data mining software, also for teaching.  面向可视化的数据挖掘软件,也用于教学。
+
* [[Tanagra (machine learning)|Tanagra]]: Visualisation-oriented data mining software, also for teaching.
   −
* [[Vertica]]: data mining software provided by [[Hewlett-Packard]].  惠普提供的数据挖掘软件。
+
* [[Vertica]]: Data mining software provided by [[Hewlett-Packard]].
   −
==扩展链接 See also==
+
==See also==
   −
; Methods
+
; Methods
 
  −
方法
      
{{columns-list|colwidth=22em|
 
{{columns-list|colwidth=22em|
   −
* [[Agent mining]]  主体挖掘
+
* [[Agent mining]]
   −
* [[Anomaly detection|Anomaly/outlier/change detection]]  异常/异常/变化检测
+
* [[Anomaly detection|Anomaly/outlier/change detection]]
   −
* [[Association rule learning]]  关联规则学习
+
* [[Association rule learning]]
   −
* [[Bayesian network]]s  贝叶斯网络
+
* [[Bayesian network]]s
   −
* [[Statistical classification|Classification]]  分类
+
* [[Statistical classification|Classification]]
   −
* [[Cluster analysis]]  聚类分析
+
* [[Cluster analysis]]
   −
* [[Decision tree]]s  决策树
+
* [[Decision tree]]s
   −
* [[Ensemble learning]]  集成学习
+
* [[Ensemble learning]]
   −
* [[Factor analysis]]  因子分析
+
* [[Factor analysis]]
   −
* [[Genetic algorithms]]  遗传算法
+
* [[Genetic algorithms]]
   −
* [[Intention mining]]  意向玩具
+
* [[Intention mining]]
   −
* [[Learning classifier system]]  学习分类器系统
+
* [[Learning classifier system]]
   −
* [[Multilinear subspace learning]]  多线性子空间学习
+
* [[Multilinear subspace learning]]
   −
* [[Artificial neural network|Neural network]]s  神经网络
+
* [[Artificial neural network|Neural network]]s
   −
* [[Regression analysis]]  回归分析
+
* [[Regression analysis]]
   −
* [[Sequence mining]]  序列挖掘
+
* [[Sequence mining]]
   −
* [[Structured data analysis (statistics)|Structured data analysis]]  结构数据学习
+
* [[Structured data analysis (statistics)|Structured data analysis]]
   −
* [[Support vector machines]]  支持向量机
+
* [[Support vector machines]]
   −
* [[Text mining]]  文本挖掘
+
* [[Text mining]]
   −
* [[Time series|Time series analysis]]  时间序列分析
+
* [[Time series|Time series analysis]]
    
}}
 
}}
   −
 
+
; Application domains
 
  −
; Application domains
  −
 
  −
应用领域
      
{{columns-list|colwidth=22em|
 
{{columns-list|colwidth=22em|
   −
* [[Analytics]] 分析
+
* [[Analytics]]
   −
* [[Behavior informatics]]  行为信息学
+
* [[Behavior informatics]]
   −
* [[Big Data|Big data]]  大数据
+
* [[Big Data|Big data]]
   −
* [[Bioinformatics]]  生物信息学
+
* [[Bioinformatics]]
   −
* [[Business intelligence]]  商务智能
+
* [[Business intelligence]]
   −
* [[Data analysis]]  数据分析
+
* [[Data analysis]]
   −
* [[Data warehouse]]  数据仓库
+
* [[Data warehouse]]
   −
* [[Decision support system]]  决策支持系统
+
* [[Decision support system]]
   −
* [[Domain driven data mining]]  域驱动的数据挖掘
+
* [[Domain driven data mining]]
   −
* [[Drug discovery]]  药物发现
+
* [[Drug discovery]]
   −
* [[Exploratory data analysis]]  探索性数据分析
+
* [[Exploratory data analysis]]
   −
* [[Predictive analytics]]  预测分析
+
* [[Predictive analytics]]
   −
* [[Web mining]]  网页挖掘
+
* [[Web mining]]
    
}}
 
}}
      −
 
+
; Application examples
  −
 
  −
应用示例
      
{{Main|Examples of data mining}}
 
{{Main|Examples of data mining}}
{{columns-list|colwidth=22em|=
 
{{columns-list|colwidth=22em|=
   −
*[[Automatic number plate recognition in the United Kingdom#Data mining|Automatic number plate recognition in the United Kingdom]]  英国的自动车牌识别
+
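* [[Automatic number plate recognition in the United Kingdom#Data mining|Automatic number plate recognition in the United Kingdom]]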
   −
*[[Customer analytics#Data mining|Customer analytics]]  客户分析
+
* [[Customer analytics#Data mining|Customer analytics]]
   −
*[[Educational data mining]]  教育数据挖掘
+
* [[Educational data mining]]
   −
*[[National Security Agency#Transaction data mining|National Security Agency]]  国家安全局
+
* [[National Security Agency#Transaction data mining|National Security Agency]]
   −
*[[Quantitative structure–activity relationship#Data mining approach|Quantitative structure–activity relationship]]  数量结构-活动关系
+
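* [[Quantitative structure–activity relationship#Data mining approach|Quantitative structure–activity relationship]]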
   −
*[[Surveillance#Data mining and profiling|Surveillance]] / [[Mass surveillance#Data mining|Mass surveillance]] (e.g., [[Stellar Wind (code name)|Stellar Wind]])
+
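* [[Surveillance#Data mining and profiling|Surveillance]] / [[Mass surveillance#Data mining|Mass surveillance]] (e.g., [[Stellar Wind (code name)|Stellar Wind]])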
监控/大规模监测(例如,恒星风)
   
}}
 
}}
   −
 
+
; Related topics
 
  −
; Related topics
  −
 
  −
相关话题
  −
 
  −
For more information about extracting information out of data (as opposed to ''analyzing'' data) , see:
  −
 
  −
For more information about extracting information out of data (as opposed to analyzing data) , see:
      
For more information about extracting information out of data (as opposed to ''analyzing'' data), see:
{{columns-list|colwidth=22em|
 
{{columns-list|colwidth=22em|
   −
* [[Data integration]]  数据集成
+
* [[Data integration]]
   −
* [[Data transformation]]  数据转换
+
* [[Data transformation]]
   −
* [[Electronic discovery]]  电子发现
+
* [[Electronic discovery]]
   −
* [[Information extraction]]  信息提取
+
* [[Information extraction]]
   −
* [[Information integration]]  信息集成
+
* [[Information integration]]
   −
* [[Named-entity recognition]]  命名实体识别
+
* [[Named-entity recognition]]
   −
* [[Profiling (information science)]]  分析(信息科学)
+
* [[Profiling (information science)]]
   −
* [[Psychometrics]]  心理测量学
+
* [[Psychometrics]]
   −
* [[Social media mining]]  社交媒体挖掘
+
* [[Social media mining]]
   −
* [[Surveillance capitalism]]  资本监视
+
* [[Surveillance capitalism]]
   −
* [[Web scraping]]  网页抓取
+
* [[Web scraping]]
    
}}
 
}}
   −
;Other resources
+
; Other resources
 
  −
Other resources
  −
 
  −
其他资源
     −
*[[International Journal of Data Warehousing and Mining]]  国际数据仓库与挖掘杂志
+
* [[International Journal of Data Warehousing and Mining]]
   −
==参考文献 References==
+
==References==
    
{{Reflist|30em}}
 
{{Reflist|30em}}
   −
==进一步阅读 Further reading==
+
==Further reading==
    
{{div col|colwidth=30em}}
 
{{div col|colwidth=30em}}