Line 3: |
Line 3: |
| |description=数据科学,数据挖掘,形式科学 | | |description=数据科学,数据挖掘,形式科学 |
| }} | | }} |
− | Data mining is the process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.<ref name="acm">{{cite web |url=http://www.kdd.org/curriculum/index.html |title=Data Mining Curriculum |publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2014-01-27 }}</ref><ref name="brittanica">{{cite web |last=Clifton |first=Christopher |title=Encyclopædia Britannica: Definition of Data Mining |year=2010 |url=http://www.britannica.com/EBchecked/topic/1056150/data-mining |accessdate=2010-12-09 }}</ref><ref name="elements">{{cite web|last1=Hastie|first1=Trevor|authorlink1=Trevor Hastie|last2=Tibshirani|first2=Robert|authorlink2=Robert Tibshirani|last3=Friedman|first3=Jerome|authorlink3=Jerome H. Friedman|title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction|year=2009|url=http://www-stat.stanford.edu/~tibs/ElemStatLearn/|accessdate=2012-08-07|archive-url=https://web.archive.org/web/20091110212529/http://www-stat.stanford.edu/~tibs/ElemStatLearn/|archive-date=2009-11-10|url-status=dead}}</ref><ref>{{cite book|last1=Han, Kamber, Pei|first1=Jaiwei, Micheline, Jian|title=Data Mining: Concepts and Techniques|date=June 9, 2011|publisher=Morgan Kaufmann|isbn=978-0-12-381479-1|edition=3rd}}</ref> Data mining is the analysis step of the “'''knowledge discovery in databases (KDD)'''” process. Besides the traditional analysis step, it also involves database and data management aspects, including “data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.” | + | Data mining is the process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.<ref name="acm">{{cite web |url=http://www.kdd.org/curriculum/index.html |title=Data Mining Curriculum |publisher=Association for Computing Machinery| SIGKDD |date=2006-04-30 |accessdate=2014-01-27 }}</ref><ref name="brittanica">{{cite web |last=Clifton |first=Christopher |title=Encyclopædia Britannica: Definition of Data Mining |year=2010 |url=http://www.britannica.com/EBchecked/topic/1056150/data-mining |accessdate=2010-12-09 }}</ref><ref name="elements">{{cite web|last1=Hastie|first1=Trevor|last2=Tibshirani|first2=Robert|last3=Friedman|first3=Jerome|title=The Elements of Statistical Learning: Data Mining, Inference, and Prediction|year=2009|url=http://www-stat.stanford.edu/~tibs/ElemStatLearn/|accessdate=2012-08-07|archive-url=https://web.archive.org/web/20091110212529/http://www-stat.stanford.edu/~tibs/ElemStatLearn/|archive-date=2009-11-10|url-status=dead}}</ref><ref>{{cite book|last1=Han, Kamber, Pei|first1=Jaiwei, Micheline, Jian|title=Data Mining: Concepts and Techniques|date=June 9, 2011|publisher=Morgan Kaufmann|isbn=978-0-12-381479-1|edition=3rd}}</ref> Data mining is the analysis step of the “'''knowledge discovery in databases (KDD)'''” process. Besides the traditional analysis step, it also involves database and data management aspects, including “data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.” |
| | | |
| The term “data mining” is something of a misnomer, because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself.<ref name="han-kamber">{{cite book|title=Data mining: concepts and techniques|last1=Han|first1=Jiawei|last2=Kamber|first2=Micheline|date=2001|publisher=Morgan Kaufmann|isbn=978-1-55860-489-6|page=5|quote=Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long|authorlink1=Jiawei Han}}</ref> “It is a buzzword frequently applied to any form of large-scale data or information processing (collection, extraction, storage, analysis, and statistics), as well as to any application of a '''<font color="#ff8000">computer decision support system (DSS)</font>''', including artificial intelligence (e.g., machine learning) and business intelligence.”<ref>[http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf OKAIRP 2005 Fall Conference, Arizona State University] {{Webarchive|url=https://web.archive.org/web/20140201170452/http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf|date=2014-02-01}}</ref> The book ''Data Mining: Practical Machine Learning Tools and Techniques with Java''<ref name="witten">{{cite book|title=Data Mining: Practical Machine Learning Tools and Techniques|last1=Witten|first1=Ian H.|last2=Frank|first2=Eibe|last3=Hall|first3=Mark A.|date=30 January 2011|publisher=Elsevier|isbn=978-0-12-374856-0|edition=3|authorlink1=Ian H. Witten}}</ref> (which covers mostly machine learning material) was originally to be named ''Practical Machine Learning''; the term “data mining” was added only to improve sales.<ref>{{Cite journal|author1=Bouckaert, Remco R.|author2=Frank, Eibe|author3=Hall, Mark A.|author4=Holmes, Geoffrey|author5=Pfahringer, Bernhard|author6=Reutemann, Peter|author7=Witten, Ian H.|authorlink7=Ian H. Witten|year=2010|title=WEKA Experiences with a Java open-source project|journal=Journal of Machine Learning Research|volume=11|pages=2533–2541|quote=the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons.|postscript={{inconsistent citations}}}}</ref> Often, more general terms such as (large-scale) data analysis, or the names of the actual methods such as artificial intelligence and machine learning, are more appropriate. | | The term “data mining” is something of a misnomer, because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself.<ref name="han-kamber">{{cite book|title=Data mining: concepts and techniques|last1=Han|first1=Jiawei|last2=Kamber|first2=Micheline|date=2001|publisher=Morgan Kaufmann|isbn=978-1-55860-489-6|page=5|quote=Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long|authorlink1=Jiawei Han}}</ref> “It is a buzzword frequently applied to any form of large-scale data or information processing (collection, extraction, storage, analysis, and statistics), as well as to any application of a '''<font color="#ff8000">computer decision support system (DSS)</font>''', including artificial intelligence (e.g., machine learning) and business intelligence.”<ref>[http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf OKAIRP 2005 Fall Conference, Arizona State University] {{Webarchive|url=https://web.archive.org/web/20140201170452/http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf|date=2014-02-01}}</ref> The book ''Data Mining: Practical Machine Learning Tools and Techniques with Java''<ref name="witten">{{cite book|title=Data Mining: Practical Machine Learning Tools and Techniques|last1=Witten|first1=Ian H.|last2=Frank|first2=Eibe|last3=Hall|first3=Mark A.|date=30 January 2011|publisher=Elsevier|isbn=978-0-12-374856-0|edition=3|authorlink1=Ian H. Witten}}</ref> (which covers mostly machine learning material) was originally to be named ''Practical Machine Learning''; the term “data mining” was added only to improve sales.<ref>{{Cite journal|author1=Bouckaert, Remco R.|author2=Frank, Eibe|author3=Hall, Mark A.|author4=Holmes, Geoffrey|author5=Pfahringer, Bernhard|author6=Reutemann, Peter|author7=Witten, Ian H.|authorlink7=Ian H. Witten|year=2010|title=WEKA Experiences with a Java open-source project|journal=Journal of Machine Learning Research|volume=11|pages=2533–2541|quote=the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons.|postscript={{inconsistent citations}}}}</ref> Often, more general terms such as (large-scale) data analysis, or the names of the actual methods such as artificial intelligence and machine learning, are more appropriate. |
Line 50: |
Line 50: |
| #Results validation. | | #Results validation. |
| | | |
− | Polls conducted in 2002, 2004, 2007, and 2014 show that CRISP-DM is the methodology most widely used by data miners; the only other data mining standard named in these polls was SEMMA<ref>[[Gregory Piatetsky-Shapiro]] (2002) [http://www.kdnuggets.com/polls/2002/methodology.htm ''KDnuggets Methodology Poll''], [[Gregory Piatetsky-Shapiro]] (2004) [http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm ''KDnuggets Methodology Poll''], [[Gregory Piatetsky-Shapiro]] (2007) [http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm ''KDnuggets Methodology Poll''], [[Gregory Piatetsky-Shapiro]] (2014) [http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html ''KDnuggets Methodology Poll'']</ref>. However, three to four times as many people reported using CRISP-DM. Several research teams have published reviews of data mining process models; for example, Azevedo and Santos compared the CRISP-DM and SEMMA process standards in 2008.<ref name="AzevedoSantos">Azevedo, A. and Santos, M. F. [http://www.iadis.net/dl/final_uploads/200812P033.pdf KDD, SEMMA and CRISP-DM: a parallel overview] {{webarchive|url=https://web.archive.org/web/20130109114939/http://www.iadis.net/dl/final_uploads/200812P033.pdf |date=2013-01-09 }}. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.</ref> | + | Polls conducted in 2002, 2004, 2007, and 2014 show that CRISP-DM is the methodology most widely used by data miners; the only other data mining standard named in these polls was SEMMA<ref>Gregory Piatetsky-Shapiro (2002) [http://www.kdnuggets.com/polls/2002/methodology.htm ''KDnuggets Methodology Poll''], Gregory Piatetsky-Shapiro (2004) [http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm ''KDnuggets Methodology Poll''], [[Gregory Piatetsky-Shapiro]] (2007) [http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm ''KDnuggets Methodology Poll''], Gregory Piatetsky-Shapiro(2014) [http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html ''KDnuggets Methodology Poll'']</ref>. However, three to four times as many people reported using CRISP-DM. Several research teams have published reviews of data mining process models; for example, Azevedo and Santos compared the CRISP-DM and SEMMA process standards in 2008.<ref name="AzevedoSantos">Azevedo, A. and Santos, M. F. [http://www.iadis.net/dl/final_uploads/200812P033.pdf KDD, SEMMA and CRISP-DM: a parallel overview] {{webarchive|url=https://web.archive.org/web/20130109114939/http://www.iadis.net/dl/final_uploads/200812P033.pdf |date=2013-01-09 }}. In Proceedings of the IADIS European Conference on Data Mining 2008, pp 182–185.</ref> |
| | | |
| ===Pre-processing=== | | ===Pre-processing=== |
Line 76: |
Line 76: |
| ==Research== | | ==Research== |
| | | |
− | The premier professional body in the field is the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) of the Association for Computing Machinery (ACM).<ref>{{cite web|url=http://academic.research.microsoft.com/?SearchDomain=2&SubDomain=7&entitytype=2|title=Microsoft Academic Search: Top conferences in data mining | publisher=[[Microsoft Academic Search]]}}</ref><ref>{{cite web|url=https://scholar.google.de/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis|title=Google Scholar: Top publications - Data Mining & Analysis|publisher=[[Google Scholar]]}}</ref> Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings<ref>[http://www.kdd.org/conferences.php Proceedings] {{Webarchive|url=https://web.archive.org/web/20100430120252/http://www.kdd.org/conferences.php |date=2010-04-30 }}, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.</ref>, and since 1999 it has also published a biannual academic journal titled “SIGKDD Explorations”<ref>[http://www.kdd.org/explorations/about.php SIGKDD Explorations], ACM, New York.</ref>. | + | The premier professional body in the field is the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) of the Association for Computing Machinery (ACM).<ref>{{cite web|url=http://academic.research.microsoft.com/?SearchDomain=2&SubDomain=7&entitytype=2|title=Microsoft Academic Search: Top conferences in data mining | publisher=Microsoft Academic Search}}</ref><ref>{{cite web|url=https://scholar.google.de/citations?view_op=top_venues&hl=en&vq=eng_datamininganalysis|title=Google Scholar: Top publications - Data Mining & Analysis|publisher=Google Scholar}}</ref> Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings<ref>[http://www.kdd.org/conferences.php Proceedings] {{Webarchive|url=https://web.archive.org/web/20100430120252/http://www.kdd.org/conferences.php |date=2010-04-30 }}, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.</ref>, and since 1999 it has also published a biannual academic journal titled “SIGKDD Explorations”<ref>[http://www.kdd.org/explorations/about.php SIGKDD Explorations], ACM, New York.</ref>. |
| | | |
| | | |
Line 107: |
Line 107: |
| | | |
| | | |
− | Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through '''<font color="#ff8000">data aggregation</font>'''.<ref name="NASCIO">[http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf ''Think Before You Dig: Privacy Implications of Data Mining & Aggregation''] {{webarchive|url=https://web.archive.org/web/20081217063043/http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf |date=2008-12-17 }}, NASCIO Research Brief, September 2004</ref> Data aggregation involves combining data (possibly from different sources) in a way that facilitates analysis (but that may also make the identification of private, individual-level data deducible or otherwise apparent). This is not caused by data mining itself; it results from preparing the data before, and for the purposes of, the analysis. The threat to individual privacy arises when the data, once compiled, allow the data miner, or anyone with access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.<ref>{{cite magazine |first=Paul |last=Ohm |title=Don't Build a Database of Ruin |magazine=Harvard Business Review |url=http://blogs.hbr.org/cs/2012/08/dont_build_a_database_of_ruin.html}}</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/12/03/iron-cagebook/ Iron Cagebook - The Logical End of Facebook's Patents], [[Counterpunch.org]], 2013.12.03</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/09/11/inside-the-tech-industrys-startup-conference/ Inside the Tech industry's Startup Conference], [[Counterpunch.org]], 2013.09.11</ref> | + | Data mining requires data preparation, which can uncover information or patterns that compromise confidentiality and privacy obligations. A common way for this to occur is through '''<font color="#ff8000">data aggregation</font>'''.<ref name="NASCIO">[http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf ''Think Before You Dig: Privacy Implications of Data Mining & Aggregation''] {{webarchive|url=https://web.archive.org/web/20081217063043/http://www.nascio.org/publications/documents/NASCIO-dataMining.pdf |date=2008-12-17 }}, NASCIO Research Brief, September 2004</ref> Data aggregation involves combining data (possibly from different sources) in a way that facilitates analysis (but that may also make the identification of private, individual-level data deducible or otherwise apparent). This is not caused by data mining itself; it results from preparing the data before, and for the purposes of, the analysis. The threat to individual privacy arises when the data, once compiled, allow the data miner, or anyone with access to the newly compiled data set, to identify specific individuals, especially when the data were originally anonymous.<ref>{{cite magazine |first=Paul |last=Ohm |title=Don't Build a Database of Ruin |magazine=Harvard Business Review |url=http://blogs.hbr.org/cs/2012/08/dont_build_a_database_of_ruin.html}}</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/12/03/iron-cagebook/ Iron Cagebook - The Logical End of Facebook's Patents], Counterpunch.org, 2013.12.03</ref><ref>Darwin Bond-Graham, [http://www.counterpunch.org/2013/09/11/inside-the-tech-industrys-startup-conference/ Inside the Tech industry's Startup Conference], Counterpunch.org, 2013.09.11</ref> |
| | | |
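| Before data are collected, it is advisable to consider the following: | | Before data are collected, it is advisable to consider the following: |
Line 366: |
Line 366: |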
| * Cabena, Peter; Hadjnian, Pablo; Stadler, Rolf; Verhees, Jaap; Zanasi, Alessandro (1997); ''Discovering Data Mining: From Concept to Implementation'', [[Prentice Hall]]. | | * Cabena, Peter; Hadjnian, Pablo; Stadler, Rolf; Verhees, Jaap; Zanasi, Alessandro (1997); ''Discovering Data Mining: From Concept to Implementation'', [[Prentice Hall]]. |
| | | |
− | * M.S. Chen, J. Han, [[Philip S. Yu|P.S. Yu]] (1996) "[http://cs.nju.edu.cn/zhouzh/zhouzh.files/course/dm/reading/reading01/chen_tkde96.pdf Data mining: an overview from a database perspective]". ''Knowledge and data Engineering, IEEE Transactions'' on 8 (6), 866–883 | + | * M.S. Chen, J. Han, Philip S. Yu (1996) "[http://cs.nju.edu.cn/zhouzh/zhouzh.files/course/dm/reading/reading01/chen_tkde96.pdf Data mining: an overview from a database perspective]". ''Knowledge and data Engineering, IEEE Transactions'' on 8 (6), 866–883 |
| | | |
| * Feldman, Ronen; Sanger, James (2007); ''The Text Mining Handbook'', [[Cambridge University Press]]. | | * Feldman, Ronen; Sanger, James (2007); ''The Text Mining Handbook'', [[Cambridge University Press]]. |
Line 372: |
Line 372: |
| * Guo, Yike; and Grossman, Robert (editors) (1999); ''High Performance Data Mining: Scaling Algorithms, Applications and Systems'', [[Kluwer Academic Publishers]] | | * Guo, Yike; and Grossman, Robert (editors) (1999); ''High Performance Data Mining: Scaling Algorithms, Applications and Systems'', [[Kluwer Academic Publishers]] |
| | | |
− | * [[Jiawei Han|Han, Jiawei]], Micheline Kamber, and Jian Pei. ''Data mining: concepts and techniques''. Morgan kaufmann, 2006. | + | * Jiawei Han, Micheline Kamber, and Jian Pei. ''Data mining: concepts and techniques''. Morgan kaufmann, 2006. |
| | | |
− | * [[Trevor Hastie|Hastie, Trevor]], [[Robert Tibshirani|Tibshirani, Robert]] and [[Jerome H. Friedman|Friedman, Jerome]] (2001); ''The Elements of Statistical Learning: Data Mining, Inference, and Prediction'', Springer. | + | * Trevor Hastie, [[Robert Tibshirani|Tibshirani, Robert]] and [[Jerome H. Friedman|Friedman, Jerome]] (2001); ''The Elements of Statistical Learning: Data Mining, Inference, and Prediction'', Springer. |
| | | |
− | * [[Bing Liu (computer scientist)|Liu, Bing]] (2007, 2011); ''Web Data Mining: Exploring Hyperlinks, Contents and Usage Data'', [[Springer Verlag|Springer]]. | + | * Bing Liu (computer scientist) (2007, 2011); ''Web Data Mining: Exploring Hyperlinks, Contents and Usage Data'', [[Springer Verlag|Springer]]. |
| | | |
− | * {{cite journal |last=Murphy |first=Chris |date=16 May 2011 |title=Is Data Mining Free Speech? |journal=[[InformationWeek]] |page=12 }} | + | * {{cite journal |last=Murphy |first=Chris |date=16 May 2011 |title=Is Data Mining Free Speech? |journal=InformationWeek |page=12 }} |
| | | |
| * Nisbet, Robert; Elder, John; Miner, Gary (2009); ''Handbook of Statistical Analysis & Data Mining Applications'', [[Academic Press]]/Elsevier. | | * Nisbet, Robert; Elder, John; Miner, Gary (2009); ''Handbook of Statistical Analysis & Data Mining Applications'', [[Academic Press]]/Elsevier. |
Line 390: |
Line 390: |
| * Weiss, Sholom M.; and Indurkhya, Nitin (1998); ''Predictive Data Mining'', [[Morgan Kaufmann]] | | * Weiss, Sholom M.; and Indurkhya, Nitin (1998); ''Predictive Data Mining'', [[Morgan Kaufmann]] |
| | | |
− | * {{cite book |author1=Witten, Ian H.|authorlink1=Ian H. Witten |author2=Frank, Eibe |author3=Hall, Mark A. |title=Data Mining: Practical Machine Learning Tools and Techniques |edition=3 |date=30 January 2011 |publisher=Elsevier |isbn=978-0-12-374856-0 }} (See also [[Weka (machine learning)|Free Weka software]]) | + | * {{cite book |author1=Witten, Ian H.|author2=Frank, Eibe |author3=Hall, Mark A. |title=Data Mining: Practical Machine Learning Tools and Techniques |edition=3 |date=30 January 2011 |publisher=Elsevier |isbn=978-0-12-374856-0 }} (See also [[Weka (machine learning)|Free Weka software]]) |
| | | |
| * Ye, Nong (2003); ''The Handbook of Data Mining'', Mahwah, NJ: Lawrence Erlbaum | | * Ye, Nong (2003); ''The Handbook of Data Mining'', Mahwah, NJ: Lawrence Erlbaum |