Line 10: |
Line 10: |
| | | |
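| --[[用户:Thingamabob|Thingamabob]] ([[用户讨论:Thingamabob|talk]]) [Review] Changed the sentence “数据预处理、模型和推理考虑、兴趣度量、复杂性考虑、发现结构的后处理、可视化和在线更新等内容。” to “数据预处理、'''建模'''和推理'''考量'''、兴趣度量、'''<font color="#32CD32">复杂性考虑、发现结构的后处理</font>'''、可视化和在线更新等内容。” (in English: "data preprocessing, '''modeling''' and inference '''considerations''', interestingness measures, complexity considerations, post-processing of discovered structures, visualization, and online updating"). | | --[[用户:Thingamabob|Thingamabob]] ([[用户讨论:Thingamabob|talk]]) [Review] Changed the sentence “数据预处理、模型和推理考虑、兴趣度量、复杂性考虑、发现结构的后处理、可视化和在线更新等内容。” to “数据预处理、'''建模'''和推理'''考量'''、兴趣度量、'''<font color="#32CD32">复杂性考虑、发现结构的后处理</font>'''、可视化和在线更新等内容。” (in English: "data preprocessing, '''modeling''' and inference '''considerations''', interestingness measures, complexity considerations, post-processing of discovered structures, visualization, and online updating"). |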
− |
| |
− |
| |
− | The term "data mining" is a [[misnomer]], because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (''mining'') of data itself. It also is a [[buzzword]]and is frequently applied to any form of large-scale data or [[information processing]] ([[Data collection|collection]], [[information extraction|extraction]], [[Data warehouse|warehousing]], analysis, and statistics) as well as any application of [[Decision support system|computer decision support system]], including [[artificial intelligence]] (e.g., machine learning) and [[business intelligence]]. The book ''Data mining: Practical machine learning tools and techniques with Java''(which covers mostly machine learning material) was originally to be named just ''Practical machine learning'', and the term ''data mining'' was only added for marketing reasons. Often the more general terms (''large scale'') ''[[data analysis]]'' and ''[[analytics]]'' – or, when referring to actual methods, ''artificial intelligence'' and ''machine learning'' – are more appropriate.
| |
− |
| |
− | The term "data mining" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.
| |
| | | |
| The label "data mining" is not '''quite''' apt, because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself.<ref name="han-kamber">{{cite book|title=Data mining: concepts and techniques|last1=Han|first1=Jiawei|last2=Kamber|first2=Micheline|date=2001|publisher=[[Morgan Kaufmann]]|isbn=978-1-55860-489-6|page=5|quote=Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long|authorlink1=Jiawei Han}}</ref> “It is a buzzword frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics), as well as to any application of a '''<font color="#ff8000">computer decision support system (DSS)</font>''', including artificial intelligence (e.g., machine learning) and business intelligence.”<ref>[http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf OKAIRP 2005 Fall Conference, Arizona State University] {{Webarchive|url=https://web.archive.org/web/20140201170452/http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf|date=2014-02-01}}</ref> The book ''Data Mining: Practical Machine Learning Tools and Techniques with Java''<ref name="witten">{{cite book|title=Data Mining: Practical Machine Learning Tools and Techniques|last1=Witten|first1=Ian H.|last2=Frank|first2=Eibe|last3=Hall|first3=Mark A.|date=30 January 2011|publisher=Elsevier|isbn=978-0-12-374856-0|edition=3|authorlink1=Ian H. Witten}}</ref> (which covers mostly machine learning material) was originally to be titled ''Practical Machine Learning''; the term "data mining" was added only to improve sales.<ref>{{Cite journal|author1=Bouckaert, Remco R.|author2=Frank, Eibe|author3=Hall, Mark A.|author4=Holmes, Geoffrey|author5=Pfahringer, Bernhard|author6=Reutemann, Peter|author7=Witten, Ian H.|authorlink7=Ian H. Witten|year=2010|title=WEKA Experiences with a Java open-source project|journal=Journal of Machine Learning Research|volume=11|pages=2533–2541|quote=the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons.|postscript={{inconsistent citations}}}}</ref> The more general term (large-scale) data analysis, or, when referring to actual methods, artificial intelligence and machine learning, is often more appropriate. | | The label "data mining" is not '''quite''' apt, because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself.<ref name="han-kamber">{{cite book|title=Data mining: concepts and techniques|last1=Han|first1=Jiawei|last2=Kamber|first2=Micheline|date=2001|publisher=[[Morgan Kaufmann]]|isbn=978-1-55860-489-6|page=5|quote=Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long|authorlink1=Jiawei Han}}</ref> “It is a buzzword frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics), as well as to any application of a '''<font color="#ff8000">computer decision support system (DSS)</font>''', including artificial intelligence (e.g., machine learning) and business intelligence.”<ref>[http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf OKAIRP 2005 Fall Conference, Arizona State University] {{Webarchive|url=https://web.archive.org/web/20140201170452/http://www.okairp.org/documents/2005%20Fall/F05_ROMEDataQualityETC.pdf|date=2014-02-01}}</ref> The book ''Data Mining: Practical Machine Learning Tools and Techniques with Java''<ref name="witten">{{cite book|title=Data Mining: Practical Machine Learning Tools and Techniques|last1=Witten|first1=Ian H.|last2=Frank|first2=Eibe|last3=Hall|first3=Mark A.|date=30 January 2011|publisher=Elsevier|isbn=978-0-12-374856-0|edition=3|authorlink1=Ian H. Witten}}</ref> (which covers mostly machine learning material) was originally to be titled ''Practical Machine Learning''; the term "data mining" was added only to improve sales.<ref>{{Cite journal|author1=Bouckaert, Remco R.|author2=Frank, Eibe|author3=Hall, Mark A.|author4=Holmes, Geoffrey|author5=Pfahringer, Bernhard|author6=Reutemann, Peter|author7=Witten, Ian H.|authorlink7=Ian H. Witten|year=2010|title=WEKA Experiences with a Java open-source project|journal=Journal of Machine Learning Research|volume=11|pages=2533–2541|quote=the original title, "Practical machine learning", was changed ... The term "data mining" was [added] primarily for marketing reasons.|postscript={{inconsistent citations}}}}</ref> The more general term (large-scale) data analysis, or, when referring to actual methods, artificial intelligence and machine learning, is often more appropriate. |
Line 27: |
Line 22: |
| | | |
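| --[[用户:Thingamabob|Thingamabob]] ([[用户讨论:Thingamabob|talk]]) [Review] Changed “而数据挖掘一词只是为了营销的原因而增加” ("the term data mining was added only for marketing reasons") to “而数据挖掘一词只是为了销量更好而增加的” ("the term data mining was added only to improve sales"). | | --[[用户:Thingamabob|Thingamabob]] ([[用户讨论:Thingamabob|talk]]) [Review] Changed “而数据挖掘一词只是为了营销的原因而增加” ("the term data mining was added only for marketing reasons") to “而数据挖掘一词只是为了销量更好而增加的” ("the term data mining was added only to improve sales"). |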
− |
| |
− | The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records ([[cluster analysis]]), unusual records ([[anomaly detection]]), and dependencies ([[association rule mining]], [[sequential pattern mining]]). This usually involves using database techniques such as [[spatial index|spatial indices]]. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and [[predictive analytics]]. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a [[decision support system]]. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps.
| |
− |
| |
− | The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps.
| |
| | | |
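| The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves database techniques such as spatial indices. These patterns can be seen as a kind of summary of the input data and may be used in further analysis, for example in machine learning and predictive analytics. For instance, the data mining step might identify multiple groups in the data, and a decision support system can then use those groups to obtain more accurate predictions. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step; they belong to the overall KDD process as additional steps. | | The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves database techniques such as spatial indices. These patterns can be seen as a kind of summary of the input data and may be used in further analysis, for example in machine learning and predictive analytics. For instance, the data mining step might identify multiple groups in the data, and a decision support system can then use those groups to obtain more accurate predictions. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step; they belong to the overall KDD process as additional steps. |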
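The "identify multiple groups in the data" step can be illustrated with a small sketch. The text names cluster analysis but no particular algorithm, so plain one-dimensional k-means is an assumed choice here, and `kmeans_1d` and the sample values are hypothetical names and data used only for illustration.

```python
import random
from statistics import mean


def kmeans_1d(values, k, iterations=20, seed=0):
    """Group 1-D values into k clusters with plain Lloyd-style k-means.

    Assumption: k-means stands in for "cluster analysis" in general;
    the source text does not prescribe an algorithm.
    """
    rng = random.Random(seed)
    # Start from k data points picked at random as initial centers.
    centers = rng.sample(values, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each value joins its nearest center.
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


if __name__ == "__main__":
    # Hypothetical records: two obvious groups around 1 and around 10.
    records = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
    centers, clusters = kmeans_1d(records, k=2)
    print(sorted(centers))
    print(sorted(sorted(c) for c in clusters))
```

The resulting groups are the kind of "summary of the input data" that a later predictive step could consume, for example by fitting one model per group.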
Line 41: |
Line 32: |
| | | |
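| '''<font color="#ff8000">Data analysis</font>''' differs from data mining in that data analysis is used to test models and hypotheses on a data set, for example analyzing the effectiveness of a marketing campaign, regardless of the amount of data; data mining, in contrast, uses machine learning and statistical models to uncover clandestine or hidden patterns in a "large" volume of data. | | '''<font color="#ff8000">Data analysis</font>''' differs from data mining in that data analysis is used to test models and hypotheses on a data set, for example analyzing the effectiveness of a marketing campaign, regardless of the amount of data; data mining, in contrast, uses machine learning and statistical models to uncover clandestine or hidden patterns in a "large" volume of data. |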
− |
| |
− | The related terms ''[[data dredging]]'', ''data fishing'', and ''data snooping'' refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
| |
− |
| |
− | The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
| |
| | | |
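| The related terms '''<font color="#ff8000">data dredging</font>''', "data fishing", and "data snooping" refer to the use of data mining methods to sample parts of a larger population data set, where the samples are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used to create new hypotheses to test against the larger data populations. | | The related terms '''<font color="#ff8000">data dredging</font>''', "data fishing", and "data snooping" refer to the use of data mining methods to sample parts of a larger population data set, where the samples are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used to create new hypotheses to test against the larger data populations. |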
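Why small samples make discovered patterns unreliable can be seen in a short simulation. This is only a sketch under stated assumptions: the target and all candidate features are independent Gaussian noise, Pearson correlation stands in for "a pattern", and `pearson` and `best_spurious_correlation` are hypothetical helper names, not part of any library.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_spurious_correlation(n_rows, n_features, seed=0):
    """Strongest |correlation| between a noise target and noise features.

    Everything here is random noise (an assumption of this sketch), so any
    strong correlation that turns up is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Searching many candidate features in a small sample tends to surface
    # a far stronger-looking "pattern" than the same search in a large one.
    print("small sample:", best_spurious_correlation(n_rows=8, n_features=200))
    print("large sample:", best_spurious_correlation(n_rows=2000, n_features=200))
```

A pattern found this way is at most a hypothesis; as the text says, it still has to be tested against the larger data population.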