| isbn = 978-1-449-35865-5}}</ref> The CRISP framework used in data mining has similar steps.
 
Data analysis can be separated into the phases described below. The phases are iterative, in that feedback from later phases may result in additional work in earlier phases.
 
Data initially obtained must be processed or organised for analysis. For instance, this may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as within a spreadsheet or statistical software.
 
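As a minimal sketch of this step, assuming pandas and hypothetical field names, raw records can be placed into rows and columns (structured data) so they can be analysed like a spreadsheet:

```python
import pandas as pd

# Hypothetical raw records as initially obtained, e.g. from a log or export.
raw_records = [
    {"date": "2020-01-05", "region": "North", "sales": 1200},
    {"date": "2020-01-06", "region": "South", "sales": 950},
    {"date": "2020-01-07", "region": "North", "sales": 1100},
]

# Organise the data into rows and columns for further analysis.
df = pd.DataFrame(raw_records)
df["date"] = pd.to_datetime(df["date"])

# With structured data in place, simple analyses become one-liners.
print(df.groupby("region")["sales"].sum())
```

Once the data are in tabular form, the same structure can be handed to statistical software or aggregated directly, as in the `groupby` above.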
===Data cleaning===
    
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning arises from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracies, assessing the overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|accessdate=26 October 2013}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.<ref name="Koomey1">[http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006]</ref> Unusual amounts above or below pre-determined thresholds may also be reviewed. The appropriate cleaning method depends on the type of data, such as phone numbers, email addresses, or employer names. Quantitative outlier-detection methods can be used to remove data that were likely entered incorrectly. Spell checkers can reduce the number of mistyped words in textual data, but it is harder to tell whether the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|accessdate=26 October 2013}}</ref>
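Two of these tasks, deduplication and reviewing unusual amounts against pre-determined thresholds, can be sketched as follows. This is a minimal illustration using pandas; the column names and threshold values are hypothetical:

```python
import pandas as pd

# Hypothetical financial records; the second row is an exact duplicate.
records = pd.DataFrame({
    "invoice_id": ["A1", "A1", "A2", "A3"],
    "amount":     [250.0, 250.0, 275.0, 99999.0],
})

# Deduplication: drop exact duplicate records.
cleaned = records.drop_duplicates()

# Flag unusual amounts above or below pre-determined thresholds for review.
suspicious = cleaned[(cleaned["amount"] < 1.0) | (cleaned["amount"] > 10000.0)]

print(len(cleaned), "records after dedup;", len(suspicious), "flagged for review")
```

Flagged records are typically reviewed by a person rather than deleted automatically, since an unusual amount may be a legitimate value rather than an entry error.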
 