Changes

2,333 bytes removed, 09:06, 27 October 2020 (Tue)
Line 56: Line 56:
===Data cleaning===
   −
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|accessdate=26 October 2013}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.<ref name="Koomey1">[http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006]</ref> Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|accessdate=26 October 2013}}</ref>
     −
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable. Unusual amounts above or below pre-determined thresholds may also be reviewed. There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.
+
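Once processed and organised, the data may turn out to be incomplete, contain duplicates, or contain errors. The need for data cleaning arises from problems in the way data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccurate data, <font color = '#32cd32'>monitoring</font> the overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|accessdate=26 October 2013}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.<ref name="Koomey1">[http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006]</ref> Unusual amounts above or below predetermined thresholds may also be reviewed. Several types of data cleaning depend on the type of data, such as phone numbers, email addresses, and employers. Quantitative methods for <font color = '#ff8000'>outlier detection</font> can be used to remove data that were likely entered incorrectly. <font color = '#ff8000'>Spell checkers for textual data</font> can be used to reduce the number of misspelled words, but it is harder to tell whether the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|accessdate=26 October 2013}}</ref>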
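
Three of the tasks named above (deduplication, threshold review, and outlier detection) can be illustrated with a minimal Python sketch. It is not taken from the cited sources: the function names, the record layout, and the choice of a median-absolute-deviation rule (with the conventional 0.6745 scaling and 3.5 cutoff) are assumptions made only for illustration.

```python
from statistics import median


def deduplicate(records, key):
    """Keep the first record seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        value = rec[key]
        if value not in seen:
            seen.add(value)
            unique.append(rec)
    return unique


def outside_thresholds(values, low, high):
    """Return the values below `low` or above `high` so they can be reviewed."""
    return [v for v in values if v < low or v > high]


def mad_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score exceeds `cutoff`.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme entry does not distort the scale.
    This rule is an illustrative assumption, not the source's method.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; nothing is flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]


# Hypothetical sample data: one duplicated record and one mistyped amount.
records = [
    {"email": "a@example.org", "amount": 10},
    {"email": "a@example.org", "amount": 10},
    {"email": "b@example.org", "amount": 12},
]
amounts = [10, 12, 11, 13, 12, 1200]

unique_records = deduplicate(records, "email")
to_review = outside_thresholds(amounts, low=5, high=100)
likely_errors = mad_outliers(amounts)
```

Flagged values are returned rather than deleted, so a reviewer can decide whether each one is a genuine entry error before it is removed.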
 
  −
Once processed and organised, the data may turn out to be incomplete, contain duplicates, or contain errors.
  −
  --[[用户:嘉树|嘉树]] ([[用户讨论:嘉树|talk]]) Added "就会发现" ("may turn out to be") to fit the meaning of the sentence
  −
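The need for data cleaning arises from problems in the way data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccurate data, <font color = '#32cd32'>monitoring</font> the overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable. Unusual amounts above or below predetermined thresholds may also be reviewed. Several types of data cleaning depend on the type of data, such as phone numbers, email addresses, and employers. Quantitative methods for <font color = '#ff8000'>outlier detection</font> can be used to remove data that were likely entered incorrectly. <font color = '#ff8000'>Spell checkers for textual data</font> can be used to reduce the number of misspelled words, but it is harder to tell whether the words themselves are correct.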
      
===Exploratory data analysis===