更改

删除8字节 、 2020年10月24日 (六) 15:42
第163行: 第163行:       −
===Data cleaning 数据整理===
+
===Data cleaning 数据筛选===
    
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|accessdate=26 October 2013}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.<ref name="Koomey1">[http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006]</ref> Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|accessdate=26 October 2013}}</ref>
 
第169行: 第169行:  
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable. Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.
 
   −
一旦经过处理和组织,就会发现数据可能不完整、包含重复或包含错误。
+
一旦经过处理和组织,就会发现数据可能不完整、重复或含有错误。
 
   --[[用户:嘉树|嘉树]]([[用户讨论:嘉树|讨论]]) 根据语义增加“就会发现”如何如何
 
由于输入和存储数据的方式的问题,因此我们需要进行数据清理。数据清理是预防和纠正这些错误的过程。常见的任务包括匹配记录、识别不准确的数据、<font color = '#32cd32'>监控</font>现有数据的整体质量、处理数据重复和分割列等。这样的数据问题也可以通过很多种分析技术来识别。例如,利用财务信息,可以将特定变量的与被所有数据认为可靠的单独公布的数字进行比较。高于或低于预先确定的阈值的异常数额可能会被核查。有几种类型的数据清理依赖于数据的类型,如电话号码,电子邮件地址,雇主等。<font color = '#ff8000'>异常值检查outlier detection</font>的定量方法可以用来去除可能的输入错误的数据。<font color = '#ff8000'>文本数据拼写检查器Textual data spell checkers</font>可以用来减少拼写错误单词的数量,但是很难判断这些单词本身是否正确。
+
由于数据输入和存储的方式存在问题,我们需要进行数据清理。数据筛选是预防和纠正这些错误的过程。常见的任务包括匹配记录、识别不准确的数据、<font color = '#32cd32'>监控</font>现有数据的整体质量、处理数据重复和分割列等。这样的数据问题也可以通过很多种分析技术来识别。例如,对于财务信息,可以将特定变量的总数与被认为可靠的单独公布的数字进行比较。高于或低于预先确定的阈值的异常数额也可以被核查。数据清理有若干类型,具体取决于数据的类型,如电话号码、电子邮件地址、雇主等。<font color = '#ff8000'>异常值检查outlier detection</font>的定量方法可以用来去除很可能输入有误的数据。<font color = '#ff8000'>文本数据拼写检查器Textual data spell checkers</font>可以用来减少拼写错误单词的数量,但是很难判断这些单词本身是否正确。
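
The cleaning tasks described above (deduplication, reviewing amounts outside pre-determined thresholds, quantitative outlier detection, and comparing totals against a separately published figure) might be sketched as follows. This is only an illustration under stated assumptions: the function names, the record layout, and the tolerance are hypothetical, and the outlier test uses a modified z-score based on the median absolute deviation with the conventional constants 0.6745 and 3.5, a technique the source text does not itself name.

```python
from statistics import median


def deduplicate(records, key):
    """Keep the first record seen for each value of `key` (hypothetical helper)."""
    seen = set()
    unique = []
    for rec in records:
        k = rec[key]
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique


def flag_outside_thresholds(values, low, high):
    """Return values below `low` or above `high` so they can be reviewed."""
    return [v for v in values if v < low or v > high]


def flag_outliers_mad(values, cutoff=3.5):
    """Flag values whose modified z-score exceeds `cutoff`.

    Assumption: the median absolute deviation (MAD) stands in for the
    unspecified "quantitative methods for outlier detection".
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]


def total_matches(values, published_total, tolerance=0.01):
    """Compare a column total with a separately published figure.

    `tolerance` is a relative margin chosen here for illustration only.
    """
    return abs(sum(values) - published_total) <= tolerance * abs(published_total)
```

Flagged values are candidates for review rather than automatic deletion: a value far from the median may be a data-entry error, but it may also be a genuine observation.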
 
  −
 
      
===Exploratory data analysis 探索性数据分析===
 