Changes

Added 915 bytes, 30 August 2020 (Sun) 15:13
No edit summary
第37行: 第37行:  
Data science process flowchart from Doing Data Science, by Schutt & O'Neil (2013)
 
Data science process flowchart from Doing Data Science, by Schutt & O'Neil (2013)
   −
数据科学处理流程图,来自《'''<font color='#ff8000'>做数据科学</font>'''》 ,Schutt & o’ neil (2013)
+
数据科学处理流程图,来自《'''<font color='#ff8000'>做数据科学 Doing Data Science</font>'''》,Schutt & O'Neil (2013)
    
Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a [[Process theory|process]] for obtaining raw data and converting it into information useful for decision-making by users. Data is collected and analyzed to answer questions, test hypotheses or disprove theories.<ref name="Judd and McClelland 1989">{{cite book
 
Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a [[Process theory|process]] for obtaining raw data and converting it into information useful for decision-making by users. Data is collected and analyzed to answer questions, test hypotheses or disprove theories.<ref name="Judd and McClelland 1989">{{cite book
第43行: 第43行:  
Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users. Data is collected and analyzed to answer questions, test hypotheses or disprove theories.<ref name="Judd and McClelland 1989">{{cite book
 
Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users. Data is collected and analyzed to answer questions, test hypotheses or disprove theories.<ref name="Judd and McClelland 1989">{{cite book
   −
分析是指将一个整体分解成独立的部分进行个别检查。数据分析是获取原始数据并将其转化为用户决策有用信息的过程。数据被收集和分析来回答问题,检验假设或推翻理论
+
分析是指将一个整体分解成独立的部分来进行个别检查。数据分析是获取原始数据并将其转化为用户决策有用信息的过程。数据被收集和分析,从而回答问题、检验假设或推翻理论。
    
| last = Judd, Charles and
 
| last = Judd, Charles and
第49行: 第49行:  
| last = Judd, Charles and
 
| last = Judd, Charles and
   −
最后贾德,查尔斯和
+
| 最后贾德,查尔斯和
    
| first = McCleland, Gary
 
| first = McCleland, Gary
第55行: 第55行:  
| first = McCleland, Gary
 
| first = McCleland, Gary
   −
首先是麦克莱兰,加里
+
| 首先是麦克莱兰,加里
    
| year = 1989
 
| year = 1989
第61行: 第61行:  
| year = 1989
 
| year = 1989
   −
1989年
+
| 1989年
    
| title = Data Analysis | publisher = Harcourt Brace Jovanovich
 
| title = Data Analysis | publisher = Harcourt Brace Jovanovich
第67行: 第67行:  
| title = Data Analysis | publisher = Harcourt Brace Jovanovich
 
| title = Data Analysis | publisher = Harcourt Brace Jovanovich
   −
数据分析 | 出版商 Harcourt Brace Jovanovich
+
| 数据分析 | 出版商 Harcourt Brace Jovanovich
    
| isbn = 0-15-516765-0| title-link = Data Analysis
 
| isbn = 0-15-516765-0| title-link = Data Analysis
第87行: 第87行:  
Statistician John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
 
Statistician John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
   −
统计学家 John Tukey 在1961年将数据分析定义为: ”分析数据的程序,解释这些程序结果的技术,规划数据收集以使其分析更容易、更精确或更准确的方法,以及所有适用于数据分析的(数学)统计的机制和结果
+
统计学家 John Tukey 在1961年将数据分析定义为“分析数据的一些过程,解释这些过程所产生结果的技术,规划数据收集以使数据分析过程更容易、更精确或更准确的方法,以及所有适用于数据分析的(数学)统计的机制和结果。”
      第95行: 第95行:  
There are several phases that can be distinguished, described below. The phases are iterative, in that feedback from later phases may result in additional work in earlier phases.<ref name="Schutt & O'Neil">{{cite book
 
There are several phases that can be distinguished, described below. The phases are iterative, in that feedback from later phases may result in additional work in earlier phases.<ref name="Schutt & O'Neil">{{cite book
   −
有几个阶段可以区分,如下所述。阶段是迭代的,因为后期阶段的反馈可能会导致前期阶段的额外工作。 舒特 & o’ neil { cite book
+
有几个阶段可以区分,如下所述。这些阶段是<font color = '#ff8000'>迭代的iterative</font>,因为后期阶段的反馈可能会导致前期阶段额外的工作。 <ref name="Schutt & O'Neil"> {{ cite book
    
| author2-last = O'Neil | author2-first= Cathy | author2-link= Cathy O'Neil
 
| author2-last = O'Neil | author2-first= Cathy | author2-link= Cathy O'Neil
第107行: 第107行:  
| author1-last = Schutt | author1-first= Rachel
 
| author1-last = Schutt | author1-first= Rachel
   −
最后一位作者: 瑞秋
+
 
    
| year = 2013
 
| year = 2013
第113行: 第113行:  
| year = 2013
 
| year = 2013
   −
2013年
      
| title = Doing Data Science | publisher = [[O'Reilly Media]]
 
| title = Doing Data Science | publisher = [[O'Reilly Media]]
第119行: 第118行:  
| title = Doing Data Science | publisher = O'Reilly Media
 
| title = Doing Data Science | publisher = O'Reilly Media
   −
数据科学 | 出版商 o’ reilly Media
+
|数据科学 | 出版商 o’ reilly Media
    
| isbn = 978-1-449-35865-5}}</ref> The [[Cross-industry standard process for data mining|CRISP framework]] used in [[data mining]] has similar steps.
 
| isbn = 978-1-449-35865-5}}</ref> The [[Cross-industry standard process for data mining|CRISP framework]] used in [[data mining]] has similar steps.
第129行: 第128行:       −
===Data requirements===
+
===Data requirements 数据要求===
    
The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis or customers (who will use the finished product of the analysis). The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained.  Data may be numerical or categorical (i.e., a text label for numbers).<ref name="Schutt & O'Neil"/>
 
The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis or customers (who will use the finished product of the analysis). The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained.  Data may be numerical or categorical (i.e., a text label for numbers).<ref name="Schutt & O'Neil"/>
第135行: 第134行:  
The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis or customers (who will use the finished product of the analysis). The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained.  Data may be numerical or categorical (i.e., a text label for numbers).
 
The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis or customers (who will use the finished product of the analysis). The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained.  Data may be numerical or categorical (i.e., a text label for numbers).
   −
这些数据作为分析的输入是必要的,分析是基于那些指导分析的人或客户(他们将使用分析的最终产品)的需求而规定的。收集数据的一般实体类型称为试验单位(例如,一个人或一群人)。关于人口的具体变量(例如,年龄和收入)可以指定和获得。数据可以是数字或范畴(例如,数字的文本标签)。
+
数据是分析所必需的输入,其具体要求根据指导分析的人或客户(即将使用分析最终产品的人)的需求而确定。收集数据所针对的一般实体类型称为试验单位(例如,一个人或一群人)。关于<font color = '#ff8000'>总体population</font>的具体变量(例如,年龄和收入)可以被指定和获得。数据可以是数值型的或分类型的(即数字的文本标签)。
         −
===Data collection===
+
===Data collection 数据收集===
    
Data are collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. The data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,  downloads from online sources, or reading documentation.<ref name="Schutt & O'Neil"/>
 
Data are collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. The data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,  downloads from online sources, or reading documentation.<ref name="Schutt & O'Neil"/>
第145行: 第144行:  
Data are collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. The data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,  downloads from online sources, or reading documentation.
 
Data are collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. The data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,  downloads from online sources, or reading documentation.
   −
数据是从各种来源收集的。需求可以由分析人员传达给数据保管人,例如组织内的信息技术人员。这些数据也可以从环境中的传感器收集,如交通摄像机、卫星、记录设备等。它也可以通过访谈,从网上资源下载,或阅读文档获得。
+
数据是从各种来源收集的。需求可以由分析人员传达给数据保管人,例如组织内的信息技术人员。这些数据也可以从环境中的传感器,如交通摄像机、卫星、记录设备等接收。它也可以通过访谈,从网上资源下载,或阅读文档而获得。
         −
===Data processing===
+
===Data processing 数据处理===
    
[[File:Relationship of data, information and intelligence.png|thumb|350px|The phases of the [[intelligence cycle]] used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.]]
 
[[File:Relationship of data, information and intelligence.png|thumb|350px|The phases of the [[intelligence cycle]] used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.]]
第155行: 第154行:  
The phases of the [[intelligence cycle used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.]]
 
The phases of the [[intelligence cycle used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.]]
   −
[[用于将原始信息转化为可操作的情报或知识的情报周期在概念上类似于数据分析的阶段]
+
[将原始信息转化为可操作的情报或知识的<font color = '#ff8000'>情报周期 intelligence cycle</font>的各个阶段,在概念上类似于数据分析中的阶段]
    
Data initially obtained must be processed or organised for analysis. For instance, these may involve placing data into rows and columns in a table format (i.e., [[data model|structured data]]) for further analysis, such as within a spreadsheet or statistical software.<ref name="Schutt & O'Neil"/>
 
Data initially obtained must be processed or organised for analysis. For instance, these may involve placing data into rows and columns in a table format (i.e., [[data model|structured data]]) for further analysis, such as within a spreadsheet or statistical software.<ref name="Schutt & O'Neil"/>
第161行: 第160行:  
Data initially obtained must be processed or organised for analysis. For instance, these may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as within a spreadsheet or statistical software.
 
Data initially obtained must be processed or organised for analysis. For instance, these may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as within a spreadsheet or statistical software.
   −
最初获得的数据必须经过处理或组织以便进行分析。例如,这可能涉及将数据以表格格式(即结构化数据)放置到行和列中,以便进一步分析,例如在电子表格或统计软件中。
+
一开始获得的数据必须经过处理或组织以便进行分析。例如,这可能涉及将数据,例如在电子表格或统计软件中,以表格格式放置到行和列中(即结构化数据)以便后续分析。
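The step of placing raw data into rows and columns can be sketched as follows. This is a minimal illustration using only the Python standard library; the records, the column names, and the `to_table` helper are hypothetical and do not come from the cited source.

```python
import csv
import io

# Hypothetical raw records with inconsistent keys, as they might arrive
# from different sources before any processing.
raw_records = [
    {"name": "Ana", "age": "34", "income": "52000"},
    {"name": "Ben", "income": "48000"},  # the "age" field is missing
    {"name": "Chen", "age": "29", "income": "61000"},
]


def to_table(records, columns):
    """Arrange records into rows with a fixed column order.

    Missing values are filled with an empty string so every row
    has the same number of columns.
    """
    return [[record.get(column, "") for column in columns] for record in records]


columns = ["name", "age", "income"]
table = to_table(raw_records, columns)

# Serialize the table as CSV text, a format that spreadsheets and
# statistical software can typically read.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(columns)
writer.writerows(table)
csv_text = buffer.getvalue()
```

Filling gaps with an empty string keeps the table rectangular; deciding what to do about those gaps is left to the data cleaning phase.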
         −
===Data cleaning===
+
===Data cleaning 数据清理===
    
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|accessdate=26 October 2013}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.<ref name="Koomey1">[http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006]</ref> Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|accessdate=26 October 2013}}</ref>
 
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation.<ref>{{cite web|title=Data Cleaning|url=http://research.microsoft.com/en-us/projects/datacleaning/|publisher=Microsoft Research|accessdate=26 October 2013}}</ref> Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.<ref name="Koomey1">[http://www.perceptualedge.com/articles/b-eye/quantitative_data.pdf Perceptual Edge-Jonathan Koomey-Best practices for understanding quantitative data-February 14, 2006]</ref> Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.<ref>{{cite journal|last=Hellerstein|first=Joseph|title=Quantitative Data Cleaning for Large Databases|journal=EECS Computer Science Division|date=27 February 2008|page=3|url=http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf|accessdate=26 October 2013}}</ref>
第171行: 第170行:  
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable. Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.
 
Once processed and organised, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable. Unusual amounts above or below pre-determined thresholds may also be reviewed.  There are several types of data cleaning that depend on the type of data such as phone numbers, email addresses, employers etc.  Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spell checkers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.
   −
一旦处理和组织,数据可能是不完整的,包含重复的,或包含错误。由于输入和存储数据的方式存在问题,因此需要进行数据清理。数据清理是预防和纠正这些错误的过程。常见的任务包括记录匹配、识别数据的不准确性、现有数据的整体质量、重复数据删除和列分割。这样的数据问题也可以通过各种分析技术来识别。例如,利用财务信息,可以将特定变量的总数与被认为可靠的单独公布的数字进行比较。也可审查超过或低于预先确定阈值的异常数额。有几种类型的数据清理依赖于数据的类型,如电话号码,电子邮件地址,雇主等。异常检测的定量数据方法可以用来去除可能输入错误的数据。文本数据拼写检查器可以用来减少拼写错误的单词数量,但是很难判断这些单词本身是否正确。
+
一旦经过处理和组织,就会发现数据可能不完整、包含重复或包含错误。
 +
  --[[用户:嘉树|嘉树]]([[用户讨论:嘉树|讨论]]) 根据语义增加“就会发现”如何如何
 +
由于数据输入和存储的方式存在问题,因此需要进行数据清理。数据清理是预防和纠正这些错误的过程。常见的任务包括匹配记录、识别不准确的数据、<font color = '#32cd32'>监控</font>现有数据的整体质量、删除重复数据和分割列等。这样的数据问题也可以通过多种分析技术来识别。例如,对于财务信息,可以将特定变量的总数与被认为可靠的单独公布的数字进行比较。高于或低于预先确定的阈值的异常数额也可能会被核查。数据清理有几种类型,取决于数据的类型,如电话号码、电子邮件地址、雇主等。<font color = '#ff8000'>异常值检测 outlier detection</font>的定量方法可以用来去除可能输入错误的数据。对于文本数据,<font color = '#ff8000'>拼写检查器 spell checkers</font>可以用来减少拼写错误单词的数量,但是很难判断这些单词本身是否正确。
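Three of the cleaning tasks named above (deduplication, reviewing amounts outside pre-determined thresholds, and comparing totals against a separately published figure) might look like the sketch below. The invoice rows, the thresholds, and all function names are assumptions made for illustration only.

```python
def deduplicate(rows):
    """Drop exact duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def flag_out_of_range(rows, column, low, high):
    """Return rows whose numeric value in `column` lies outside [low, high]."""
    return [row for row in rows if not (low <= row[column] <= high)]


def totals_match(rows, column, published_total, tolerance=0.01):
    """Compare the column total against a separately published figure."""
    return abs(sum(row[column] for row in rows) - published_total) <= tolerance


# Hypothetical financial records: (invoice id, amount).
rows = [
    ("inv-001", 120.0),
    ("inv-002", 95.5),
    ("inv-001", 120.0),    # duplicate entry
    ("inv-003", 98000.0),  # unusually large, possibly a data-entry error
]

clean = deduplicate(rows)
suspicious = flag_out_of_range(clean, column=1, low=0.0, high=10000.0)
```

Flagged rows are returned for review rather than deleted, since an unusual amount is not necessarily an incorrect one.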
         −
===Exploratory data analysis===
+
===Exploratory data analysis 探索性数据分析===
    
Once the data are cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as [[exploratory data analysis]] to begin understanding the messages contained in the data.<ref>[http://www.perceptualedge.com/articles/ie/the_right_graph.pdf Stephen Few-Perceptual Edge-Selecting the Right Graph For Your Message-September 2004]</ref><ref>[http://cll.stanford.edu/~willb/course/behrens97pm.pdf Behrens-Principles and Procedures of Exploratory Data Analysis-American Psychological Association-1997]</ref> The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. [[Descriptive statistics]], such as the average or median, may be generated to help understand the data. [[Data visualization]] may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.<ref name="Schutt & O'Neil"/>
 
Once the data are cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as [[exploratory data analysis]] to begin understanding the messages contained in the data.<ref>[http://www.perceptualedge.com/articles/ie/the_right_graph.pdf Stephen Few-Perceptual Edge-Selecting the Right Graph For Your Message-September 2004]</ref><ref>[http://cll.stanford.edu/~willb/course/behrens97pm.pdf Behrens-Principles and Procedures of Exploratory Data Analysis-American Psychological Association-1997]</ref> The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. [[Descriptive statistics]], such as the average or median, may be generated to help understand the data. [[Data visualization]] may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.<ref name="Schutt & O'Neil"/>
第181行: 第182行:  
Once the data are cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data. The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. Descriptive statistics, such as the average or median, may be generated to help understand the data. Data visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.
 
Once the data are cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data. The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. Descriptive statistics, such as the average or median, may be generated to help understand the data. Data visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.
   −
一旦数据被清理,就可以进行分析。分析师可能会运用各种被称为探索性数据分析分析的技术来开始理解包含在数据中的信息。勘探过程可能导致额外的数据清理或额外的数据请求,因此这些活动可能具有迭代性质。描述统计学,例如平均值或中位数,可以用来帮助理解数据。数据可视化还可以用于检查图形格式的数据,以获得关于数据中的消息的额外洞察力。
+
数据被清理之后就可以进行分析。分析师可能会运用各种被称为探索性数据分析的技术来着手理解数据中包含的信息。探索的过程可能导致额外的数据清理或数据请求,因此这些活动可能具有迭代性质。描述性统计量,例如平均值或中位数,可以用来帮助理解数据。数据可视化还可以用于以图形格式检查数据,以获得关于数据中信息的更多洞察。
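As a small illustration of descriptive statistics, the sketch below computes the mean and the median of a hypothetical sample with Python's standard `statistics` module. The values are invented; they are chosen so that one extreme observation pulls the mean well above the median, which is the kind of signal that may send the analyst back to data cleaning.

```python
import statistics

# Hypothetical sample containing one extreme value (95).
values = [12, 15, 11, 14, 13, 95]

summary = {
    "mean": statistics.mean(values),      # sensitive to the extreme value
    "median": statistics.median(values),  # robust to the extreme value
    "stdev": statistics.stdev(values),    # sample standard deviation
}

# A large gap between mean and median hints at skew or outliers.
gap = summary["mean"] - summary["median"]
```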
         −
===Modeling and algorithms===
+
===Modeling and algorithms 建模和算法===
    
Mathematical formulas or models called [[algorithms]] may be applied to the data to identify relationships among the variables, such as [[Correlation and dependence|correlation]] or [[causality|causation]]. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).<ref name="Judd and McClelland 1989"/>
 
Mathematical formulas or models called [[algorithms]] may be applied to the data to identify relationships among the variables, such as [[Correlation and dependence|correlation]] or [[causality|causation]]. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).<ref name="Judd and McClelland 1989"/>
第191行: 第192行:  
Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).
 
Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).
   −
数学公式或称为算法的模型可应用于数据,以识别变量之间的关系,如相关性或因果关系。一般来说,模型可以根据数据中的其他变量来评估数据中的某一特定变量,而一些残差取决于模型的准确性(即数据模型 + 误差)。
+
数学<font color = '#ff8000'>公式formulas</font>或被称为<font color = '#ff8000'>算法algorithms</font>的<font color = '#ff8000'>模型models</font>可用于数据,以识别变量之间的关系,如相关性或因果关系。一般来说,可以建立模型,根据数据中的其他变量来评估数据中的某一特定变量,其残差大小取决于模型的准确性(即数据 = 模型 + 误差)。
      第199行: 第200行:  
Inferential statistics includes techniques to measure relationships between particular variables.  For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as Y&nbsp;= aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X. Analysts may attempt to build models that are descriptive of the data to simplify analysis and communicate results.
 
Inferential statistics includes techniques to measure relationships between particular variables.  For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as Y&nbsp;= aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X. Analysts may attempt to build models that are descriptive of the data to simplify analysis and communicate results.
   −
推理统计学包括测量特定变量之间关系的技术。例如,回归分析可以用来模拟广告的变化(独立变量 x)是否可以解释销售的变化(因变量 y)。用数学术语来说,y (销售)是 x (广告)的函数。它可以被描述为 y aX + b + 误差,在这种情况下,模型的设计使得 a 和 b 在模型预测给定范围的 x 的 y 时最小化误差。分析人员可能会尝试建立描述数据的模型,以简化分析和传达结果。
+
推论统计学包括测量特定变量之间关系的技术。例如,回归分析可以用来建模,以判断广告的变化(自变量 X)是否可以解释销售的变化(因变量 Y)。用数学术语来说,Y(销售)是 X(广告)的函数。它可以被描述为 Y&nbsp;= aX + b + error,其中模型的设计使得 a 和 b 在模型对给定范围内的 X 值预测 Y 时使误差最小。分析人员可能会尝试建立描述数据的模型,以简化分析和传达结果。
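The relation Y = aX + b + error can be made concrete with ordinary least squares, one common way to choose a and b so that the squared error is minimized. The text does not say which fitting method is intended, so least squares is an assumption here, and the advertising and sales figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Hypothetical data: advertising spend (X) and sales (Y).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(advertising, sales)

# Data = Model + Error: each residual is the observed Y minus the model's Y.
residuals = [y - (a * x + b) for x, y in zip(advertising, sales)]
```

With an intercept in the model, the least-squares residuals sum to zero up to floating-point rounding, which is a quick sanity check on the fit.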
         −
===Data product===
+
===Data product 数据产品===
    
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.<ref name="Schutt & O'Neil"/>
 
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.<ref name="Schutt & O'Neil"/>
第209行: 第210行:  
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.
 
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.
   −
数据产品是一种计算机应用程序,它接收数据输入并生成输出,将它们反馈回环境中。它可能基于一个模型或算法。例如,应用程序分析有关客户购买历史的数据,并推荐客户可能喜欢的其他购买。
+
数据产品是一种计算机应用程序,它接收数据输入并产生输出,将它们反馈回到环境中。它可能基于一个模型或算法。一个例子是一种可以分析客户购买历史并推荐客户可能喜欢的其他物品的应用程序。
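The purchase-recommendation example can be sketched as a toy data product. The scoring rule below (weight each unowned item by how many purchases the other customer shares with the target customer) is one simple assumption among many possible designs, and the customers, items, and `recommend` function are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase history: customer -> set of purchased items.
purchase_history = {
    "alice": {"tea", "kettle", "mug"},
    "bob": {"tea", "mug"},
    "carol": {"coffee", "mug", "grinder"},
}


def recommend(customer, history, top_n=2):
    """Suggest items bought by customers with overlapping purchases."""
    owned = history[customer]
    scores = Counter()
    for other, items in history.items():
        if other == customer:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue  # no shared purchases, so no evidence of similar taste
        for item in items - owned:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties resolve predictably.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


suggestions = recommend("bob", purchase_history)
```

The input is purchase data and the output is a list that can be shown back to the customer, which matches the description of a data product feeding results back into the environment.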
         −
===Communication===
+
===Communication 交流===
    
[[File:Social Network Analysis Visualization.png|thumb|250px|[[Data visualization]] to understand the results of a data analysis.<ref>{{Cite journal | volume = 10| issue = 3| last = Grandjean| first = Martin| title = La connaissance est un réseau| journal =Les Cahiers du Numérique| date = 2014| pages = 37–54| url = http://www.martingrandjean.ch/wp-content/uploads/2015/02/Grandjean-2014-Connaissance-reseau.pdf| doi=10.3166/lcn.10.3.37-54}}</ref>]]
 
[[File:Social Network Analysis Visualization.png|thumb|250px|[[Data visualization]] to understand the results of a data analysis.<ref>{{Cite journal | volume = 10| issue = 3| last = Grandjean| first = Martin| title = La connaissance est un réseau| journal =Les Cahiers du Numérique| date = 2014| pages = 37–54| url = http://www.martingrandjean.ch/wp-content/uploads/2015/02/Grandjean-2014-Connaissance-reseau.pdf| doi=10.3166/lcn.10.3.37-54}}</ref>]]
第219行: 第220行:  
[[Data visualization to understand the results of a data analysis.]]
 
[[Data visualization to understand the results of a data analysis.]]
   −
[了解数据分析结果的数据可视化]
+
[用于了解数据分析结果的数据可视化]
      第229行: 第230行:  
Once the data are analyzed, it may be reported in many formats to the users of the analysis to support their requirements. The users may have feedback, which results in additional analysis. As such, much of the analytical cycle is iterative.
 
Once the data are analyzed, it may be reported in many formats to the users of the analysis to support their requirements. The users may have feedback, which results in additional analysis. As such, much of the analytical cycle is iterative.
   −
一旦数据被分析,它可能会以多种格式报告给分析的用户,以支持他们的需求。用户可能会得到反馈,从而导致额外的分析。因此,大部分的分析周期是迭代的。
+
数据被分析后可以用多种格式报告给分析的用户,以支持他们的需求。这些用户可能会有一些反馈,从而需要进行额外的分析。因此,大部分的分析周期是迭代的。
      第237行: 第238行:  
When determining how to communicate the results, the analyst may consider data visualization techniques to help clearly and efficiently communicate the message to the audience. Data visualization uses information displays (such as tables and charts) to help communicate key messages contained in the data. Tables are helpful to a user who might look up specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.
 
When determining how to communicate the results, the analyst may consider data visualization techniques to help clearly and efficiently communicate the message to the audience. Data visualization uses information displays (such as tables and charts) to help communicate key messages contained in the data. Tables are helpful to a user who might look up specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.
   −
在决定如何传达结果的时候,分析师可能会考虑'''<font color='#ff8000'>数据可视化</font>'''技术来帮助清晰有效地向听众传达信息。数据可视化使用信息显示(如表格和图表)来帮助传递包含在数据中的关键信息。表格对查找特定数字的用户很有帮助,而图表(例如柱状图或折线图)可以帮助解释数据中包含的定量信息。
+
在决定如何传达结果的时候,分析师可能会考虑'''<font color='#ff8000'>数据可视化</font>'''技术来帮助清晰有效地向听众传达信息。数据可视化使用信息显示(如表格和图表)来帮助传递数据中的关键信息。表格对查找特定数字的用户很有帮助,而图表(例如柱状图或折线图)可以帮助解释数据中的定量信息。
         −
==Quantitative messages==
+
==Quantitative messages 定量信息==
    
{{Main|Data visualization}}
 
{{Main|Data visualization}}
第249行: 第250行:  
A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.
 
A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.
   −
一个时间序列图表展示了美国联邦政府开支和收入随时间的变化趋势。
+
随时间变化的美国联邦政府收支变化趋势时间序列折线图
    
[[File:U.S. Phillips Curve 2000 to 2013.png|thumb|right|250px|A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.]]
 
[[File:U.S. Phillips Curve 2000 to 2013.png|thumb|right|250px|A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.]]
第255行: 第256行:  
A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.
 
A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.
   −
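A scatterplot illustrates the correlation between two variables (inflation and unemployment) at points in time.
+
Scatterplot of the correlation between two variables (inflation and unemployment) at points in time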
    
Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message. Customers specifying requirements and analysts performing the data analysis may consider these messages during the course of the process.
 
Line 261: Line 262:
Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message. Customers specifying requirements and analysts performing the data analysis may consider these messages during the course of the process.
 
   −
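Stephen Few described eight types of quantitative messages that users may try to understand or communicate from a set of data, along with the associated graphs used to help communicate them. Customers specifying requirements and analysts performing the data analysis may consider these messages during the process.
+
Stephen Few described eight types of quantitative messages that users may try to understand or communicate from a set of data, along with the associated graphs used to help communicate them. Customers specifying requirements and analysts performing the data analysis may consider these messages during the analysis.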
    
#Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A [[line chart]] may be used to demonstrate the trend.
 
Line 267: Line 268:
Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
 
   −
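Time series: A single variable is captured over a period of time, such as the unemployment rate during a 10-year period. A line chart may be used to show the trend.
+
Time series: A single variable is captured over a period of time, such as the unemployment rate over 10 years. A line chart may be used to show the trend.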
    
#Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the ''measure'') by sales persons (the ''category'', with each sales person a ''categorical subdivision'') during a single period.  A [[bar chart]] may be used to show the comparison across the sales persons.
 
Line 273: Line 274:
Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period.  A bar chart may be used to show the comparison across the sales persons.
 
   −
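Ranking: Categorical subdivisions are ranked in ascending or descending order, such as ranking sales performance during a single period by sales persons (the category, with each sales person a categorical subdivision). A bar chart may be used to show the comparison across the sales persons.
+
Ranking: Categorical subcategories are ranked in ascending or descending order, such as ranking sales performance (the measure) during a single period by sales persons (the category, with each sales person a categorical subcategory). A bar chart may be used to compare across the sales persons.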
    
#Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%).  A [[pie chart]] or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
 
Line 279: Line 281:
Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%).  A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
 
   −
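Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share held by competitors in a market.
+
Part-to-whole: Categorical subcategories are measured as the ratio of the part to the whole (i.e., a percentage of 100%). A pie chart or bar chart can be used to show the comparison of ratios, such as the market share held by competitors in a market.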
    
#Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period.  A bar chart can show comparison of the actual versus the reference amount.
 
Line 285: Line 287:
Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period.  A bar chart can show comparison of the actual versus the reference amount.
 
   −
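Deviation: Categorical subdivisions are compared against reference data, such as comparing actual and budgeted expenses for several departments of a business over a given period. A bar chart can show the comparison of the actual amount with the reference amount.
+
Deviation: Categorical subcategories are compared against reference data, such as comparing actual and budgeted expenses for several departments of a business over a given period. A bar chart can compare the difference between the actual amount and the reference amount.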
    
#Frequency distribution: Shows the number of observations of a particular variable for given interval, such as the number of years in which the stock market return is between intervals such as 0–10%, 11–20%, etc. A [[histogram]], a type of bar chart, may be used for this analysis.
 
Line 291: Line 293:
Frequency distribution: Shows the number of observations of a particular variable for given interval, such as the number of years in which the stock market return is between intervals such as 0–10%, 11–20%, etc. A histogram, a type of bar chart, may be used for this analysis.
 
   −
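Frequency distribution: Shows the number of observations of a particular variable within a given interval, such as the number of years in which the stock market return falls in intervals such as 0–10%, 11–20%, etc. A histogram, a type of bar chart, may be used for this analysis.
+
Frequency distribution: Shows the number of observations of a particular variable within a given interval, such as the number of years in which the <font color = '#ff8000'>stock market return</font> falls in intervals such as 0–10%, 11–20%, etc. A histogram, as a type of bar chart, may be used for this analysis.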
    
#Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A [[scatter plot]] is typically used for this message.
 
Line 297: Line 299:
Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
 
   −
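Correlation: A comparison between observations represented by two variables (x, y) to determine whether they tend to move in the same or opposite directions. For example, plotting unemployment (x) and inflation (y) for a sample of months. A scatter plot is typically used for this message.
+
Correlation: A comparison between observations represented by two variables (x, y) to determine whether they tend to move in the same or opposite directions. For example, a scatter plot is typically used to plot the relationship between unemployment (x) and inflation (y) over several months.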
    
#Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
 
Line 303: Line 305:
Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
 
   −
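Nominal comparison: Comparing categorical subdivisions in no particular order, such as sales volume by product code. A bar chart may be used for this comparison.
+
Comparison of nominal variables: Comparing categorical subcategories in no particular order, such as sales volume labeled by product code. A bar chart may be used to make this comparison.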
    
#Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A [[cartogram]] is a typical graphic used.<ref>[http://www.perceptualedge.com/articles/ie/the_right_graph.pdf Stephen Few-Perceptual Edge-Selecting the Right Graph for Your Message-2004]</ref><ref>[http://www.perceptualedge.com/articles/misc/Graph_Selection_Matrix.pdf Stephen Few-Perceptual Edge-Graph Selection Matrix]</ref>
 
Line 309: Line 311:
Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.
 
   −
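Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A map is a typical graphic used.
+
Geographic map or geospatial: Comparison of a variable across a map or layout, such as a state's unemployment rate or the number of persons on the various floors of a building. A map is a typical graphic used.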
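
The eight message types listed above can be read as a lookup from message to chart. The sketch below is illustrative only and is not part of the article or of Few's material: the names `CHART_FOR_MESSAGE`, `suggest_chart`, and `frequency_distribution` are hypothetical, the sample returns are invented, and the bins are half-open intervals such as [0, 10) and [10, 20) rather than the 0–10%, 11–20% ranges quoted in the list.

```python
from collections import Counter

# Message type -> chart paired with it in the list above.
CHART_FOR_MESSAGE = {
    "time-series": "line chart",
    "ranking": "bar chart",
    "part-to-whole": "pie chart or bar chart",
    "deviation": "bar chart",
    "frequency distribution": "histogram",
    "correlation": "scatter plot",
    "nominal comparison": "bar chart",
    "geographic or geospatial": "cartogram",
}


def suggest_chart(message_type: str) -> str:
    """Return the chart the list pairs with the given message type."""
    key = message_type.strip().lower()
    if key not in CHART_FOR_MESSAGE:
        raise ValueError(f"unknown message type: {message_type!r}")
    return CHART_FOR_MESSAGE[key]


def frequency_distribution(values, bin_width):
    """Count observations per half-open bin [k*w, (k+1)*w), keyed by lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical annual returns in percent, invented for illustration.
returns = [3, 7, 12, 18, 5, 25, 11, 9]

print(suggest_chart("Correlation"))
print(frequency_distribution(returns, 10))
```

The counts returned by `frequency_distribution` are the bar heights a histogram would draw; a plotting library would normally do this binning itself.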
 
       