Data science

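From 集智百科 - Complex Systems | Artificial Intelligence | Complexity Science | Complex Networks | Self-Organization
Revision as of 16:42, 7 May 2020 (Thursday) by 18621066378 (talk | contributions)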






Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.[1][2] Data science is related to data mining and big data.




Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.[4][5]








Foundations


Data science is an interdisciplinary field focused on extracting knowledge from data sets, which are typically large (see big data).[6] The field encompasses analysis, preparing data for analysis, and presenting findings to inform high-level decisions in an organization. As such, it incorporates skills from computer science, mathematics, statistics, information visualization, graphic design, and business.[7][8] Statistician Nathan Yau, drawing on Ben Fry, also links data science to human-computer interaction: users should be able to intuitively control and explore data.[9][10] In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.[11]




Relationship to statistics


Many statisticians, including Nate Silver, have argued that data science is not a new field, but rather another name for statistics.[12] Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data.[13] Vasant Dhar writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g. images) and emphasizes prediction and action.[14] Andrew Gelman of Columbia University and data scientist Vincent Granville have described statistics as a nonessential part of data science.[15][16]




Stanford professor David Donoho writes that data science is not distinguished from statistics by the size of datasets or use of computing, and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program. He describes data science as an applied field growing out of traditional statistics.[17]




Etymology




Early usage




In 1962, John Tukey described a field he called “data analysis,” which resembles modern data science.[17] Later, attendees at a 1992 statistics symposium at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of various origins and forms, combining established concepts and principles of statistics and data analysis with computing.[18][19]




The term “data science” has been traced back to 1974, when Peter Naur proposed it as an alternative name for computer science.[20] In 1996, the International Federation of Classification Societies became the first conference to specifically feature data science as a topic.[20] However, the definition was still in flux. In 1997, C.F. Jeff Wu suggested that statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate stereotypes, such as being synonymous with accounting, or limited to describing data.[21] In 1998, Chikio Hayashi argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection, and analysis.[22]




During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly large) included “knowledge discovery” and “data mining.”[23][20]




Modern usage


The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland.[24] In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name.[23] "Data science" became more widely used in the next few years: in 2002, the Committee on Data for Science and Technology launched Data Science Journal. In 2003, Columbia University launched The Journal of Data Science.[23] In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science.[25]




The professional title of “data scientist” has been attributed to DJ Patil and Jeff Hammerbacher in 2008.[26] Though it was used by the National Science Board in their 2005 report, "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century," it referred broadly to any key role in managing a digital data collection.[27]




There is still no consensus on the definition of data science and it is considered by some to be a buzzword.[28]




Careers in data science


Data science is a growing field. Glassdoor ranked data scientist as the third-best job in America for 2020, and as the best job each year from 2016 to 2019.[29] Data scientists have a median salary of $118,370 per year or $56.91 per hour.[30] Job growth in this field is also above average, with a projected increase of 16% from 2018 to 2028.[30] The largest employer of data scientists in the US is the federal government, which employs 28% of the data science workforce.[30] Other large employers of data scientists are computer system design services, research and development laboratories, and colleges and universities.[30] Typically, data scientists work full time, and some work more than 40 hours a week.[30]




Educational path


Becoming a data scientist requires a significant amount of education and experience. The first step is to earn a bachelor's degree, typically in a field related to computing or mathematics.[31][30] Coding bootcamps are also available and can be used as an alternate pre-qualification to supplement a bachelor's degree in another field.[31] Most data scientists also complete a master’s degree or a PhD in data science.[31] Once these qualifications are met, the next step is to apply for an entry-level job in the field.[31] Some data scientists may later choose to specialize in a sub-field of data science.[31]




Specializations and associated careers




  • Machine Learning Scientist: Machine learning scientists research new methods of data analysis and create algorithms.[32]


  • Data Analyst: Data analysts utilize large data sets to gather information that meets their company’s needs.[32]


  • Data Consultant: Data consultants work with businesses to determine the best usage of the information yielded from data analysis.[31]


  • Data Architect: Data architects build data solutions that are optimized for performance and design applications.[32]


  • Applications Architect: Applications architects track how applications are used throughout a business and how they interact with users and other applications.[32]




Impacts of data science


Big data is very quickly becoming a vital tool for businesses and companies of all sizes.[33] The availability and interpretation of big data has altered the business models of old industries and enabled the creation of new ones.[33] Data-driven businesses are worth $1.2 trillion collectively in 2020, an increase from $333 billion in the year 2015.[34] Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations.[34] As big data continues to have a major impact on the world, data science does as well due to the close relationship between the two.[34]




Technologies and techniques


A variety of technologies and techniques are used in data science, depending on the application.




Techniques


  • Clustering is a technique used to group data together.



  • Machine learning is a technique used to perform tasks by inferring patterns from data.
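
As a rough illustration of what grouping data together can look like in practice, the sketch below runs a plain k-means loop, one common clustering method, on a handful of 2-D points. It uses only the Python standard library. The function name `kmeans`, the toy data, and the choice of the first k points as starting centroids are assumptions made for this example and do not come from the cited sources.

```python
import math


def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters with a plain k-means loop."""
    # Assumption: start from the first k points so the result is deterministic.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical toy data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)     # cluster index assigned to each point
print(centroids)  # final cluster centers
```

The loop alternates between assigning points to the nearest center and recomputing each center, which is also a small instance of inferring a pattern (the group structure) from data rather than hard-coding it. Real projects would more likely rely on a library implementation and a less naive initialization.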




Technologies




  • Python is a programming language with simple syntax that is commonly used for data science.[35] A number of Python libraries are used in data science, including NumPy, pandas, and SciPy.


  • R is a programming language that was designed for statisticians and data mining[36] and is optimized for computation.


  • TensorFlow is a framework for creating machine learning models developed by Google.


  • PyTorch is another framework for machine learning, developed by Facebook.


  • Jupyter Notebook is an interactive web interface for Python that allows faster experimentation.


  • Tableau makes a variety of software that is used for data visualization.[37]


  • Apache Hadoop is a software framework that is used to process data over large distributed systems.




References


  1. Dhar, V. (2013). "Data science and prediction". Communications of the ACM. 56 (12): 64–73. doi:10.1145/2500499. Archived from the original on 9 November 2014. Retrieved 2 September 2015.
  2. Jeff Leek (2013-12-12). "The key word in "Data Science" is not Data, it is Science". Simply Statistics. Archived from the original on 2 January 2014. Retrieved 1 January 2014.
  3. Hayashi, Chikio (1998-01-01). "What is Data Science? Fundamental Concepts and a Heuristic Example". In Hayashi, Chikio. Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Japan. pp. 40–51. doi:10.1007/978-4-431-65950-1_3. ISBN 9784431702085. https://www.springer.com/book/9784431702085.
  4. Stewart Tansley; Kristin Michele Tolle (2009). The Fourth Paradigm: Data-intensive Scientific Discovery. Microsoft Research. ISBN 978-0-9825442-0-4. https://books.google.com/?id=oGs_AQAAIAAJ.
  5. Bell, G.; Hey, T.; Szalay, A. (2009). "COMPUTER SCIENCE: Beyond the Data Deluge". Science. 323 (5919): 1297–1298. doi:10.1126/science.1170411. ISSN 0036-8075. PMID 19265007.
  6. "About Data Science | Data Science Association". www.datascienceassn.org. Retrieved 2020-04-03.
  7. "1. Introduction: What Is Data Science? - Doing Data Science [Book]". www.oreilly.com. Retrieved 2020-04-03.
  8. "the three sexy skills of data geeks". m.e.driscoll: data utopian. Retrieved 2020-04-03.
  9. Yau, Nathan (2009-06-04). "Rise of the Data Scientist". FlowingData. Retrieved 2020-04-03.
  10. "Basic Example". benfry.com. Retrieved 2020-04-03.
  11. "ASA Statement on the Role of Statistics in Data Science". AMSTATNEWS. American Statistical Association. 2015-10-01. Archived from the original on 20 June 2019. Retrieved 2019-05-29.
  12. "Nate Silver: What I need from statisticians - Statistics Views". www.statisticsviews.com. Retrieved 2020-04-03.
  13. "What's the Difference Between Data Science and Statistics?". Priceonomics. Retrieved 2020-04-03.
  14. Dhar, Vasant (2013-12-01). "Data science and prediction". Communications of the ACM. 56 (12): 64–73. doi:10.1145/2500499.
  15. "Statistics is the least important part of data science « Statistical Modeling, Causal Inference, and Social Science". statmodeling.stat.columbia.edu. Retrieved 2020-04-03.
  16. Granville, Vincent (2014-12-08). "Data science without statistics is possible, even desirable". www.datasciencecentral.com. Retrieved 2020-04-03.
  17. Donoho, David (September 18, 2015). "50 years of Data Science" (PDF). Retrieved April 2, 2020.
  18. Escoufier, Yves; Hayashi, Chikio; Fichet, Bernard (1995). Data science and its applications = La science des données et ses applications. Tokyo: Academic Press/Harcourt Brace. ISBN 0-12-241770-4. OCLC 489990740.
  19. Murtagh, Fionn; Devlin, Keith (2018). "The Development of Data Science: Implications for Education, Employment, Research, and the Data Revolution for Sustainable Development". Big Data and Cognitive Computing. 2 (2): 14. doi:10.3390/bdcc2020014.
  20. Cao, Longbing (2017-06-29). "Data Science". ACM Computing Surveys (CSUR). 50 (3): 1–42. doi:10.1145/3076253.
  21. Wu, C.F. Jeff. "Statistics=Data Science?" (PDF). Retrieved April 2, 2020.
  22. Murtagh, Fionn; Devlin, Keith (2018). "The Development of Data Science: Implications for Education, Employment, Research, and the Data Revolution for Sustainable Development". Big Data and Cognitive Computing. 2 (2): 14. doi:10.3390/bdcc2020014.
  23. Press, Gil. "A Very Short History Of Data Science". Forbes. Retrieved 2020-04-03.
  24. Gupta, Shanti (December 11, 2015). "William S Cleveland". Retrieved April 2, 2020.
  25. Talley, Jill (June 1, 2016). "ASA Expands Scope, Outreach to Foster Growth, Collaboration in Data Science". Amstat News. American Statistical Association.
  26. Davenport, Thomas H.; Patil, D. J. (2012-10-01). "Data Scientist: The Sexiest Job of the 21st Century". Harvard Business Review. No. October 2012. ISSN 0017-8012. Retrieved 2020-04-03.
  27. "US NSF - NSB-05-40, Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century". www.nsf.gov. Retrieved 2020-04-03.
  28. Press, Gil. "Data Science: What's The Half-Life Of A Buzzword?". Forbes. Retrieved 2020-04-03.
  29. "Best Jobs in America". Glassdoor. Retrieved 2020-04-03.
  30. "Computer and Information Research Scientists : Occupational Outlook Handbook: : U.S. Bureau of Labor Statistics". www.bls.gov. Retrieved 2020-04-03.
  31. "What is a Data Scientist?". Master's in Data Science. Retrieved 2020-04-03.
  32. "11 Data Science Careers Shaping the Future". Northeastern University Graduate Programs. 2018-11-23. Retrieved 2020-04-03.
  33. Pham, Peter. "The Impacts Of Big Data That You May Not Have Heard Of". Forbes. Retrieved 2020-04-03.
  34. Martin, Sophia (2019-09-20). "How Data Science will Impact Future of Businesses?". Medium. Retrieved 2020-04-03.
  35. Shell, M Scott (September 24, 2019). "An introduction to Python for scientific computing" (PDF). Retrieved April 2, 2020.
  36. "R FAQ". cran.r-project.org. Retrieved 2020-04-03.
  37. Rhodes, Margaret (15 July 2014). "A Dead-Simple Tool That Lets Anyone Create Interactive Maps". Wired. Retrieved 2020-04-03.

Category:Information science


Category:Computer occupations


Category:Computational fields of study


Category:Data analysis



This page was moved from wikipedia:en:Data science. Its edit history can be viewed at 数据科学/edithistory