Changes

27,164 bytes added, 12:47, 9 May 2020 (Sat)
No edit summary
This entry was machine-translated by Caiyun Xiaoyi (彩云小译) and has not yet been manually edited or proofread. We apologize for any inconvenience to readers.
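
* Planned content for this entry

1. Foundations: background (basic knowledge needed to understand the field);

2. Evolution of the term's meaning (how the term arose and how its usage has varied so far);

3. Research topics in data science

3.1 Fundamental theory of data science

3.2 Data preprocessing

3.3 Data computing

3.4 Data management

4. Careers and jobs in data science;

5. Impact of data science;

6. Technologies and application software used in data science;

7. Differences between data science, artificial intelligence, and machine learning

Two blog posts for reference: https://blog.csdn.net/fengdu78/article/details/105154546  https://blog.csdn.net/dev_csdn/article/details/79127658

8. Relationship to statistics

Of these, Part 2 needs material to be collected and added, Part 7 has some reference material (more will be found later), and Part 8 can be supplemented.

Parts to be translated: '''Introduction, 1, 2, 4, 5, 6'''; parts to be supplemented: '''3, 7, 8'''

* Task assignment

'''Task 1: Introduction, 1 Background, 2 Meaning of the term, 3 Research topics'''

The '''Background''' text needs to be translated; the '''Introduction and Meaning of the term''' parts already have reference material and an initial human translation; '''Research topics''' needs source material to be found and filled in;

'''Task 2: 4 Related careers, 5 Impact of data science'''

No initial human translation exists for these parts; further material can be collected to make them more complete;

'''Task 3: 6 Related application software, 7 Differences from machine learning and artificial intelligence, 8 Relationship to statistics'''

Parts 7 and 8 need material to be collected and filled in; Part 8 already has reference material and an initial human translation;

* Notes
# Completed tasks are due before 6 p.m. on May 10
# If you have relevant reference material, please share it and send it to [[趣木木]] so that the editor can select from it when making recommendations later
# If you think another module should be added, please message [[趣木木]] privately on WeChat promptly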
  --[[用户:趣木木|趣木木]]([[用户讨论:趣木木|讨论]]) Below is the corresponding lead content from the old version, for reference; it can be consolidated or expanded
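Data science, similar to [https://en.wikipedia.org/wiki/Data_mining data mining], is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract [https://en.wikipedia.org/wiki/Knowledge knowledge] and insights from [https://en.wikipedia.org/wiki/Data data] in its various forms, both structured and unstructured.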
 +
<ref name=":0">
 +
{{Cite journal
 +
| last1 = Dhar
 +
| first1 = V.
 +
| title = Data science and prediction
 +
| doi = 10.1145/2500499
 +
| journal = Communications of the ACM
 +
| volume = 56
 +
| issue = 12
 +
| pages = 64
 +
| year = 2013
 +
| pmid = 
 +
| pmc =
 +
| url = http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext
 +
}}</ref>
 +
<ref>
 +
Jeff Leek
 +
2013-12-12.
 +
[http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/ The key word in "Data Science" is not Data, it is Science.]
 +
Simply Statistics.
 +
</ref>
 +
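The concept of data science combines statistics, data analysis, machine learning, and related methods in order to understand and analyze actual phenomena with data.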
 +
<ref name="Hayashi" />
 +
It draws on techniques and theories from many fields, including [https://en.wikipedia.org/wiki/Mathematics mathematics], [https://en.wikipedia.org/wiki/Statistics statistics], [https://en.wikipedia.org/wiki/Information_science information science], and [https://en.wikipedia.org/wiki/Computer_science computer science].
    +
[https://en.wikipedia.org/wiki/Turing_award Turing Award] winner [https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist) Jim Gray] envisioned data science as a "fourth paradigm" of science ([https://en.wikipedia.org/wiki/Empirical_research empirical], [https://en.wikipedia.org/wiki/Basic_research theoretical], computational, and now data-driven) and asserted that everything about science is changing because of the impact of information technology and the [https://en.wikipedia.org/wiki/Information_explosion data deluge].
 +
<ref name="TansleyTolle2009">
 +
{{cite book
 +
|author1=Stewart Tansley
 +
|author2=Kristin Michele Tolle
 +
|title=The Fourth Paradigm: Data-intensive Scientific Discovery
 +
|url=https://books.google.com/books?id=oGs_AQAAIAAJ
 +
|year=2009
 +
|publisher=Microsoft Research
 +
|isbn=978-0-9825442-0-4
 +
}}</ref>
 +
<ref name="BellHey2009">
 +
{{cite journal
 +
|last1=Bell
 +
|first1=G.
 +
|last2=Hey
 +
|first2=T.
 +
|last3=Szalay
 +
|first3=A.
 +
|title=COMPUTER SCIENCE: Beyond the Data Deluge
 +
|journal=Science
 +
|volume=323
 +
|issue=5919
 +
|year=2009
 +
|pages=1297–1298
 +
|issn=0036-8075
|doi=10.1126/science.1170411
 +
}}</ref>
 +
After the [https://en.wikipedia.org/wiki/Harvard_Business_Review ''Harvard Business Review''] called it "the sexiest job of the 21st century" in 2012
 +
<ref name="Harvard" />
 +
, "data science" became a [https://en.wikipedia.org/wiki/Buzzword buzzword]. It is now often used interchangeably with earlier concepts such as [https://en.wikipedia.org/wiki/Business_analytics business analytics]
 +
<ref name="GilPress" />
 +
, [https://en.wikipedia.org/wiki/Business_intelligence business intelligence], [https://en.wikipedia.org/wiki/Predictive_modelling predictive modelling], and [https://en.wikipedia.org/wiki/Statistics statistics]. Even the claim that data science is "sexy" paraphrases Dr. Hans Rosling, who said in a 2011 BBC documentary that "statistics is now the sexiest subject around." Nate Silver
 +
<ref name="NateSilver" />
 +
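has described data science as a sexed-up term for statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to attract attention, which can dilute the term's usefulness.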
 +
<ref>
 +
Warden, Pete(2011-05-09).
 +
[http://radar.oreilly.com/2011/05/data-science-terminology.html "Why the term "data science" is flawed but useful"]
 +
''O'Reilly Radar''. Retrieved 2018-05-20.
 +
</ref>
 +
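Although many university programs now offer data science degrees, there is no consensus on a definition of data science or on suitable curriculum content.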
 +
<ref name="GilPress" />
 +
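The field's credibility has suffered, however, because many data science and [https://en.wikipedia.org/wiki/Big_data big data] projects fail to deliver useful results, often as a result of poor management and poor use of resources.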
 +
<ref>
 +
[https://hbr.org/2018/01/are-you-setting-your-data-scientists-up-to-fail "Are You Setting Your Data Scientists Up to Fail?"].
 +
''Harvard Business Review''.2018-01-25. Retrieved 2018-05-26.
 +
</ref>
 +
<ref>
 +
[https://www.consultancy.uk/news/16839/70-of-big-data-projects-in-uk-fail-to-realise-full-potential "70% of Big Data projects in UK fail to realise full potential"]
 +
''www.consultancy.uk.'' Retrieved 2018-05-26.
 +
</ref>
 +
<ref>
 +
[http://analytics-magazine.org/the-data-economy-why-do-so-many-analytics-projects-fail/ "The Data Economy: Why do so many analytics projects fail? - Analytics Magazine"].
 +
''Analytics Magazine''. 2014-07-07. Retrieved 2018-05-26.
 +
</ref>
 +
<ref>
 +
[https://www.kdnuggets.com/2018/05/data-science-4-reasons-failing-deliver.html "Data Science: 4 Reasons Why Most Are Failing to Deliver"]. ''www.kdnuggets.com''. Retrieved 2018-05-26.
 +
</ref>
== Foundations ==
      
Data science is an interdisciplinary field focused on extracting knowledge from data sets, which are typically large (see [[big data]]).<ref>{{Cite web|url=http://www.datascienceassn.org/about-data-science|title=About Data Science {{!}} Data Science Association|website=www.datascienceassn.org|access-date=2020-04-03}}</ref> The field encompasses analysis, preparing data for analysis, and presenting findings to inform high-level decisions in an organization. As such, it incorporates skills from computer science, mathematics, statistics, [[information visualization]], graphic design, and business.<ref>{{Cite web|url=https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html|title=1. Introduction: What Is Data Science? - Doing Data Science [Book]|website=www.oreilly.com|language=en|access-date=2020-04-03}}</ref><ref>{{Cite web|url=https://medriscoll.com/post/4740157098/the-three-sexy-skills-of-data-geeks|title=the three sexy skills of data geeks|website=m.e.driscoll: data utopian|language=en|access-date=2020-04-03}}</ref> Statistician [[Nathan Yau]], drawing on [[Ben Fry]], also links data science to [[Human–computer interaction|human-computer interaction]]: users should be able to intuitively control and explore data.<ref>{{Cite web|url=https://flowingdata.com/2009/06/04/rise-of-the-data-scientist/|title=Rise of the Data Scientist|last=Yau|first=Nathan|date=2009-06-04|website=FlowingData|language=en|access-date=2020-04-03}}</ref><ref>{{Cite web|url=https://benfry.com/phd/dissertation/2.html|title=Basic Example|last=|first=|date=|website=benfry.com|url-status=live|archive-url=|archive-date=|access-date=2020-04-03}}</ref> In 2015, the [[American Statistical Association]] identified [[Database|database management]], statistics and [[machine learning]], and [[Distributed computing|distributed and parallel systems]] as the three emerging foundational professional communities.<ref>{{Cite web|url=https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/|title=ASA Statement on the Role of Statistics in Data Science|date=2015-10-01|website=AMSTATNEWS|publisher=[[American Statistical Association]]|access-date=2019-05-29|archive-url=https://web.archive.org/web/20190620184935/https://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/|archive-date=20 June 2019|url-status=live}}</ref>
In the 1990s, popular terms for the process of finding patterns in datasets (which were growing ever larger) included "knowledge discovery" and "data mining".
    +
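  --[[用户:趣木木|趣木木]]([[用户讨论:趣木木|讨论]]) Below is part of the old version's content on the etymology and evolution of "data science"; it can be consulted, consolidated, and expanded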
    +
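The term "data science" has appeared in various contexts over the past thirty years but did not become an established term until recently. In an early usage, it was used in 1960 by [https://en.wikipedia.org/wiki/Peter_Naur Peter Naur] as a substitute for [https://en.wikipedia.org/wiki/Computer_science computer science]. Naur later introduced the term [https://en.wikipedia.org/wiki/Datalogy "datalogy"].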
 +
<ref>
 +
{{cite journal
 +
|last1=Naur
 +
|first1=Peter
 +
|title=The science of datalogy
 +
|journal=Communications of the ACM
 +
|date=1 July 1966
 +
|volume=9
 +
|issue=7
 +
|pages=485
 +
|doi=10.1145/365719.366510
 +
}}</ref>
 +
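In 1974, Naur published ''Concise Survey of Computer Methods'', which freely used the term "data science" in its survey of the data processing methods then in wide use.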
 +
 +
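In 1996, members of the International Federation of Classification Societies (IFCS) met in Kobe, Japan, for their biennial conference. There, after it was introduced in a roundtable discussion organized by Chikio Hayashi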
 +
<ref name="Hayashi">
 +
{{Cite book
 +
|chapter-url=https://link.springer.com/chapter/10.1007/978-4-431-65950-1_3
 +
|url=https://www.springer.com/book/9784431702085
 +
|title=Data Science, Classification, and Related Methods
 +
|last=Hayashi
 +
|first=Chikio
 +
|date=1998-01-01
 +
|publisher=Springer Japan
 +
|isbn=9784431702085
 +
|editor-last=Hayashi
 +
|editor-first=Chikio
 +
|series=Studies in Classification, Data Analysis, and Knowledge Organization
 +
|location=
 +
|pages=40–51
 +
|language=en
 +
|chapter=What is Data Science? Fundamental Concepts and a Heuristic Example
 +
|doi=10.1007/978-4-431-65950-1_3
 +
|editor-last2=Yajima
 +
|editor-first2=Keiji
 +
|editor-last3=Bock
 +
|editor-first3=Hans-Hermann
 +
|editor-last4=Ohsumi
 +
|editor-first4=Noboru
 +
|editor-last5=Tanaka
 +
|editor-first5=Yutaka
 +
|editor-last6=Baba
 +
|editor-first6=Yasumasa
 +
}}</ref>
 +
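, the term "data science" was included in the title of the conference for the first time ("Data Science, Classification, and Related Methods").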
 +
<ref>
 +
Press, Gil.
 +
[https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/ "A Very Short History Of Data Science"].
 +
</ref>
 +
 +
In November 1997, on his appointment to the H. C. Carver Professorship at the [https://en.wikipedia.org/wiki/University_of_Michigan University of Michigan], C.F. Jeff Wu gave a lecture entitled "Statistics = Data Science?"
 +
<ref name="cfjwutk">
 +
Wu, C. F. J. (1997).
 +
[http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf "Statistics = Data Science?"].
 +
Retrieved 9 October 2014.
 +
</ref>
 +
as his inaugural lecture
 +
<ref name="cfjwu01">
 +
[http://ur.umich.edu/9899/Nov09_98/4.htm "Identity of statistics in science examined"]
 +
.The University Records, 9 November 1997, The University of Michigan. Retrieved 12 August 2013.
 +
</ref>
 +
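. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer-science usage of the term "data science" and advocated that statistics be renamed data science and statisticians be called data scientists.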
 +
<ref name="cfjwutk"/>
 +
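Later, in 1998, at the lectures honoring [https://en.wikipedia.org/wiki/Prasanta_Chandra_Mahalanobis P.C. Mahalanobis], the Indian scientist and statistician who founded the [https://en.wikipedia.org/wiki/Indian_Statistical_Institute Indian Statistical Institute], he delivered a lecture of the same title as the first in his lecture series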
 +
<ref name="cfjwu02">
 +
[http://www.isical.ac.in/~statmath/html/pcm/pcm_recent.html "P.C. Mahalanobis Memorial Lectures, 7th series"].
 +
P.C. Mahalanobis Memorial Lectures, Indian Statistical Institute.
 +
Archived from [https://web.archive.org/web/20131029191813/ the original]
 +
on 26 Feb 2017. Retrieved 18 Jul 2017.
 +
</ref>
 +
.
 +
 +
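In 2001, William S. Cleveland introduced data science as an independent discipline in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics", extending the field of statistics to incorporate "advances in computing with data". The article was published in Volume 69, No. 1, of the April 2001 edition of the ''International Statistical Review / Revue Internationale de Statistique''.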
 +
<ref name="cleveland01">
 +
Cleveland, W. S. (2001).
 +
[https://pdfs.semanticscholar.org/915c/d8e2b39eb02723553913d592b2237d4d9960.pdf Data science: an action plan for expanding the technical areas of the field of statistics].
 +
International Statistical Review / Revue Internationale de Statistique, 21–26.
 +
</ref>
 +
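In his report, Cleveland set out six technical areas that he believed encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.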
 +
 +
In April 2002, the International Council for Science (ICSU): Committee on Data for Science and Technology (CODATA)
 +
<ref name="ics12">
 +
International Council for Science : Committee on Data for Science and Technology. (2012, April).
 +
CODATA, The Committee on Data for Science and Technology. Retrieved from International Council for Science : Committee on Data for Science and Technology: http://www.codata.org/
 +
</ref>
 +
started the ''Data Science Journal''
 +
<ref name="dsj12">
 +
Data Science Journal. (2012, April).
 +
Available Volumes.
 +
Retrieved from Japan Science and Technology Information Aggregator, Electronic: http://www.jstage.jst.go.jp/browse/dsj/_vols
 +
</ref>
 +
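, a publication focused on issues such as the description of data systems, their publication on the internet, applications, and legal issues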
 +
<ref name="dsj02">
 +
Data Science Journal. (2002, April).
 +
Contents of Volume 1, Issue 1, April 2002.
 +
Retrieved from Japan Science and Technology Information Aggregator,
 +
Electronic: http://www.jstage.jst.go.jp/browse/dsj/1/0/_contents
 +
</ref>
 +
. Shortly thereafter, in January 2003, Columbia University began publishing ''The Journal of Data Science''
 +
<ref name="jds03">
 +
The Journal of Data Science. (2003, January).
 +
Contents of Volume 1, Issue 1, January 2003.
 +
Retrieved from http://www.jds-online.com/v1-1
 +
</ref>
 +
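, which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research. In 2005, the National Science Board published "Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century", defining data scientists as "information and computer scientists, database and software programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others who are crucial to the successful management of a digital data collection", whose primary activity is to "conduct creative inquiry and analysis."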
 +
<ref>
 +
National Science Board.
 +
[http://www.nsf.gov/pubs/2005/nsb0540/ Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century]
 +
. National Science Foundation
 +
. Retrieved 30 June 2013.
 +
</ref>
 +
 +
Around 2007,
 +
<ref>
 +
Citation needed
 +
</ref>
 +
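Turing Award winner [https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist) Jim Gray] envisioned "data-driven science" as a "fourth paradigm" of science that uses the computational analysis of large data as its primary scientific method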
 +
<ref name="TansleyTolle2009" />
 +
<ref name="BellHey2009" />
 +
, and foresaw a world in which all scientific literature and all scientific data are online and interoperate with each other.
 +
<ref>
 +
Markoff,John(2009-12-14).
 +
[https://www.nytimes.com/2009/12/15/science/15books.html "Essays Inspired by Microsoft’s Jim Gray, Who Saw Science Paradigm Shift"].                      ''The New York Times''. Retrieved 2018-04-26.
 +
</ref>
 +
 +
In the 2012 [https://en.wikipedia.org/wiki/Harvard_Business_Review ''Harvard Business Review''] article "Data Scientist: The Sexiest Job of the 21st Century"
 +
<ref name="Harvard">
 +
{{Cite journal
 +
|url=https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
 +
|title=Data Scientist: The Sexiest Job of the 21st Century
 +
|publisher=Harvard Business Review
 +
|first=Thomas H.
 +
|last=Davenport
 +
|first2=DJ
 +
|last2=Patil
 +
|date=Oct 2012
 +
}}</ref>
 +
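, [https://en.wikipedia.org/wiki/DJ_Patil DJ Patil] claims to have coined the term in 2008 together with [https://en.wikipedia.org/wiki/Jeff_Hammerbacher Jeff Hammerbacher] to describe their jobs at LinkedIn and Facebook. He asserts that the data scientist is an entirely new kind of professional and that a shortage of data scientists is becoming a serious constraint in some sectors, but he also describes the role as a more business-oriented one.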
 +
 +
In 2013, the IEEE Task Force on Data Science and Advanced Analytics
 +
<ref>
 +
[http://www.dsaa.co "IEEE Task Force on Data Science and Advanced Analytics"]
 +
</ref>
 +
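was launched. In the same year, the first European Conference on Data Analysis (ECDA) was held in Luxembourg, where the [http://euads.org/ European Association for Data Science] (EuADS) was established. The first international conference, the IEEE International Conference on Data Science and Advanced Analytics, was held in 2014.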
 +
<ref>
 +
[http://datamining.it.uts.edu.au/conferences/dsaa14/ "2014 IEEE International Conference on Data Science and Advanced Analytics"]
 +
</ref>
 +
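In the same year, the coding bootcamp pioneer [https://en.wikipedia.org/wiki/General_Assembly_(school) General Assembly] launched a student-paid bootcamp, and [https://en.wikipedia.org/wiki/The_Data_Incubator The Data Incubator] launched a competitive, free data science fellowship.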
 +
<ref>
 +
[https://venturebeat.com/2014/04/15/ny-gets-new-bootcamp-for-data-scientists-its-free-but-harder-to-get-into-than-harvard/ "NY gets new bootcamp for data scientists: It’s free, but harder to get into than Harvard "]. 
 +
''Venture Beat'' Retrieved 2016-02-22.
 +
</ref>
 +
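Also in 2014, the [https://en.wikipedia.org/wiki/American_Statistical_Association American Statistical Association] section on Statistical Learning and Data Mining renamed its journal "Statistical Analysis and Data Mining: The ASA Data Science Journal", and in 2016 it changed its section name to "Statistical Learning and Data Science".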
 +
<ref name="ASA">
 +
Talley,Jill(2016-06-01)
 +
[http://magazine.amstat.org/blog/2016/06/01/datascience-2/ "ASA Expands Scope, Outreach to Foster Growth, Collaboration in Data Science"]
 +
. ''AMSTATNEWS''.
 +
American Statistical Association.
 +
Retrieved 2017-02-04
 +
</ref>
 +
In 2015, Springer launched the ''International Journal on Data Science and Analytics''
 +
<ref>
 +
[https://www.springer.com/41060 "Journal on Data Science and Analytics"]
 +
</ref>
 +
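to publish original work on data science and big data analytics. In September 2015, the [http://www.gfkl.org/welcome/ GfKl] added "Data Science Society" to its name at the third ECDA conference, held at the [https://en.wikipedia.org/wiki/University_of_Essex University of Essex] in Colchester, UK.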
===Relationship to statistics===

Many statisticians, including [[Nate Silver]], have argued that data science is not a new field, but rather another name for statistics.<ref>{{Cite web|url=https://www.statisticsviews.com/details/feature/5133141/Nate-Silver-What-I-need-from-statisticians.html|title=Nate Silver: What I need from statisticians - Statistics Views|website=www.statisticsviews.com|access-date=2020-04-03}}</ref> Others argue that data science is distinct from statistics because it focuses on problems and techniques unique to digital data.<ref>{{Cite web|url=http://priceonomics.com/whats-the-difference-between-data-science-and/|title=What's the Difference Between Data Science and Statistics?|website=Priceonomics|language=en|access-date=2020-04-03}}</ref> [[Vasant Dhar]] writes that statistics emphasizes quantitative data and description. In contrast, data science deals with quantitative and qualitative data (e.g. images) and emphasizes prediction and action.<ref>{{Cite journal|last=Dhar|first=Vasant|date=2013-12-01|title=Data science and prediction|journal=Communications of the ACM|volume=56|issue=12|pages=64–73|language=EN|doi=10.1145/2500499}}</ref> [[Andrew Gelman]] of Columbia University and data scientist Vincent Granville have described statistics as a nonessential part of data science.<ref>{{Cite web|url=https://statmodeling.stat.columbia.edu/2013/11/14/statistics-least-important-part-data-science/|title=Statistics is the least important part of data science « Statistical Modeling, Causal Inference, and Social Science|website=statmodeling.stat.columbia.edu|access-date=2020-04-03}}</ref><ref>{{Cite web|url=https://www.datasciencecentral.com/profiles/blogs/data-science-without-statistics-is-possible-even-desirable|title=Data science without statistics is possible, even desirable|last=Granville|first=Vincent|date=2014-12-08|website=www.datasciencecentral.com|language=en|access-date=2020-04-03}}</ref>
 +
 +
 +
 +
 +
 +
Stanford professor [[David Donoho]] writes that data science is not distinguished from statistics by the size of datasets or use of computing, and that many graduate programs misleadingly advertise their analytics and statistics training as the essence of a data science program. He describes data science as an applied field growing out of traditional statistics.<ref name=":7" />
 +
 +
 +
  --[[用户:趣木木|趣木木]]([[用户讨论:趣木木|讨论]]) Below is the corresponding content from the old version of the entry; it can be consolidated, consulted, and expanded
 +
The rapid growth in job openings shows how quickly the concept of "data science" has become popular in business and academia.
 +
<ref>
 +
Darrow,Barb(May 21, 2015).
 +
[http://fortune.com/2015/05/21/data-science-white-hot/ "Data science is still white hot, but nothing lasts forever"]
 +
.''Fortune.'' Retrieved November 20, 2017.
 +
</ref>
 +
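However, many critical academics and journalists see no distinction between data science and [https://en.wikipedia.org/wiki/Statistics statistics]. Writing in [https://en.wikipedia.org/wiki/Forbes ''Forbes''], Gil Press argues that data science is a [https://en.wikipedia.org/wiki/Buzzword buzzword] without a clear definition that has simply replaced "[https://en.wikipedia.org/wiki/Business_analytics business analytics]" in contexts such as graduate degree programs.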
 +
<ref name="GilPress">
 +
[https://www.forbes.com/sites/gilpress/2013/08/19/data-science-whats-the-half-life-of-a-buzzword/ Data Science: What's The Half-Life Of A Buzzword?].
 +
Forbes.2013-08-19.
 +
</ref>
 +
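In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the [https://en.wikipedia.org/wiki/American_Statistical_Association American Statistical Association], the noted applied statistician [https://en.wikipedia.org/wiki/Nate_Silver Nate Silver] said, "I think data-scientist is a sexed up term for a statistician... Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn't berate the term statistician."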
 +
<ref name="NateSilver">
 +
[http://www.statisticsviews.com/details/feature/5133141/Nate-Silver-What-I-need-from-statisticians.html "Nate Silver: What I need from statisticians"]. 23 Aug 2013
 +
</ref>
 +
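Similarly, in the business sector, multiple researchers and analysts state that data scientists alone are far from sufficient to give companies a real competitive advantage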
 +
<ref>
 +
{{Cite journal
 +
|last=Miller
 +
|first=Steven
 +
|date=2014-04-10
 +
|title=Collaborative Approaches Needed to Close the Big Data Skills Gap
 +
|url=http://www.jorgdesign.net/article/view/9823
 +
|journal=Journal of Organization Design
 +
|language=en
 +
|volume=3
 +
|issue=1
 +
|pages=26–30
 +
|doi=10.7146/jod.9823
|issn=2245-408X
 +
}}</ref>
 +
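and consider data scientists to be only one of the four broader job families that companies need in order to leverage big data effectively, namely: data analysts, data scientists, big data [https://en.wikipedia.org/wiki/Software_Developer developers], and big data [https://en.wikipedia.org/wiki/Software_engineer engineers].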
 +
<ref>
 +
{{Cite journal
 +
|last=De Mauro
 +
|first=Andrea
 +
|last2=Greco
 +
|first2=Marco
 +
|last3=Grimaldi
 +
|first3=Michele
 +
|last4=Ritala
 +
|first4=Paavo
 +
|title=Human resources for Big Data professions: A systematic classification of job roles and required skill sets
 +
|url=http://linkinghub.elsevier.com/retrieve/pii/S0306457317300018
 +
|journal=Information Processing & Management
 +
|doi=10.1016/j.ipm.2017.05.004
 +
}}</ref>
    +
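On the other hand, responses to the criticism are just as numerous. In a 2014 [https://en.wikipedia.org/wiki/The_Wall_Street_Journal ''Wall Street Journal''] article, Irving Wladawsky-Berger compares the enthusiasm for data science with the dawn of [https://en.wikipedia.org/wiki/Computer_science computer science]. He argues that data science, like any other [https://en.wikipedia.org/wiki/Interdisciplinarity interdisciplinary] field, draws on [https://en.wikipedia.org/wiki/Methodology methodologies] and practices from across [https://en.wikipedia.org/wiki/Academy academia] and [https://en.wikipedia.org/wiki/Industry industry], but then turns them into a new [https://en.wikipedia.org/wiki/Discipline_(academia) discipline]. He draws particular attention to the sharp criticism that computer science, now a well-respected academic discipline, once had to face.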
 +
<ref name=":1">
 +
Wladawsky-Berger,Irving (May 2, 2014).
 +
[https://blogs.wsj.com/cio/2014/05/02/why-do-we-need-data-science-when-weve-had-statistics-for-centuries/ "Why Do We Need Data Science When We’ve Had Statistics for Centuries?"].
 +
''The Wall Street Journal''. Retrieved November 20, 2017.
 +
</ref>
 +
Likewise, as do many other academic proponents of data science,
 +
<ref name=":1" />
 +
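Vasant Dhar of the [https://en.wikipedia.org/wiki/New_York_University New York University] [https://en.wikipedia.org/wiki/NYU_Stern_Center_for_Business_and_Human_Rights Stern School of Business] argued more specifically in December 2013 that data science differs from the existing practice of data analysis, which focuses only on explaining [https://en.wikipedia.org/wiki/Data_set data sets] across all [https://en.wikipedia.org/wiki/Discipline_(academia) disciplines]. Data science seeks actionable and consistent [https://en.wikipedia.org/wiki/Pattern_recognition patterns] for [https://en.wikipedia.org/wiki/Predictive_modelling predictive modelling].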
 +
<ref name=":0" />
 +
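This practical engineering goal takes data science beyond traditional [https://en.wikipedia.org/wiki/Analytics analytics]. Data in disciplines and [https://en.wikipedia.org/wiki/Applied_science applied fields] that have lacked solid [https://en.wikipedia.org/wiki/Theory theories], such as [https://en.wikipedia.org/wiki/Health_science health science] and [https://en.wikipedia.org/wiki/Social_science social science], can now be sought out and used to build powerful predictive models.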
 +
<ref name=":0" />
    +
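In September 2015, in an effort similar to Dhar's, Stanford professor [https://en.wikipedia.org/wiki/David_Donoho David Donoho] took the argument further by rejecting three simplistic and misleading definitions of data science put forward by critics.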
 +
<ref name=":2">
 +
{{Cite journal
 +
|last=Donoho
 +
|first=David
 +
|date=September 2015
 +
|title=50 Years of Data Science
 +
|url=http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
 +
|journal=Based on a talk at Tukey Centennial workshop, Princeton NJ Sept 18 2015
 +
|volume=
 +
|pages=
 +
|via=
 +
}}</ref>
 +
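First, for Donoho, data science does not equate to [https://en.wikipedia.org/wiki/Big_data big data], because the size of the data set is not a criterion that distinguishes data science from statistics.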
 +
<ref name=":2" />
 +
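Second, data science is not defined by the [https://en.wikipedia.org/wiki/Computing computing] skills needed to sort and organize large data sets, because these skills are already widely used for analysis across all disciplines.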
 +
<ref name=":2" />
 +
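Third, data science is a heavily applied field for which [https://en.wikipedia.org/wiki/Graduate_school academic programs] do not yet sufficiently prepare data scientists for their future jobs, because many [https://en.wikipedia.org/wiki/Graduate_school graduate programs] misleadingly advertise their analytics and statistics training as the essence of a data science program.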
 +
<ref name=":2" />
 +
<ref>
 +
{{Cite book
 +
|title=The Culture of Big Data
 +
|last=Barlow
 +
|first=Mike
 +
|publisher=O'Reilly Media, Inc.
 +
|year=2013
 +
|isbn=
 +
|location=
 +
|pages=
 +
}}</ref>
 +
As a [https://en.wikipedia.org/wiki/Statistician statistician], [https://en.wikipedia.org/wiki/David_Donoho Donoho], following many predecessors in his field, champions broadening the scope of data science,
 +
<ref name=":2" />
 +
much as John Chambers urges statisticians to adopt an inclusive concept of learning from data
 +
<ref>
 +
{{Cite journal
 +
|last=Chambers
 +
|first=John M.
 +
|date=1993-12-01
 +
|title=Greater or lesser statistics: a choice for future research
 +
|url=https://link.springer.com/article/10.1007/BF00141776
 +
|journal=Statistics and Computing
 +
|language=en
 +
|volume=3
 +
|issue=4
 +
|pages=182–184
 +
|doi=10.1007/BF00141776
|issn=0960-3174
 +
}}</ref>
 +
and William Cleveland urges giving the extraction of applicable [https://en.wikipedia.org/wiki/Predictive_modelling predictive tools] from data priority over the discovery of [https://en.wikipedia.org/wiki/Explanatory_model explanatory theories].
 +
<ref name="cleveland01" />
 +
Together, these [https://en.wikipedia.org/wiki/Statistician statisticians] envision an increasingly inclusive applied field that grows out of traditional [https://en.wikipedia.org/wiki/Statistics statistics] and goes beyond it.
    +
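For the future of data science, Donoho projects an ever-growing environment for [https://en.wikipedia.org/wiki/Open_science open science] in which the data sets used for [https://en.wikipedia.org/wiki/Academic_publishing academic publications] are accessible to all researchers.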
 +
<ref name=":2" />
 +
The [https://en.wikipedia.org/wiki/National_Institutes_of_Health US National Institutes of Health] has already announced plans to enhance the reproducibility and transparency of research data.
 +
<ref>
 +
{{Cite journal
 +
|last=Collins
 +
|first=Francis S.
 +
|last2=Tabak
 +
|first2=Lawrence A.
 +
|date=2014-01-30
 +
|title=NIH plans to enhance reproducibility
 +
|journal=Nature
 +
|volume=505
 +
|issue=7485
 +
|pages=612–613
 +
|issn=0028-0836
|pmc=4058759
|pmid=24482835
|doi=10.1038/505612a
 +
}}</ref>
 +
Other major [https://en.wikipedia.org/wiki/Academic_journal journals] are following suit.
 +
<ref>
 +
{{Cite journal
 +
|last=McNutt
 +
|first=Marcia
 +
|date=2014-01-17
 +
|title=Reproducibility
 +
|url=http://science.sciencemag.org/content/343/6168/229
 +
|journal=Science
 +
|language=en
 +
|volume=343
 +
|issue=6168
 +
|pages=229–229
 +
|doi=10.1126/science.1250475
|issn=0036-8075
|pmid=24436391
 +
}}</ref>
 +
<ref>
 +
{{Cite journal
 +
|last=Peng
 +
|first=Roger D.
 +
|date=2009-07-01
 +
|title=Reproducible research and Biostatistics
 +
|url=https://academic.oup.com/biostatistics/article/10/3/405/293660
 +
|journal=Biostatistics
 +
|language=en
 +
|volume=10
 +
|issue=3
 +
|pages=405–408
 +
|doi=10.1093/biostatistics/kxp014
|issn=1465-4644
 +
}}</ref>
 +
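In this way, the future of data science not only goes beyond the boundary of [https://en.wikipedia.org/wiki/Statistical_theory statistical theory] in scale and methodology, but will also revolutionize current academia and [https://en.wikipedia.org/wiki/Paradigm research paradigms].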
 +
<ref name=":2" />
 +
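As Donoho concludes, "the scope and impact of data science will continue to expand enormously in coming decades as scientific data and data about science itself become ubiquitously available."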
 +
<ref name=":2" />
     