Changes

Big data (view source)

Revision as of 18:35, 6 March 2022 (Sunday)

4,288 bytes removed, 18:35, 6 March 2022 (Sunday)

→‎Science

Line 273: Line 273: −

===~~Science~~===

+

===Science===

−

* The [[Large Hadron Collider]] experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995%<ref>{{cite web|last1=Alexandru|first1=Dan|title=Prof|url=https://cds.cern.ch/record/1504817/files/CERN-THESIS-2013-004.pdf|website=cds.cern.ch|publisher=CERN|access-date=24 March 2015}}</ref> of these streams, there are 1,000 collisions of interest per second.<ref>{{cite web |title=LHC Brochure, English version. A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public. |url=http://cds.cern.ch/record/1278169?ln=en |work=CERN-Brochure-2010-006-Eng. LHC Brochure, English version. |publisher=CERN |access-date=20 January 2013}}</ref><ref>{{cite web |title=LHC Guide, English version. A collection of facts and figures about the Large Hadron Collider (LHC) in the form of questions and answers. |url=http://cds.cern.ch/record/1092437?ln=en |work=CERN-Brochure-2008-001-Eng. LHC Guide, English version. |publisher=CERN |access-date=20 January 2013}}</ref><ref name="nature">{{cite news |title=High-energy physics: Down the petabyte highway |work= Nature |date= 19 January 2011 |first=Geoff |last=Brumfiel |doi= 10.1038/469282a |volume= 469 |pages= 282–83 |url= http://www.nature.com/news/2011/110119/full/469282a.html |bibcode=2011Natur.469..282B }}</ref>

−

** As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication ({{as of|2012|lc=y}}). This becomes nearly 200 petabytes after replication.

−

** If all sensor data were recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 [[exabyte]]s per day, before replication. To put the number in perspective, this is equivalent to 500 [[quintillion]] (5×10<sup>20</sup>) bytes per day, almost 200 times more than all the other sources combined in the world.

−

* The [[Square Kilometre Array]] is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day.<ref>{{cite web|url= http://www.zurich.ibm.com/pdf/astron/CeBIT+2013+Background+DOME.pdf|title=IBM Research – Zurich| website=Zurich.ibm.com|access-date=8 October 2017}}</ref><ref>{{cite web|url =https://arstechnica.com/science/2012/04/future-telescope-array-drives-development-of-exabyte-processing/|title=Future telescope array drives development of Exabyte processing|work=Ars Technica |date=2 April 2012|access-date=15 April 2015}}</ref> It is considered one of the most ambitious scientific projects ever undertaken.<ref>{{cite web|url=http://theconversation.com/australias-bid-for-the-square-kilometre-array-an-insiders-perspective-4891|title=Australia's bid for the Square Kilometre Array – an insider's perspective|date=1 February 2012|publisher=[[The Conversation (website)|The Conversation]]|access-date=27 September 2016}}</ref>

−

* When the [[Sloan Digital Sky Survey]] (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information.<ref name="Economist">{{cite news |title=Data, data everywhere |url=http://www.economist.com/node/15557443 |newspaper=The Economist |date=25 February 2010 |access-date=9 December 2012}}</ref> When the [[Large Synoptic Survey Telescope]], successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days.{{r|Economist}}

−

*[[Human Genome Project|Decoding the human genome]] originally took 10 years to process; now it can be achieved in less than a day. The DNA sequencers have divided the sequencing cost by 10,000 in the last ten years, which is 100 times cheaper than the reduction in cost predicted by [[Moore's law]].<ref>{{cite web|url=http://www.oecd.org/sti/ieconomy/Session_3_Delort.pdf#page=6|title=Delort P., OECD ICCP Technology Foresight Forum, 2012.|website=Oecd.org|access-date=8 October 2017}}</ref>

−

* The [[NASA]] Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.<ref>{{cite web|url=http://www.nasa.gov/centers/goddard/news/releases/2010/10-051.html|title=NASA – NASA Goddard Introduces the NASA Center for Climate Simulation|website=Nasa.gov|access-date=13 April 2016}}</ref><ref>{{cite web|last=Webster |first=Phil|title=Supercomputing the Climate: NASA's Big Data Mission| url=http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission |work=CSC World|publisher=Computer Sciences Corporation|access-date=18 January 2013|url-status=dead| archive-url =https://web.archive.org/web/20130104220150/http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission|archive-date=4 January 2013}}</ref>

−

* Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any "friction points", or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to scale social experiments that would usually take years, instantly.<ref>{{cite news| url=https://www.theglobeandmail.com/life/health-and-fitness/health/these-six-great-neuroscience-ideas-could-make-the-leap-from-lab-to-market/article21681731/|title=These six great neuroscience ideas could make the leap from lab to market|date=20 November 2014|work=[[The Globe and Mail]]|access-date=1 October 2016}}</ref><ref>{{cite web|url=https://cloud.google.com/customers/dnastack/|title=DNAstack tackles massive, complex DNA datasets with Google Genomics|publisher=Google Cloud Platform |access-date=1 October 2016}}</ref>

−

* [[23andme]]'s [[DNA database]] contains the genetic information of over 1,000,000 people worldwide.<ref>{{cite web|title=23andMe – Ancestry|url=https://www.23andme.com/en-int/ancestry/| website=23andme.com| access-date=29 December 2016}}</ref> The company explores selling the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes if patients give their consent.<ref name=verge1>{{cite web|last1=Potenza|first1=Alessandra| title=23andMe wants researchers to use its kits, in a bid to expand its collection of genetic data|url=https://www.theverge.com/2016/7/13/12166960/23andme-genetic-testing-database-genotyping-research|website=The Verge|access-date=29 December 2016|date=13 July 2016}}</ref><ref>{{cite magazine| title=This Startup Will Sequence Your DNA, So You Can Contribute To Medical Research |url= https://www.fastcompany.com/3066775/innovation-agents/this-startup-will-sequence-your-dna-so-you-can-contribute-to-medical-resea|magazine=[[Fast Company]]|access-date=29 December 2016|date=23 December 2016}}</ref><ref>{{cite magazine|last1=Seife|first1=Charles|title=23andMe Is Terrifying, but Not for the Reasons the FDA Thinks|url=https://www.scientificamerican.com/article/23andme-is-terrifying-but-not-for-the-reasons-the-fda-thinks/|magazine=[[Scientific American]]|access-date=29 December 2016}}</ref><ref>{{cite web|last1=Zaleski|first1=Andrew|title=This biotech start-up is betting your genes will yield the next wonder drug|url=https://www.cnbc.com/2016/06/22/23andme-thinks-your-genes-are-the-key-to-blockbuster-drugs.html|publisher=CNBC|access-date=29 December 2016|date=22 June 2016}}</ref><ref>{{cite magazine|last1=Regalado|first1=Antonio|title=How 23andMe turned your DNA into a $1 billion drug discovery machine|url=https://www.technologyreview.com/s/601506/23andme-sells-data-for-drug-search/|magazine=[[MIT Technology Review]]|access-date=29 December 2016}}</ref> Ahmad Hariri, professor of psychology and 
neuroscience at [[Duke University]] who has been using 23andMe in his research since 2009, states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists.<ref name=verge1/> A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demands to access the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.<ref>{{cite web|title=23andMe reports jump in requests for data in wake of Pfizer depression study {{!}} FierceBiotech |url =http://www.fiercebiotech.com/it/23andme-reports-jump-requests-for-data-wake-pfizer-depression-study| website=fiercebiotech.com|access-date=29 December 2016}}</ref>

−

*Computational fluid dynamics ([[Computational fluid dynamics|CFD]]) and hydrodynamic [[turbulence]] research generate massive data sets. The Johns Hopkins Turbulence Databases ([http://turbulence.pha.jhu.edu JHTDB]) contains over 350 terabytes of spatiotemporal fields from Direct Numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using "virtual sensors" with various access modes ranging from direct web-browser queries, access through Matlab, Python, Fortran and C programs executing on clients' platforms, to cut out services to download raw data. The data have been used in over 150 scientific publications.

−

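* The Large Hadron Collider experiments comprise about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995%~~ of these streams, there are 1,000 collisions of interest per second.~~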

+

* The Large Hadron Collider experiments comprise about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams,<ref>{{cite web|last1=Alexandru|first1=Dan|title=Prof|url=https://cds.cern.ch/record/1504817/files/CERN-THESIS-2013-004.pdf|website=cds.cern.ch|publisher=CERN|access-date=24 March 2015}}</ref> there are 1,000 collisions of interest per second.<ref>{{cite web |title=LHC Brochure, English version. A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public. |url=http://cds.cern.ch/record/1278169?ln=en |work=CERN-Brochure-2010-006-Eng. LHC Brochure, English version. |publisher=CERN |access-date=20 January 2013}}</ref><ref>{{cite web |title=LHC Guide, English version. A collection of facts and figures about the Large Hadron Collider (LHC) in the form of questions and answers. |url=http://cds.cern.ch/record/1092437?ln=en |work=CERN-Brochure-2008-001-Eng. LHC Guide, English version. |publisher=CERN |access-date=20 January 2013}}</ref><ref name="nature">{{cite news |title=High-energy physics: Down the petabyte highway |work= Nature |date= 19 January 2011 |first=Geoff |last=Brumfiel |doi= 10.1038/469282a |volume= 469 |pages= 282–83 |url= http://www.nature.com/news/2011/110119/full/469282a.html |bibcode=2011Natur.469..282B }}</ref>

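** As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents an annual rate of 25 petabytes before replication. This becomes nearly 200 petabytes after replication.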

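** If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with. It would exceed an annual rate of 150 million petabytes, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10<sup>20</sup>) bytes per day, almost 200 times more than all other sources in the world combined.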

−

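* The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one ~~petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.~~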

+

* The Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day.<ref>{{cite web|url= http://www.zurich.ibm.com/pdf/astron/CeBIT+2013+Background+DOME.pdf|title=IBM Research – Zurich| website=Zurich.ibm.com|access-date=8 October 2017}}</ref><ref>{{cite web|url =https://arstechnica.com/science/2012/04/future-telescope-array-drives-development-of-exabyte-processing/|title=Future telescope array drives development of Exabyte processing|work=Ars Technica |date=2 April 2012|access-date=15 April 2015}}</ref> It is considered one of the most ambitious scientific projects ever undertaken.<ref>{{cite web|url=http://theconversation.com/australias-bid-for-the-square-kilometre-array-an-insiders-perspective-4891|title=Australia's bid for the Square Kilometre Array – an insider's perspective|date=1 February 2012|publisher=[[The Conversation (website)|The Conversation]]|access-date=27 September 2016}}</ref>

−

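* When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 ~~terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days.~~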

+

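* When the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information.<ref name="Economist">{{cite news |title=Data, data everywhere |url=http://www.economist.com/node/15557443 |newspaper=The Economist |date=25 February 2010 |access-date=9 December 2012}}</ref> When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days.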

−

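* Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day. Over the last ten years, DNA sequencers have divided the sequencing cost by 10,000, which is 100 times cheaper than the cost reduction predicted by Moore's law.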

+

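* Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day. Over the last ten years, DNA sequencers have divided the sequencing cost by 10,000, which is 100 times cheaper than the cost reduction predicted by Moore's law.<ref>{{cite web|url=http://www.oecd.org/sti/ieconomy/Session_3_Delort.pdf#page=6|title=Delort P., OECD ICCP Technology Foresight Forum, 2012.|website=Oecd.org|access-date=8 October 2017}}</ref>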

−

* The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.

+

* The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.<ref>{{cite web|url=http://www.nasa.gov/centers/goddard/news/releases/2010/10-051.html|title=NASA – NASA Goddard Introduces the NASA Center for Climate Simulation|website=Nasa.gov|access-date=13 April 2016}}</ref><ref>{{cite web|last=Webster |first=Phil|title=Supercomputing the Climate: NASA's Big Data Mission| url=http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission |work=CSC World|publisher=Computer Sciences Corporation|access-date=18 January 2013|url-status=dead| archive-url =https://web.archive.org/web/20130104220150/http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission|archive-date=4 January 2013}}</ref>

−

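* Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any "friction points", or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to instantly scale social experiments that would usually take years.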

+

* Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any "friction points", or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to instantly scale social experiments that would usually take years.<ref>{{cite news| url=https://www.theglobeandmail.com/life/health-and-fitness/health/these-six-great-neuroscience-ideas-could-make-the-leap-from-lab-to-market/article21681731/|title=These six great neuroscience ideas could make the leap from lab to market|date=20 November 2014|work=[[The Globe and Mail]]|access-date=1 October 2016}}</ref><ref>{{cite web|url=https://cloud.google.com/customers/dnastack/|title=DNAstack tackles massive, complex DNA datasets with Google Genomics|publisher=Google Cloud Platform |access-date=1 October 2016}}</ref>

−

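* 23andMe's DNA database contains the genetic information of over 1,000,000 people worldwide. The company explores selling the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes if patients give their consent. Ahmad Hariri, professor of psychology and neuroscience at Duke University, has been using 23andMe in his research since 2009. He states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists. A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demands to access the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.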

+

* 23andMe's DNA database contains the genetic information of over 1,000,000 people worldwide.<ref>{{cite web|title=23andMe – Ancestry|url=https://www.23andme.com/en-int/ancestry/| website=23andme.com| access-date=29 December 2016}}</ref> The company explores selling the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes if patients give their consent.<ref name=verge1>{{cite web|last1=Potenza|first1=Alessandra| title=23andMe wants researchers to use its kits, in a bid to expand its collection of genetic data|url=https://www.theverge.com/2016/7/13/12166960/23andme-genetic-testing-database-genotyping-research|website=The Verge|access-date=29 December 2016|date=13 July 2016}}</ref><ref>{{cite magazine| title=This Startup Will Sequence Your DNA, So You Can Contribute To Medical Research |url= https://www.fastcompany.com/3066775/innovation-agents/this-startup-will-sequence-your-dna-so-you-can-contribute-to-medical-resea|magazine=[[Fast Company]]|access-date=29 December 2016|date=23 December 2016}}</ref><ref>{{cite magazine|last1=Seife|first1=Charles|title=23andMe Is Terrifying, but Not for the Reasons the FDA Thinks|url=https://www.scientificamerican.com/article/23andme-is-terrifying-but-not-for-the-reasons-the-fda-thinks/|magazine=[[Scientific American]]|access-date=29 December 2016}}</ref><ref>{{cite web|last1=Zaleski|first1=Andrew|title=This biotech start-up is betting your genes will yield the next wonder drug|url=https://www.cnbc.com/2016/06/22/23andme-thinks-your-genes-are-the-key-to-blockbuster-drugs.html|publisher=CNBC|access-date=29 December 2016|date=22 June 2016}}</ref><ref>{{cite magazine|last1=Regalado|first1=Antonio|title=How 23andMe turned your DNA into a $1 billion drug discovery machine|url=https://www.technologyreview.com/s/601506/23andme-sells-data-for-drug-search/|magazine=[[MIT Technology Review]]|access-date=29 December 2016}}</ref> Ahmad Hariri, professor of psychology and neuroscience at Duke University, has been using 23andMe in his research since 2009. He states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists.<ref name=verge1/> A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demands to access the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.<ref>{{cite web|title=23andMe reports jump in requests for data in wake of Pfizer depression study {{!}} FierceBiotech |url =http://www.fiercebiotech.com/it/23andme-reports-jump-requests-for-data-wake-pfizer-depression-study| website=fiercebiotech.com|access-date=29 December 2016}}</ref>

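* Computational fluid dynamics (CFD) and hydrodynamic turbulence research generate massive data sets. The Johns Hopkins Turbulence Databases (JHTDB) contain over 350 terabytes of spatiotemporal fields from direct numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using "virtual sensors" with various access modes, ranging from direct web-browser queries, through access via Matlab, Python, Fortran and C programs executing on clients' platforms, to cutout services for downloading raw data. The data have been used in over 150 scientific publications.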
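
Several of the figures in this list can be cross-checked with simple unit arithmetic. The sketch below is only an illustration, not part of the cited sources: it assumes decimal SI units (1 PB = 10<sup>15</sup> bytes, 1 EB = 10<sup>18</sup> bytes) and a 365-day year, and the function names are hypothetical helpers introduced here.

```python
# Sanity check of the data-rate figures quoted in the list above.
# Assumptions (not stated in the source): decimal SI units and a 365-day year.

BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18
DAYS_PER_YEAR = 365


def annual_pb_to_daily_eb(petabytes_per_year: float) -> float:
    """Convert an annual rate in petabytes to a daily rate in exabytes."""
    bytes_per_day = petabytes_per_year * BYTES_PER_PB / DAYS_PER_YEAR
    return bytes_per_day / BYTES_PER_EB


def kept_fraction(kept_pb_per_year: float, total_pb_per_year: float) -> float:
    """Fraction of the hypothetical unfiltered stream that is actually recorded."""
    return kept_pb_per_year / total_pb_per_year


def nights_to_accumulate(total_tb: float, gb_per_night: float) -> float:
    """Nights needed to collect total_tb terabytes at gb_per_night gigabytes per night."""
    return total_tb * 1000 / gb_per_night


if __name__ == "__main__":
    # LHC: 150 million PB per year is the source's lower bound for the unfiltered flow.
    print(f"LHC unfiltered: at least {annual_pb_to_daily_eb(150e6):.0f} EB/day")
    # LHC: 25 PB per year recorded out of at least 150 million PB per year.
    print(f"LHC recorded fraction: {kept_fraction(25, 150e6):.2e}")
    # SDSS: 140 TB at about 200 GB per night.
    print(f"SDSS nights for 140 TB: {nights_to_accumulate(140, 200):.0f}")
```

Under these assumptions, the 150 million PB lower bound works out to a little over 400 EB per day, the same order of magnitude as the "nearly 500 exabytes per day" quoted for a flow that exceeds that bound, and the recorded fraction falls well below the stated 0.001%.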

−

+

<br>

=== Sports ===

薄荷

7,129 edits
