更改

大数据 (查看源代码)

2022年2月7日 (一) 21:44的版本

添加3,005字节、 2022年2月7日 (一) 21:44

V1.0_20220207_翻译完成

第54行：第54行：

一些组织添加了“多样性”、“准确性”和其他各种“ v”来描述它，这个修订受到了一些行业权威的质疑。大数据 Vs 通常被称为“三个 Vs”、“四个 Vs”和“五个 Vs”。它们在数量、多样性、速度、准确性和价值等方面代表了大数据的特性。可变性通常作为大数据的附加质量被包括在内。

−

'''''【终译版】'''''一些组织增加了“多样性”、“准确性”和其他各种“V”开头的字母来描述它，但这一修订受到了一些行业权威的质疑。大数据的V通常被称为三V、四V和V。它们代表了大数据的大数量、多样性、速度、准确性和价值（volume, variety, velocity, veracity, and value）。可变性通常被视为大数据的额外属性。

+

'''''【终译版】'''''一些组织增加了“多样性”、“准确性”（"Variety", "Veracity"）和其他各种以“V”开头的词来描述它，但这一修订受到了一些行业权威的质疑。大数据的V通常被称为“三V”、“四V”和“五V”。它们代表了大数据的数量、多样性、速度、准确性和价值（volume, variety, velocity, veracity, and value）等特性。可变性（variability）通常被视为大数据的额外属性。

A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by [[Relational database|Codd's relational model]]."<ref>{{Cite book|last=Fox|first=Charles|date=25 March 2018|title=Data Science for Transport| url=https://www.springer.com/us/book/9783319729527|publisher=Springer|isbn=9783319729527|series=Springer Textbooks in Earth Sciences, Geography and Environment}}</ref>

第265行：第265行：

==Technologies==

−

== ~~大数据技术~~ ==

+

== 技术 ==

A 2011 [[McKinsey & Company|McKinsey Global Institute]] report characterizes the main components and ecosystem of big data as follows:<ref name="McKinsey">{{cite journal | last1 = Manyika | first1 = James | first2 = Michael | last2 = Chui | first3 = Jaques | last3 = Bughin | first4 = Brad | last4 = Brown | first5 = Richard | last5 = Dobbs | first6 = Charles | last6 = Roxburgh | first7 = Angela Hung | last7 = Byers | title = Big Data: The next frontier for innovation, competition, and productivity | publisher = McKinsey Global Institute | date = May 2011 | url = https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/big%20data%20the%20next%20frontier%20for%20innovation/mgi_big_data_full_report.pdf | access-date = 22 May 2021 }}</ref>

* Techniques for analyzing data, such as [[A/B testing]], [[machine learning]], and [[natural language processing]]

第330行：第330行：

==Applications==

+

== 应用 ==

[[File:2013-09-11 Bus wrapped with SAP Big Data parked outside IDF13 (9730051783).jpg|thumb|Bus wrapped with [[SAP AG|SAP]] big data parked outside [[Intel Developer Forum|IDF13]].|链接=Special:FilePath/2013-09-11_Bus_wrapped_with_SAP_Big_Data_parked_outside_IDF13_(9730051783).jpg]]

Big data has increased the demand for information management specialists so much that [[Software AG]], [[Oracle Corporation]], [[IBM]], [[Microsoft]], [[SAP AG|SAP]], [[EMC Corporation|EMC]], [[Hewlett-Packard|HP]], and [[Dell]] have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.{{r|Economist}}

第454行：第456行：

'''''【终译版】'''''

−

大数据分析通过提供个性化医疗和处方分析、临床风险干预和预测分析、减少废物和护理变异性、患者数据的自动外部和内部报告、标准化医疗术语和患者登记，被用于医疗保健。有些领域的改进更具抱负，而不是实际实施。医疗保健系统内生成的数据水平并非微不足道。随着mHealth、eHealth和可穿戴技术的进一步采用，数据量将继续增加。这包括电子健康记录数据、成像数据、患者生成的数据、传感器数据和其他难以处理的数据。现在，这种环境更加需要关注数据和信息质量。“大数据通常意味着‘脏数据’，数据不准确的比例随着数据量的增长而增加。”在大数据范围内进行人体检查是不可能的，卫生服务部门迫切需要智能工具来准确、可信地控制和处理丢失的信息。虽然医疗保健领域的大量信息现在是电子化的，但它符合大数据的要求，因为大多数信息都是非结构化的，难以使用。在医疗保健领域使用大数据带来了重大的道德挑战，从个人权利、隐私和自主权的风险，到透明度和信任。

+

通过提供个性化医疗和规范性分析、临床风险干预和预测分析、减少浪费和护理差异、患者数据的自动外部和内部报告、标准化医疗术语和患者登记，大数据分析在医疗保健中得到了应用。某些领域的改进目前更多停留在愿景层面，尚未真正落地。医疗保健系统内生成的数据量并非微不足道。随着移动健康、电子健康和可穿戴技术的广泛应用，数据量将继续增加。这包括电子健康记录数据、成像数据、患者生成的数据、传感器数据和其他难以处理的数据。现在，这类环境更需要关注数据和信息质量。“大数据通常意味着‘脏数据’，数据不准确的比例随着数据量的增长而增加。”在大数据规模下进行人工检查是不可能的，卫生服务部门迫切需要智能工具来控制准确性和可信度，并处理遗漏的信息。虽然医疗保健领域的大量信息现在是电子化的，但由于大多数信息是非结构化的且难以使用，它仍属于大数据的范畴。在医疗保健中使用大数据引发了重大的伦理挑战，从个人权利、隐私和自主权面临的风险，到透明度和信任问题。

Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research.<ref>{{Cite journal|last=Copeland|first=CS|date=Jul–Aug 2017|title=Data Driving Discovery|url=http://claudiacopeland.com/uploads/3/5/5/6/35560346/_hjno_data_driving_discovery_2pv.pdf|journal=Healthcare Journal of New Orleans|pages=22–27}}</ref> Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow up biological research and eventually clinical research.

第463行：第465行：

'''''【终译版】'''''

+

在探索性生物医学研究方面，健康研究中的大数据尤其有希望，因为数据驱动的分析比假设驱动的研究进展更快。然后，数据分析中的趋势可以在传统的、假设驱动的后续生物学研究和最终的临床研究中得到检验。

A related application sub-area within the healthcare field that relies heavily on big data is [[computer-aided diagnosis]] in medicine.

第477行：第481行：

These are just a few of the many examples where computer-aided diagnosis uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome in order to reach the next level of performance.

+

在医疗保健领域，一个相关的应用子领域，严重依赖于大数据，那就是医药电脑辅助诊断。例如，对于癫痫监测，通常每天创建5到10gb 的数据。同样，一张未压缩的乳房断层合成图像平均有450mb 的数据。这些只是电脑辅助诊断使用大数据的众多例子中的一小部分。基于这个原因，大数据已经被认为是电脑辅助诊断系统需要克服的7个关键挑战之一，以达到下一个性能水平。

'''''【终译版】'''''

−

医疗领域中一个严重依赖大数据的相关应用子领域是医学中的计算机辅助诊断。例如，对于癫痫监测，通常每天创建5到10GB的数据。类似地，一张未压缩的乳房断层合成图像的平均数据量为450 MB。这些只是计算机辅助诊断使用大数据的众多例子中的一小部分。因此，大数据被认为是计算机辅助诊断系统需要克服的七大关键挑战之一，以达到下一个性能水平。

+

医疗领域中一个严重依赖大数据的子领域是医学中的计算机辅助诊断。例如，对于癫痫监测，通常每天产生5到10 GB的数据。类似地，一张未压缩的乳腺断层合成图像的平均数据量为450 MB。这些只是计算机辅助诊断使用大数据的众多例子中的一小部分。因此，大数据被认为是计算机辅助诊断系统为达到更高性能水平而需要克服的七大关键挑战之一。

===Education===

+

=== 教育 ===

A [[McKinsey & Company|McKinsey Global Institute]] study found a shortage of 1.5 million highly trained data professionals and managers<ref name="McKinsey"/> and a number of universities<ref>{{cite web

| url=https://www.forbes.com/sites/jmaureenhenderson/2013/07/30/degrees-in-big-data-fad-or-fast-track-to-career-success/

第503行：第510行：

'''''【终译版】'''''

−

麦肯锡全球研究所发现，150万名训练有素的数据专家和管理人员短缺，包括田纳西大学和加州大学伯克利分校在内的一些大学已经建立了硕士课程来满足这一需求。私营新兵训练营也开发了一些项目来满足这一需求，包括数据孵化器等免费项目或大会等付费项目。在营销的特定领域，Wedel和Kannan强调的一个问题是，营销有几个子领域（例如广告、促销、产品开发、品牌推广），它们都使用不同类型的数据。

+

麦肯锡全球研究所的一项研究发现，训练有素的数据专业人员和管理人员存在150万人的缺口，包括田纳西大学和加州大学伯克利分校在内的一些大学已经设立了硕士课程来满足这一需求。私营训练营也开发了一些项目来满足这一需求，包括The Data Incubator等免费项目和General Assembly等付费项目。在营销这一特定领域，Wedel和Kannan强调的一个问题是，营销有多个子领域（例如广告、促销、产品开发、品牌推广），它们使用的数据类型各不相同。

===Media===

+

=== 媒体 ===

To understand how the media uses big data, it is first necessary to provide some context on the mechanisms used in media processes. Nick Couldry and Joseph Turow have suggested that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers that have been exclusively gleaned through various [[data-mining]] activities.<ref>{{cite journal|last1=Couldry|first1=Nick|last2=Turow|first2=Joseph|title=Advertising, Big Data, and the Clearance of the Public Realm: Marketers' New Approaches to the Content Subsidy| journal=International Journal of Communication|date=2014|volume=8|pages=1710–1726}}</ref>

* Targeting of consumers (for advertising by marketers)<ref>{{cite web|url=https://ishti.org/2018/04/15/why-digital-advertising-agencies-suck-at-acquisition-and-are-in-dire-need-of-an-ai-assisted-upgrade/|title=Why Digital Advertising Agencies Suck at Acquisition and are in Dire Need of an AI Assisted Upgrade|website=Ishti.org|access-date=15 April 2018|date=15 April 2018|archive-date=12 February 2019|archive-url=https://web.archive.org/web/20190212174722/https://ishti.org/2018/04/15/why-digital-advertising-agencies-suck-at-acquisition-and-are-in-dire-need-of-an-ai-assisted-upgrade/|url-status=dead}}</ref>

第520行：第529行：

* 数据捕捉

* 数据新闻: 出版商和记者使用大数据工具提供独特和创新的见解和信息图表。

+

'''''【终译版】'''''

+

为了理解媒体如何使用大数据，首先需要介绍媒体运作所用机制的一些背景。尼克·库尔德利（Nick Couldry）和约瑟夫·图罗（Joseph Turow）指出，媒体和广告从业者将大数据视为关于数百万个人的大量可操作信息点。该行业似乎正在摆脱使用特定媒体环境（如报纸、杂志或电视节目）的传统方式，转而利用技术在最佳时间、最佳地点接触目标人群，以触达消费者。最终目的是提供或传达（从统计学上讲）符合消费者心态的信息或内容。例如，发布环境越来越多地定制消息（广告）和内容（文章），以吸引完全通过各种数据挖掘活动发掘出的消费者。

−

+

* 定位消费者（供营销人员投放广告）。

−

'''''【终译版】'''''为了理解媒体如何使用大数据，首先需要为媒体处理所使用的机制提供一些上下文。尼克·库尔德利（Nick Couldry）和约瑟夫·图罗（Joseph Turow）曾建议，媒体和广告从业者在处理大数据时，应尽可能多地处理数百万个人的可操作信息点。该行业似乎正在摆脱使用特定媒体环境（如报纸、杂志或电视节目）的传统方式，转而利用技术在最佳时间、最佳地点接触目标人群，以吸引消费者。最终目的是提供或传达（从统计学上讲）符合消费者心态的信息或内容。例如，发布环境越来越多地定制消息（广告）和内容（文章），以吸引专门通过各种数据挖掘活动收集的消费者。

+

* 数据捕获。

−

+

* 数据新闻：出版商和记者使用大数据工具提供独特和创新的见解和信息图表。

−

~~以消费者为目标（针对营销人员的广告）~~

−

~~数据捕获~~

−

数据新闻：出版商和记者使用大数据工具提供独特和创新的见解和信息图表。

[[Channel 4]], the British [[Public service broadcasting in the United Kingdom|public-service]] television broadcaster, is a leader in the field of big data and [[data analysis]].<ref>{{cite web|url=https://www.ibc.org/tech-advances/big-data-and-analytics-c4-and-genius-digital/1076.article |title=Big data and analytics: C4 and Genius Digital|website=Ibc.org |access-date=8 October 2017}}</ref>

第540行：第546行：

===Insurance===

+

=== 保险 ===

Health insurance providers are collecting data on social "determinants of health" such as food and [[Television consumption|TV consumption]], marital status, clothing size, and purchasing habits, from which they make predictions on health costs, in order to spot health issues in their clients. It is controversial whether these predictions are currently being used for pricing.<ref>{{Cite web|author=Marshall Allen|url=https://www.propublica.org/article/health-insurers-are-vacuuming-up-details-about-you-and-it-could-raise-your-rates| title=Health Insurers Are Vacuuming Up Details About You – And It Could Raise Your Rates|website=www.propublica.org|date=17 July 2018|access-date=21 July 2018}}</ref>

第546行：第554行：

医疗保险提供者正在收集关于诸如食物和电视消费、婚姻状况、衣服尺寸和购买习惯等社会“健康决定因素”的数据，从而对医疗费用进行预测，以便发现客户的健康问题。这些预测目前是否被用于定价还存在争议。

−

'''''【终译版】'''''健康保险提供商正在收集有关社会“健康决定因素”的数据，如食品和电视消费、婚姻状况、服装尺寸和购买习惯，并根据这些数据预测健康成本，以便发现客户的健康问题。目前，这些预测是否用于定价还存在争议。

+

'''''【终译版】'''''健康保险提供商正在收集有关社会“健康决定因素”的数据，如食品和电视消费、婚姻状况、服装尺寸和购买习惯，并根据这些数据预测健康成本，以便发现客户的健康问题。这些预测目前是否已被用于定价，尚存争议。

===Internet of things (IoT)===

+

=== 物联网 ===

第559行：第569行：

'''''【终译版】'''''

−

大数据和物联网协同工作。从物联网设备提取的数据提供了设备间连接的映射。媒体行业、公司和政府已经使用这种映射来更准确地定位受众并提高媒体效率。物联网也越来越多地被用作收集感官数据的手段，这种感官数据已被用于医疗、制造和运输环境。

+

大数据和物联网协同工作。从物联网设备提取的数据提供了设备间连接的映射。媒体行业、公司和政府已经使用这种映射来更准确地定位受众并提高媒体效率。物联网也越来越多地被用作收集传感数据的手段，这类传感数据已被用于医疗、制造和运输环境。

[[Kevin Ashton]], the digital innovation expert who is credited with coining the term,<ref>{{cite web|url=http://www.rfidjournal.com/articles/view?4986|title=That Internet Of Things Thing.}}</ref> defines the Internet of things in this quote: "If we had computers that knew everything there was to know about things—using data they gathered without any help from us—we would be able to track and count everything, and greatly reduce waste, loss, and cost. We would know when things needed replacing, repairing, or recalling, and whether they were fresh or past their best."

第570行：第580行：

===Information technology===

+

=== 信息技术 ===

Especially since 2015, big data has come to prominence within [[business operations]] as a tool to help employees work more efficiently and streamline the collection and distribution of [[information technology]] (IT). The use of big data to resolve IT and data collection issues within an enterprise is called [[IT operations analytics]] (ITOA).<ref name="ITOA1">{{cite web|last1=Solnik|first1=Ray |title=The Time Has Come: Analytics Delivers for IT Operations |url =http://www.datacenterjournal.com/time-analytics-delivers-operations/|website=Data Center Journal| access-date=21 June 2016}}</ref> By applying big data principles into the concepts of [[machine intelligence]] and deep computing, IT departments can predict potential issues and prevent them.<ref name="ITOA1" /> ITOA businesses offer platforms for [[systems management]] that bring [[data silos]] together and generate insights from the whole of the system rather than from isolated pockets of data.

第576行：第588行：

=== 信息技术 ===

特别是自2015年以来，大数据作为帮助雇员提高工作效率和简化信息技术的收集和分发的一种工具，在企业运作中日益受到重视。使用大数据来解决企业内部的 IT 和数据收集问题被称为 IT 操作分析(ITOA)。通过将大数据原理应用到机器智能和深度计算的概念中，IT 部门可以预测潜在的问题并预防它们。ITOA 企业提供系统管理平台，将数据竖井集中在一起，从整个系统而不是从孤立的数据块中产生见解。

−

'''''【终译版】'''''特别是自2015年以来，大数据作为一种帮助员工更高效地工作并简化信息技术（IT）收集和分发的工具，在企业运营中日益突出。利用大数据解决企业内部的IT和数据收集问题称为IT运营分析（ITOA）。通过将大数据原理应用到机器智能和深度计算的概念中，IT部门可以预测潜在问题并加以预防。ITOA企业提供系统管理平台，将数据仓库整合在一起，从整个系统而不是从孤立的数据包中产生见解。

+

'''''【终译版】'''''特别是自2015年以来，大数据作为一种帮助员工更高效地工作并简化信息技术（IT）收集和分发的工具，在企业运营中日益突出。利用大数据解决企业内部的IT和数据收集问题称为IT运营分析（ITOA）。通过将大数据原理应用到机器智能和深度计算的概念中，IT部门可以预测潜在问题并加以预防。ITOA企业提供系统管理平台，将数据孤岛整合在一起，从整个系统而不是从孤立的数据块中产生见解。

==Case studies==

第588行：第600行：

* The Integrated Joint Operations Platform (IJOP, 一体化联合作战平台) is used by the government to monitor the population, particularly Uyghurs. Biometrics, including DNA samples, are gathered through a program of free physicals.

* By 2020, China plans to give all its citizens a personal "social credit" score based on how they behave. The Social Credit System, now being piloted in a number of Chinese cities, is considered a form of mass surveillance which uses big data analysis technology.

−

~~====China====~~

* The Integrated Joint Operations Platform (IJOP, 一体化联合作战平台) is used by the government to monitor the population, particularly Uyghurs.生物测定学，包括 DNA 样本，是通过一个免费的体检程序收集的。

第623行：第633行：

* Data on prescription drugs: by connecting origin, location and the time of each prescription, a research unit was able to exemplify and examine the considerable delay between the release of any given drug, and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient.

* Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as Meals on Wheels. The connection of data allowed the local authority to avoid any weather-related delay.

−

* 关于处方药的数据: 通过将处方的来源、地点和时间联系起来，一个研究单位能够举例说明和审查任何特定药物发放之间相当长的延误，以及全英国对国家保健和优质护理研究所准则的调整。这表明，新的或最新的药物需要一些时间过滤到普通病人。

第642行：第651行：

* 大数据分析在巴拉克•奥巴马(Barack Obama)成功赢得2012年连任竞选中发挥了重要作用。

* 美国联邦政府拥有世界上功能最强大的十台超级计算机中的五台。犹他数据中心由美国国家安全局建造。完工后，该设施将能够处理国家安全局通过互联网收集的大量信息。存储空间的确切数量还不得而知，但最近有消息称，存储空间大约为几艾字节。这引起了对所收集数据的匿名性的安全担忧。

+

'''''【终译版】'''''

+

=== 国家 ===

+

==== 中国 ====

+

* 一体化联合作战平台（The Integrated Joint Operations Platform，IJOP）被政府用来监控人口，尤其是维吾尔族。通过免费体检项目收集生物特征（包括DNA样本）。

+

* 到2020年，中国计划根据所有公民的行为给他们个人“社会信用”评分。目前正在中国多个城市试点的社会信用体系被认为是一种使用大数据分析技术的大规模监控。

+

==== 印度 ====

+

* 为了赢得2014年印度大选，印度人民党尝试了大数据分析。

+

* 印度政府使用多种技术来确定印度选民对政府行动的反应，以及对政策的看法。

+

==== 以色列 ====

+

* 通过GlucoMe的大数据解决方案创建了个性化的糖尿病治疗。

+

==== 英国 ====

+

大数据在公共服务中的应用示例：

+

* 处方药数据：通过关联每张处方的来源、地点和时间，一个研究单位得以举例说明并检视任一药物上市与英国国家卫生与临床优化研究所（National Institute for Health and Care Excellence）指南在全英范围内调整之间的显著延迟。这表明，新药或最新的药物需要一段时间才能惠及普通患者。

+

* 整合数据：某地方当局将道路撒砂排班表等服务数据与面向高危人群的服务（如“Meals on Wheels”送餐上门服务）的数据融合在一起。数据的关联使该地方当局得以避免天气原因导致的延误。

+

==== 美国 ====

+

* 2012年，奥巴马政府宣布了大数据研发计划，以探索如何利用大数据解决政府面临的重要问题。该计划由分布在六个部门的84个不同的大数据项目组成。

+

* 大数据分析在奥巴马2012年成功连任竞选中发挥了重要作用。

+

* 美国联邦政府拥有世界上最强大的十台超级计算机中的五台。

+

* 犹他州数据中心由美国国家安全局建造。完成后，该设施将能够处理NSA通过互联网收集的大量信息。确切的存储空间数量不得而知，但最近的消息来源称，存储空间大约为几EB。这对所收集数据的匿名性提出了安全担忧。

===Retail===

+

=== 零售 ===

* [[Walmart]] handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data—the equivalent of 167 times the information contained in all the books in the US [[Library of Congress]].{{r|Economist}}

* [[Windermere Real Estate]] uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.<ref>{{cite news| last=Wingfield |first=Nick |url= http://bits.blogs.nytimes.com/2013/03/12/predicting-commutes-more-accurately-for-would-be-home-buyers/ |title=Predicting Commutes More Accurately for Would-Be Home Buyers |work=The New York Times |date=12 March 2013 |access-date=21 July 2013}}</ref>

第656行：第697行：

* FICO 卡检测系统保护全球账户。

'''''【终译版】'''''

+

* 沃尔玛每小时处理超过100万笔客户交易，这些交易被导入数据库，据估计包含超过2.5 PB（2560 TB）的数据，相当于美国国会图书馆所有书籍所包含信息的167倍。

+

* Windermere Real Estate利用近1亿名司机的位置信息，帮助新购房者确定一天中不同时间上下班的典型驾驶时间。

+

* FICO卡检测系统保护世界各地的账户。
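The figures in the Walmart item can be sanity-checked with a few lines of arithmetic. The sketch below assumes binary prefixes (1 PB = 1024 TB), which is the reading under which 2.5 PB equals the quoted 2560 TB. The function names are illustrative and do not come from the source.

```python
# Assumption: binary prefixes, so 1 PB = 1024 TB.
TB_PER_PB = 1024
SECONDS_PER_HOUR = 3600


def petabytes_to_terabytes(petabytes: float) -> float:
    """Convert petabytes to terabytes using binary prefixes."""
    return petabytes * TB_PER_PB


def per_second(per_hour: float) -> float:
    """Convert an hourly rate into a per-second rate."""
    return per_hour / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Database size quoted in the text: 2.5 PB.
    print(petabytes_to_terabytes(2.5), "TB")
    # Transaction rate quoted in the text: more than 1 million per hour.
    print(round(per_second(1_000_000)), "transactions per second (lower bound)")
```

Under these assumptions, one million transactions per hour corresponds to a sustained rate of a few hundred transactions per second.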

===Science===

第693行：第738行：

'''''【终译版】'''''

−

~~大型强子对撞机的实验代表了大约1~~.5亿个传感器每秒传送4000万次数据。每秒有近6亿次碰撞。在过滤并避免记录超过99.99995%的流之后，每秒有1000次感兴趣的碰撞。

+

* 大型强子对撞机的实验有着大约1.5亿个传感器每秒传送4000万次数据。每秒有近6亿次碰撞。在过滤并避免记录超过99.99995%的流之后，每秒有1000次感兴趣的碰撞。

+

** 因此，仅使用不到0.001%的传感器流数据，所有四个LHC实验的数据流在复制前代表25 PB的年速率。复制后，这将变成近200 PB。

+

** 如果记录LHC的全部传感器数据，数据流将极难处理。在复制之前，数据流的年速率将超过1.5亿PB，即每天近500 EB。为了直观理解这一数字，它相当于每天5×10<sup>20</sup>字节（500百亿亿字节），几乎是世界上所有其他数据源总和的200倍。

+

* 平方公里阵列（Square Kilometre Array）是一个由数千根天线组成的射电望远镜，预计将于2024年投入使用。这些天线预计每天采集14 EB数据并存储1 PB。它被认为是有史以来最雄心勃勃的科学项目之一。

+

* 斯隆数字巡天（SDSS）在2000年开始收集天文数据时，最初几周收集的数据就超过了此前天文学史上收集的全部数据。SDSS以每晚约200 GB的速度持续运行，已经积累了超过140 TB的信息。当SDSS的后继者大型综合巡天望远镜（Large Synoptic Survey Telescope）在2020年上线时，其设计者预计它每五天就能获取同等数量的数据。

+

* 解码人类基因组最初需要10年的时间；现在不到一天就可以完成。在过去十年中，DNA测序仪将测序成本降到了原来的万分之一，这一降幅是摩尔定律所预测降幅的100倍。

+

* 美国国家航空航天局气候模拟中心（NCCS）在Discover超级计算集群上存储了32 PB的气候观测和模拟数据。

+

* 谷歌的DNAStack对来自世界各地的基因数据的DNA样本进行编译和组织，以识别疾病和其他医疗缺陷。这些快速而精确的计算消除了任何“摩擦点”，或是众多研究DNA的科学和生物学专家中可能出现的人为错误。DNAStack是谷歌基因组学的一部分，它允许科学家使用谷歌搜索服务器上的大量样本资源来规模化社会实验，这些实验通常需要数年的时间。

+

* 23andMe的DNA数据库包含全世界100多万人的基因信息。该公司正在探索在患者同意的情况下，将“匿名聚合基因数据”出售给其他研究人员和制药公司用于研究目的。杜克大学（Duke University）心理学和神经科学教授艾哈迈德·哈里里（Ahmad Hariri）自2009年以来一直在使用23andMe进行研究。他表示，该公司新服务最重要的一点是，它让科学家能够以相对较低的成本开展基因研究。一项研究在23andMe的数据库中确定了15个与抑郁症相关的基因组位点，导致访问该数据库的需求激增，23andMe在论文发表后的两周内收到了近20个访问抑郁症数据的请求。

+

* 计算流体力学（CFD）和流体动力湍流研究产生了大量数据集。约翰·霍普金斯湍流数据库（JHTDB）包含超过350 TB的时空场，这些场来自各种湍流的直接数值模拟。使用下载扁平的模拟输出文件等传统方法很难共享此类数据。JHTDB中的数据可以使用“虚拟传感器”进行访问，其访问模式多种多样，从直接通过网络浏览器查询、通过在客户端平台上执行的Matlab、Python、Fortran和C程序进行访问，到使用截取（cut-out）服务下载原始数据。这些数据已用于150多份科学出版物。
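The Large Hadron Collider figures quoted in the list above can be cross-checked with simple arithmetic. The sketch below assumes decimal prefixes (1 PB = 10^15 bytes, 1 EB = 10^18 bytes) and a 365-day year. The variable and function names are illustrative only.

```python
# Assumption: decimal (SI) prefixes and a 365-day year.
PB = 10**15  # bytes per petabyte
EB = 10**18  # bytes per exabyte
DAYS_PER_YEAR = 365

RECORDED_PB_PER_YEAR = 25             # recorded rate before replication (from the text)
UNFILTERED_PB_PER_YEAR = 150_000_000  # hypothetical rate if every sensor reading were kept


def daily_rate_in_exabytes(petabytes_per_year: float) -> float:
    """Convert an annual rate in PB into a daily rate in EB."""
    return petabytes_per_year * PB / DAYS_PER_YEAR / EB


def retained_percent(recorded: float, total: float) -> float:
    """Share of the unfiltered stream that is actually recorded, in percent."""
    return 100 * recorded / total


if __name__ == "__main__":
    print(f"{daily_rate_in_exabytes(UNFILTERED_PB_PER_YEAR):.0f} EB per day")
    print(f"{retained_percent(RECORDED_PB_PER_YEAR, UNFILTERED_PB_PER_YEAR):.7f} % retained")
    print(f"500 EB = {500 * EB:.0e} bytes")
```

With these assumptions, 150 million PB per year works out to roughly 411 EB per day, the same order of magnitude as the "nearly 500 EB" quoted in the text, and 500 EB is exactly 5×10^20 bytes. The recorded 25 PB per year is well under 0.001% of the unfiltered stream.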

−

~~因此，仅使用不到0.001%的传感器流数据，所有四个LHC实验的数据流在复制前代表25 PB的年速率（）。复制后，这将变成近200 PB。~~

+

===Sports===

−

如果所有传感器数据都记录在LHC中，数据流将非常难以处理。在复制之前，数据流的年速率将超过1.5亿PB，即每天近500 EB。从长远来看，这个数字相当于每天500五百万（5×1020）字节，几乎是世界上所有其他数据源总和的200倍。

−

平方公里的阵列是一个由数千根天线组成的射电望远镜。预计将于2024年投入使用。这些天线的总容量预计为14 EB，每天存储1 PB。它被认为是有史以来最雄心勃勃的科学项目之一。

−

斯隆数字天空测量（SDSS）在2000年开始收集天文数据时，它在最初几周收集的数据比之前天文学史上收集的所有数据都多。SDS以每晚约200 GB的速度运行，已经积累了超过140 TB的信息。当SDSS的后继者大型天气观测望远镜在2020年上线时，其设计者预计它将每五天获取如此数量的数据。

−

~~解码人类基因组最初需要10年的时间；现在不到一天就可以实现。在过去十年中，DNA测序仪将测序成本除以10000，比摩尔定律预测的成本降低100倍。~~

−

~~美国国家航空航天局气候模拟中心（NCCS）在探索超级计算集群上存储了32 PB的气候观测和模拟数据。~~

−

谷歌的DNAStack对来自世界各地的基因数据的DNA样本进行编译和组织，以识别疾病和其他医疗缺陷。这些快速而精确的计算消除了任何“摩擦点”，或是众多研究DNA的科学和生物学专家中可能出现的人为错误。DNAStack是谷歌基因组学的一部分，它允许科学家使用谷歌搜索服务器上的大量资源样本来扩展通常需要数年时间的社会实验。

−

23andme的DNA数据库包含全世界100多万人的基因信息。该公司探索在患者同意的情况下，将“匿名聚合基因数据”出售给其他研究人员和制药公司用于研究目的。杜克大学（Duke University）心理学和神经科学教授艾哈迈德·哈里里（Ahmad Hariri）自2009年以来一直在使用23andMe进行研究。他表示，该公司新服务的最重要方面是，它使科学家可以进行基因研究，而且成本相对较低。一项研究在23andMe的数据库中确定了15个与抑郁症相关的基因组位点，导致访问存储库的需求激增，23andMe在论文发表后的两周内提出了近20个访问抑郁症数据的请求。

−

计算流体力学（CFD）和流体动力湍流研究产生了大量数据集。约翰·霍普金斯湍流数据库（JHTDB）包含超过350 TB的时空场，这些场来自各种湍流的直接数值模拟。使用下载平面模拟输出文件等传统方法很难共享此类数据。JHTDB中的数据可以使用“虚拟传感器”进行访问，其访问模式多种多样，从直接网络浏览器查询、通过在客户平台上执行的Matlab、Python、Fortran和C程序进行访问，到切断服务下载原始数据。这些数据已用于150多份科学出版物。

+

=== 体育 ===

−

===~~Sports~~===

Big data can be used to improve training and the understanding of competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics.<ref>{{cite web|url=http://www.itweb.co.za/index.php?option=com_content&view=article&id=147241|title=Data scientists predict Springbok defeat |author=Admire Moyo| work=itweb.co.za|date=23 October 2015 |access-date=12 December 2015}}</ref>

Future performance of players could be predicted as well. Thus, players' value and salary are determined by data collected throughout the season.<ref>{{cite web|url=http://www.itweb.co.za/index.php?option=com_content&view=article&id=147852|title= Predictive analytics, big data transform sports

第737行：第775行：

===Technology===

+

=== 技术 ===

* [[eBay.com]] uses two [[data warehouse]]s at 7.5 [[petabytes]] and 40PB as well as a 40PB [[Hadoop]] cluster for search, consumer recommendations, and merchandising.<ref>{{cite web | last=Tay | first=Liz |url=http://www.itnews.com.au/news/inside-ebay8217s-90pb-data-warehouse-342615 | title=Inside eBay's 90PB data warehouse | publisher=ITNews | access-date=12 February 2016}}</ref>

* [[Amazon.com]] handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and {{as of|2005|lc=on}} they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.<ref>{{cite web|last=Layton |first=Julia |url= http://money.howstuffworks.com/amazon1.htm | title=Amazon Technology |date=25 January 2006 |publisher= Money.howstuffworks.com |access-date=5 March 2013}}</ref>

第753行：第793行：

'''''【终译版】'''''

−

~~易趣网。com使用两个7~~.5 PB和40PB的数据仓库，以及一个40PB的Hadoop集群来进行搜索、消费者推荐和商品销售。

+

易趣网（eBay.com）使用两个容量分别为7.5 PB和40 PB的数据仓库，以及一个40 PB的Hadoop集群，用于搜索、消费者推荐和商品推销。

−

亚马逊。com每天处理数以百万计的后端操作，以及来自50多万第三方卖家的查询。保持亚马逊运行的核心技术是基于Linux的，他们拥有世界上三大Linux数据库，容量分别为7.8 TB、18.5 TB和24.7 TB。

+

亚马逊（Amazon.com）每天处理数以百万计的后端操作，以及来自50多万第三方卖家的查询。保持亚马逊运行的核心技术基于Linux；截至2005年，他们拥有世界上最大的三个Linux数据库，容量分别为7.8 TB、18.5 TB和24.7 TB。

Facebook从其用户群中处理500亿张照片，Facebook每月有20亿活跃用户。

第762行：第802行：

===COVID-19===

+

=== 新冠大流行 ===

During the [[COVID-19 pandemic]], big data was raised as a way to minimise the impact of the disease. Significant applications of big data included minimising the spread of the virus, case identification and development of medical treatment.<ref>{{cite journal |last1=Haleem |first1=Abid |last2=Javaid |first2=Mohd |last3=Khan |first3=Ibrahim |last4=Vaishya |first4=Raju |title=Significant Applications of Big Data in COVID-19 Pandemic |journal=Indian Journal of Orthopaedics |date=2020 |volume=54 |issue=4 |pages=526–528 |doi=10.1007/s43465-020-00129-z |pmid=32382166 |pmc=7204193 }}</ref>

第787行：第829行：

2014年3月，美国工程教育学会演示了大数据中的加密搜索和集群形成。由麻省理工学院计算机科学和人工智能实验室和 UNH 研究小组的 Amir Esmailpour 共同致力于解决大数据的挑战，他们研究了大数据的关键特征，即集群的形成及其相互联系。他们重点讨论了大数据的安全性以及该术语的方向，即通过提供技术中的原始定义和实时示例，在云界面上以加密形式存在不同类型的数据。此外，他们还提出了一种识别编码技术的方法，以便对加密文本进行快速搜索，从而加强大数据的安全性。

−

'''''【终译版】'''''2014年3月，美国工程教育学会（American Society of Engineering Education）展示了大数据中的加密搜索和集群形成。麻省理工学院计算机科学和人工智能实验室的Gautam Siwach和UNH研究小组的Amir Esmailpour致力于应对大数据的挑战，他们研究了大数据的关键特征，如集群的形成及其相互关联。他们通过提供技术中的原始定义和实时示例，重点关注大数据的安全性，以及该术语在云接口以加密形式存在不同类型数据的方向。此外，他们还提出了一种识别编码技术的方法，以加快对加密文本的搜索，从而增强大数据的安全性

+

'''''【终译版】'''''2014年3月，美国工程教育学会（American Society of Engineering Education）展示了大数据中的加密搜索和集群形成。麻省理工学院计算机科学和人工智能实验室的Gautam Siwach和UNH研究小组的Amir Esmailpour致力于解决大数据的挑战，他们研究了大数据的关键特征，如集群的形成及其相互关联。他们通过提供技术中的原始定义和实时示例，重点关注大数据的安全性，以及该术语在云接口以加密形式存在不同类型数据的方向。此外，他们还提出了一种识别编码技术的方法，以加快对加密文本的搜索，从而增强大数据的安全性。

In March 2012, The White House announced a national "Big Data Initiative" that consisted of six federal departments and agencies committing more than $200 million to big data research projects.<ref>{{cite web|title=Obama Administration Unveils "Big Data" Initiative:Announces $200 Million in New R&D Investments| url=https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf |url-status =live| archive-url =https://web.archive.org/web/20170121233309/https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf |via=[[NARA|National Archives]]|work=[[Office of Science and Technology Policy]]|archive-date=21 January 2017}}</ref>

第803行：第845行：

这一举措包括美国国家科学基金会”计算机探险”项目，该项目将在五年内向加州大学伯克利分校的 AMPLab 提供1000万美元的资助。美国国防部高级研究计划局也从美国国防部高级研究计划局和十几个工业赞助商那里获得了资金，并利用大数据来解决从预测交通堵塞到抗击癌症的一系列问题。

−

'''''【终译版】'''''该举措包括一个国家科学基金会“计算远征”超过1000万美元赠款超过五年的AcAMAB在加利福尼亚大学，伯克利。AMPLab还从DARPA和十几家行业赞助商那里获得资金，并利用大数据解决从预测交通拥堵到抗击癌症等一系列问题。

+

'''''【终译版】'''''该举措包括美国国家科学基金会的“计算远征”（Expeditions in Computing）项目，该项目将在五年内向加州大学伯克利分校的AMPLab提供1000万美元的资助。AMPLab还从DARPA和十几家行业赞助商那里获得资金，并利用大数据解决从预测交通拥堵到抗击癌症等一系列问题。

The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over five years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute,<ref>{{cite web|title=Secretary Chu Announces New Institute to Help Scientists Improve Massive Data Set Research on DOE Supercomputers |publisher=energy.gov |url=http://energy.gov/articles/secretary-chu-announces-new-institute-help-scientists-improve-massive-data-set-research-doe}}</ref> led by the Energy Department's [[Lawrence Berkeley National Laboratory]]. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the department's supercomputers.

第843行：第885行：

在滑铁卢大学斯特拉特福德校区加拿大开放数据体验(CODE)启发日上，与会者展示了如何使用数据可视化数据可以增加对大数据集的理解和吸引力，并向世界传达他们的故事。

−

'''''【终译版】'''''在滑铁卢大学斯特拉特福校园加拿大开放数据体验（代码）启示日，与会者演示了如何使用数据可视化可以增加对大数据集的理解和吸引力，并向世界传达他们的故事。

+

'''''【终译版】'''''在滑铁卢大学斯特拉特福校区举办的加拿大开放数据体验（CODE）启发日上，与会者演示了数据可视化如何能够增进对大数据集的理解、提升其吸引力，并向世界讲述其中的故事。

[[Computational social science|Computational social sciences]] – Anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences.<ref name=pigdata>{{cite journal|last=Reips|first=Ulf-Dietrich|author2=Matzat, Uwe |title=Mining "Big Data" using Big Data Services |journal=International Journal of Internet Science |year=2014|volume=1|issue=1|pages=1–8 | url=http://www.ijis.net/ijis9_1/ijis9_1_editorial_pre.html}}</ref> Often these APIs are provided for free.<ref name="pigdata" /> [[Tobias Preis]] et al. used [[Google Trends]] data to demonstrate that Internet users from countries with a higher per capita gross domestic products (GDPs) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviors and real-world economic indicators.<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE, Bishop SR | title = Quantifying the advantage of looking forward | journal = Scientific Reports | volume = 2 | pages = 350 | year = 2012 | pmid = 22482034 | pmc = 3320057 | doi = 10.1038/srep00350 | bibcode = 2012NatSR...2E.350P }}</ref><ref>{{cite news | url=https://www.newscientist.com/article/dn21678-online-searches-for-future-linked-to-economic-success.html | title=Online searches for future linked to economic success |first=Paul |last=Marks |work=New Scientist | date=5 April 2012 | access-date=9 April 2012}}</ref><ref>{{cite news | url=https://arstechnica.com/gadgets/news/2012/04/google-trends-reveals-clues-about-the-mentality-of-richer-nations.ars | title=Google Trends reveals clues about the mentality of richer nations |first=Casey |last=Johnston |work=Ars Technica | date=6 April 2012 | access-date=9 April 2012}}</ref> The authors of the study examined Google queries logs made by ratio of the volume of searches for the coming year (2011) to the volume of searches for the 
previous year (2009), which they call the "[[future orientation index]]".<ref>{{cite web | url = http://www.tobiaspreis.de/bigdata/future_orientation_index.pdf | title = Supplementary Information: The Future Orientation Index is available for download | author = Tobias Preis | date = 24 May 2012 | access-date = 24 May 2012}}</ref> They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP.

第851行：第893行：

计算社会科学——任何人都可以使用大数据持有者(如谷歌和 Twitter)提供的应用程序编程接口(api)进行社会和行为科学研究。这些 api 通常是免费提供的。托拜厄斯 · 普雷斯等。使用谷歌趋势数据证明，来自人均国内生产总值(gdp)较高国家的互联网用户更有可能搜索有关未来的信息，而不是有关过去的信息。研究结果表明，在线行为和现实世界的经济指标之间可能存在某种联系。这项研究的作者审查了谷歌的查询日志，这些日志是根据下一年(2011年)的搜索量与上一年(2009年)的搜索量之比制作的，他们称之为“未来方向索引”。他们将未来方向指数与每个国家的人均 GDP 进行了比较，发现谷歌用户询问更多关于未来的信息的国家有一个更高的 GDP 趋势。

−

'''''【终译版】'''''计算社会科学——任何人都可以使用谷歌和Twitter等大数据持有者提供的应用程序编程接口（API）进行社会和行为科学研究。这些API通常是免费提供的。Tobias Preis等人利用谷歌趋势数据证明，来自人均国内生产总值（GDP）较高国家的互联网用户搜索未来信息的可能性大于搜索过去信息的可能性。研究结果表明，在线行为与现实世界的经济指标之间可能存在联系。这项研究的作者根据下一年（2011年）的搜索量与上一年（2009年）的搜索量之比来检查谷歌的查询日志，他们称之为“未来定位指数”。他们将未来导向指数与每个国家的人均GDP进行了比较，发现谷歌用户查询更多关于未来的国家有更高GDP的强烈趋势。

+

'''''【终译版】'''''计算社会科学：任何人都可以使用谷歌和Twitter等大数据持有者提供的应用程序编程接口（API）进行社会和行为科学研究。这些API通常是免费提供的。Tobias Preis等人利用谷歌趋势数据证明，来自人均国内生产总值（GDP）较高国家的互联网用户搜索未来信息的可能性大于搜索过去信息的可能性。研究结果表明，在线行为与现实世界的经济指标之间可能存在联系。这项研究的作者分析了谷歌的查询日志，计算下一年（2011年）的搜索量与上一年（2009年）的搜索量之比，并将其称为“未来导向指数”。他们将未来导向指数与每个国家的人均GDP进行了比较，发现谷歌用户更多查询未来信息的国家，其GDP往往更高，且这一趋势十分明显。
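The future orientation index described above is a plain ratio, so it is easy to sketch. The example below is a minimal illustration, not the study's method: the country labels, search volumes, and GDP figures are made up, and the Pearson correlation is only one possible way to compare the index with per capita GDP (the text does not say which statistic the authors used).

```python
from math import sqrt


def future_orientation_index(next_year_volume: float, previous_year_volume: float) -> float:
    """Ratio of searches for the coming year to searches for the previous year."""
    if previous_year_volume <= 0:
        raise ValueError("previous_year_volume must be positive")
    return next_year_volume / previous_year_volume


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: (searches for "2011", searches for "2009", per capita GDP in USD).
COUNTRIES = {
    "A": (120, 100, 45_000),
    "B": (90, 100, 20_000),
    "C": (60, 100, 8_000),
    "D": (105, 100, 30_000),
}

if __name__ == "__main__":
    indices = [future_orientation_index(nxt, prev) for nxt, prev, _ in COUNTRIES.values()]
    gdps = [gdp for _, _, gdp in COUNTRIES.values()]
    for name, index in zip(COUNTRIES, indices):
        print(name, round(index, 2))
    print("correlation with per capita GDP:", round(pearson(indices, gdps), 3))
```

An index above 1 means users searched more for the coming year than for the previous one. In this toy data set the index rises with GDP by construction, so the correlation is strongly positive.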

[[Tobias Preis]] and his colleagues Helen Susannah Moat and [[H. Eugene Stanley]] introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends.<ref>{{cite journal | url =http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball | journal=Nature | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013| author-link=Philip Ball }}</ref> Their analysis of [[Google]] search volume for 98 terms of varying financial relevance, published in ''[[Scientific Reports]]'',<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE | title = Quantifying trading behavior in financial markets using Google Trends | journal = Scientific Reports | volume = 3 | pages = 1684 | year = 2013 | pmid = 23619126 | pmc = 3635219 | doi = 10.1038/srep01684 | bibcode = 2013NatSR...3E1684P }}</ref> suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.<ref>{{cite news | url=http://bits.blogs.nytimes.com/2013/04/26/google-search-terms-can-predict-stock-market-study-finds/ | title= Google Search Terms Can Predict Stock Market, Study Finds | author=Nick Bilton | work=[[The New York Times]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite magazine | url=http://business.time.com/2013/04/26/trouble-with-your-investment-portfolio-google-it/ | title=Trouble With Your Investment Portfolio? Google It! 
| author=Christopher Matthews | magazine=[[Time (magazine)|Time]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite journal | url= http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball |journal=[[Nature (journal)|Nature]] | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013}}</ref><ref>{{cite news | url=http://www.businessweek.com/articles/2013-04-25/big-data-researchers-turn-to-google-to-beat-the-markets | title='Big Data' Researchers Turn to Google to Beat the Markets | author=Bernhard Warner | work=[[Bloomberg Businessweek]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url=https://www.independent.co.uk/news/business/comment/hamish-mcrae/hamish-mcrae-need-a-valuable-handle-on-investor-sentiment-google-it-8590991.html | title=Hamish McRae: Need a valuable handle on investor sentiment? Google it | author=Hamish McRae | work=[[The Independent]] | date=28 April 2013 | access-date=9 August 2013 | location=London}}</ref><ref>{{cite web | url=http://www.ft.com/intl/cms/s/0/e5d959b8-acf2-11e2-b27f-00144feabdc0.html | title= Google search proves to be new word in stock market prediction | author=Richard Waters | work=[[Financial Times]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url =https://www.bbc.co.uk/news/science-environment-22293693 | title=Google searches predict market moves | author=Jason Palmer | work=[[BBC]] | date=25 April 2013 | access-date=9 August 2013}}</ref>
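The paragraph above reports that rising search volume for financially relevant terms tended to precede market losses. One hypothetical way to turn that observation into a rule is sketched below. This is an assumption for illustration and not necessarily the rule used in the cited study: compare each week's search volume with the average of the preceding `window` weeks, and signal "sell" when volume has risen and "buy" otherwise. The window length and the sample data are invented.

```python
def search_volume_signals(volumes: list[float], window: int = 3) -> list[str]:
    """Return one 'sell' or 'buy' signal per week, starting at index `window`.

    A week is flagged 'sell' when its search volume exceeds the mean of the
    previous `window` weeks, and 'buy' otherwise.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    signals = []
    for week in range(window, len(volumes)):
        baseline = sum(volumes[week - window:week]) / window
        signals.append("sell" if volumes[week] > baseline else "buy")
    return signals


if __name__ == "__main__":
    # Hypothetical weekly search volumes for a single finance-related term.
    weekly_volumes = [10, 10, 10, 20, 5]
    print(search_volume_signals(weekly_volumes))
```

A real backtest would also need price data, transaction costs, and an out-of-sample check, none of which are covered here.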

第875行：第917行：

现代海量数据集算法研讨会(MMDS)聚集了计算机科学家、统计学家、数学家和数据分析从业者，讨论大数据的算法挑战。斯坦福大学。“ MMDS。现代海量数据集算法研讨会”。对于大数据，这样的量级概念是相对的。正如文中所说: “如果说过去的数据有什么指导意义的话，那么今天的大数据在不久的将来很可能不会被认为是这样的。”

−

'''''【终译版】'''''现代海量数据集（MMD）算法研讨会汇集了计算机科学家、统计学家、数学家和数据分析从业者，讨论大数据的算法挑战。斯坦福。“MMDS.现代海量数据集算法研讨会”。关于大数据，这样的量级概念是相对的。正如它所说，“如果过去有任何指导意义，那么今天的大数据在不久的将来很可能不会被认为是这样。”

+

'''''【终译版】'''''现代海量数据集算法研讨会（MMDS）汇集了计算机科学家、统计学家、数学家和数据分析从业者，讨论大数据的算法挑战。关于大数据，这样的量级概念是相对的。正如其所言：“如果说过去的数据有什么指导意义的话，那么今天的大数据在不久的将来很可能不会被认为是这样的。”

===Sampling big data===

L（吕奥博）

35

个编辑
