Changes

Big data (view source)

Revision as of 10:36, 25 January 2022 (Tue)

1,767 bytes added, 10:36, 25 January 2022 (Tue)

V0.1_20220125_layout+images

Line 1: Line 1: −

~~此词条暂由彩云小译翻译，翻译字数共8746，未经人工整理和审校，带来阅读不便，请见谅。~~

+

This article is about large collections of data. For the band, see Big Data (band). For the practice of buying and selling of personal and consumer data, see Surveillance capitalism.[[File:Hilbert InfoGrowth.png|thumb|right|400px|Non-linear growth of digital global information-storage capacity and the waning of analog storage<ref>{{cite journal|url= http://www.martinhilbert.net/WorldInfoCapacity.html|title= The World's Technological Capacity to Store, Communicate, and Compute Information|volume= 332|issue= 6025|pages= 60–65|journal=Science|access-date= 13 April 2016|bibcode= 2011Sci...332...60H|last1= Hilbert|first1= Martin|last2= López|first2= Priscila|year= 2011|doi= 10.1126/science.1200970|pmid= 21310967|s2cid= 206531385}}</ref>全球数字信息存储容量的非线性增长和模拟存储的减弱|链接=Special:FilePath/Hilbert_InfoGrowth.png]]

−

{{~~Short description~~|~~Information assets characterized by high~~ volume~~, velocity, and variety~~}}

+

'''Big data''' is a field that treats ways to analyze, systematically extract information from, or otherwise deal with [[data set]]s that are too large or complex to be dealt with by traditional [[data processing|data-processing]] [[application software]]. Data with many entries (rows) offer greater [[statistical power]], while data with higher complexity (more attributes or columns) may lead to a higher [[false discovery rate]].<ref>{{Cite journal|last=Breur|first=Tom|date=July 2016|title=Statistical Power Analysis and the contemporary "crisis" in social sciences|journal=Journal of Marketing Analytics |publisher=[[Palgrave Macmillan]]|location=London, England|volume=4 |issue=2–3 |pages=61–65 |doi=10.1057/s41270-016-0001-3 |issn=2050-3318|doi-access=free}}</ref> Big data analysis challenges include [[Automatic identification and data capture|capturing data]], [[Computer data storage|data storage]], [[data analysis]], search, [[Data sharing|sharing]], [[Data transmission|transfer]], [[Data visualization|visualization]], [[Query language|querying]], updating, [[information privacy]], and data source. Big data was originally associated with three key concepts: ''volume'', ''variety'', and ''velocity''.<ref name=":0" /> The analysis of big data presents challenges in sampling, where previously only observations and sampling were possible. Big data therefore often includes data whose size exceeds the capacity of traditional software to process within an acceptable time while still yielding ''value''.

−

~~{{About~~|~~large collections of~~ data|~~the band~~|~~Big~~ Data ~~(band)~~|~~the practice~~ of ~~buying~~ and ~~selling~~ of ~~personal~~ and ~~consumer data|Surveillance capitalism}}~~

−

~~{{Use dmy dates|date=January 2020}}~~

+

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. The analysis of big data presents challenges in sampling, where previously only observations and sampling were possible. Big data therefore often includes data whose size exceeds the capacity of traditional software to process within an acceptable time while still yielding value.

−

~~[[File:Hilbert InfoGrowth.png|thumb|right|400px|Non-linear growth of digital global~~ information-storage ~~capacity~~ and ~~the waning of analog storage<ref>{{cite journal|url= http~~:~~//www~~.~~martinhilbert.net/WorldInfoCapacity.html|title=~~ The ~~World's Technological Capacity to Store~~, ~~Communicate~~, and Compute Information|volume= 332|issue= 6025|pages= 60–65|journal=Science|access-date= 13 April 2016|bibcode= 2011Sci...332...60H|last1= Hilbert|first1= Martin|last2= López|first2= Priscila|year= 2011|doi= 10.1126/science.~~1200970|pmid= 21310967|s2cid= 206531385}}</ref>]]~~

+

大数据是一个研究如何分析过于庞大或复杂、传统数据处理应用软件无法处理的数据集，并系统地从中提取信息或以其他方式加以处理的领域。条目(行)较多的数据具有更高的统计功效，而复杂性更高(属性或列更多)的数据可能导致更高的错误发现率。大数据分析面临的挑战包括数据采集、数据存储、数据分析、搜索、共享、传输、可视化、查询、更新、信息隐私和数据来源。大数据最初与三个关键概念相关: 数量、多样性和速度。大数据的分析给抽样带来了挑战，而以往只能依靠观测和抽样。因此，大数据通常包含这样的数据: 其规模超出了传统软件在可接受的时间内进行处理并产生价值的能力。
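
The link between more attributes and more spurious findings can be illustrated with simple arithmetic. The sketch below is not from the article: it assumes each attribute is tested independently at a significance level of 0.05 and that none of them has a real effect, and it computes the chance of at least one false positive (the family-wise error rate, a simpler relative of the false discovery rate). The function name `prob_any_false_positive` is hypothetical.

```python
def prob_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent null tests.

    Assumes every test is independent and every null hypothesis is true,
    which is an idealization chosen only for illustration.
    """
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    # More columns tested means more chances for a spurious "discovery".
    for num_columns in (1, 10, 100):
        chance = prob_any_false_positive(num_columns)
        print(f"{num_columns:>3} attributes tested: {chance:.3f}")
```

With these assumptions the chance rises from 5% for one attribute to about 40% for ten, which is why analyses of wide data sets usually apply a multiple-comparison correction.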

−

~~thumb|right|400px|Non-linear growth of digital global information-storage capacity and the waning of analog storage~~

−

~~拇指 | 右 | 400px | 全球数字信息存储容量的非线性增长和模拟存储的减弱~~

−

'''Big data''' is a field that treats ways to analyze, systematically extract information from, or otherwise deal with [[data set]]s that are too large or complex to be dealt with by traditional [[data processing|data-processing]] [[application software]]. Data with many fields (columns) offer greater [[statistical power]], while data with higher complexity (more attributes or columns) may lead to a higher [[false discovery rate]].<ref>{{Cite journal|last=Breur|first=Tom|date=July 2016|title=Statistical Power Analysis and the contemporary "crisis" in social sciences|journal=Journal of Marketing Analytics |publisher=[[Palgrave Macmillan]]|location=London, England|volume=4 |issue=2–3 |pages=61–65 |doi=10.1057/s41270-016-0001-3 |issn=2050-3318|doi-access=free}}</ref> Big data analysis challenges include [[Automatic identification and data capture|capturing data]], [[Computer data storage|data storage]], [[data analysis]], search, [[Data sharing|sharing]], [[Data transmission|transfer]], [[Data visualization|visualization]], [[Query language|querying]], updating, [[information privacy]], and data source. Big data was originally associated with three key concepts: ''volume'', ''variety'', and ''velocity''.<ref name=":0" /> The analysis of big data presents challenges in sampling, and thus previously allowing for only observations and sampling. Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and ''value''.

−

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many fields (columns) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. The analysis of big data presents challenges in sampling, and thus previously allowing for only observations and sampling. Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.

+

'''''【终译版】'''''。

−

大数据是一个研究如何分析、系统地从中提取信息或以其他方式处理传统数据处理应用软件无法处理的过于庞大或复杂的数据集的领域。具有多个字段(列)的数据提供了更强的统计能力，而具有更高复杂性(更多属性或列)的数据可能导致更高的错误发现率。大数据分析面临的挑战包括捕获数据、数据存储、数据分析、搜索、共享、传输、可视化、查询、更新、信息隐私和数据源。大数据最初与三个关键概念有关: 数量、多样性和速度。大数据的分析在取样方面提出了挑战，因此以前只允许观测和取样。因此，大数据通常包含的数据大小超过了传统软件在可接受的时间和价值内处理的能力。

+

。

Current usage of the term ''big data'' tends to refer to the use of [[predictive analytics]], [[user behavior analytics]], or certain other advanced data analytics methods that extract [[Data valuation|value]] from big data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem."<ref>{{cite journal |last1=boyd |first1=dana |last2=Crawford |first2=Kate |title=Six Provocations for Big Data |journal=Social Science Research Network: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society |date=21 September 2011 |doi= 10.2139/ssrn.1926431|s2cid=148610111 |url=http://osf.io/nrjhn/ }}</ref>

Line 25: Line 19:

目前，“大数据”一词往往指预测分析、用户行为分析或其他从大数据中提取价值的高级数据分析方法的使用，而很少指特定规模的数据集。“毫无疑问，现在可用的数据量确实很大，但这并不是这个新数据生态系统最相关的特征。”对数据集的分析可以发现新的关联，从而“发现商业趋势、预防疾病、打击犯罪等”。科学家、企业管理人员、医疗从业人员、广告业者和政府在互联网搜索、金融技术、医疗保健分析、地理信息系统、城市信息学和经济信息学等领域经常遇到大型数据集带来的困难。科学家在电子科学(e-Science)工作中也遇到了限制，涉及气象学、基因组学、连接组学、复杂的物理模拟、生物学和环境研究。

+

'''''【终译版】'''''。

+

。

The size and number of available data sets have grown rapidly as data is collected by devices such as [[mobile device]]s, cheap and numerous information-sensing [[Internet of things]] devices, aerial ([[remote sensing]]) equipment, software logs, [[Digital camera|cameras]], microphones, [[radio-frequency identification]] (RFID) readers and [[wireless sensor networks]].<ref>{{cite web |author= Hellerstein, Joe |title= Parallel Programming in the Age of Big Data |date= 9 November 2008 |work= Gigaom Blog |url= http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/}}</ref><ref>{{cite book |first1= Toby |last1= Segaran |first2= Jeff |last2= Hammerbacher |title= Beautiful Data: The Stories Behind Elegant Data Solutions |url= https://books.google.com/books?id=zxNglqU1FKgC |year= 2009 |publisher= O'Reilly Media |isbn= 978-0-596-15711-1 |page= 257}}</ref> The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;<ref name="martinhilbert.net">{{cite journal | vauthors = Hilbert M, López P | title = The world's technological capacity to store, communicate, and compute information | journal = Science | volume = 332 | issue = 6025 | pages = 60–5 | date = April 2011 | pmid = 21310967 | doi = 10.1126/science.1200970 | url = http://www.uvm.edu/pdodds/files/papers/others/2011/hilbert2011a.pdf | bibcode = 2011Sci...332...60H | s2cid = 206531385 }}</ref> {{As of|2012|lc=on}}, every day 2.5 [[exabyte]]s (2.5×10<sup>18</sup> bytes) of data are generated.<ref>{{cite web|url= http://www.ibm.com/big-data/us/en/ |title= IBM What is big data? – Bringing big data to the enterprise |publisher= ibm.com |access-date= 26 August 2013}}</ref> According to an [[International Data Corporation|IDC]] report, the global data volume was predicted to grow exponentially from 4.4 [[zettabyte]]s to 44 zettabytes between 2013 and 2020. 
By 2025, IDC predicts there will be 163 zettabytes of data.<ref>{{Cite web| url=https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf| title=Data Age 2025: The Evolution of Data to Life-Critical|last1=Reinsel|first1=David|last2=Gantz|first2=John|date=13 April 2017|website=seagate.com|publisher=[[International Data Corporation]]|location=Framingham, MA, US|access-date=2 November 2017|last3=Rydning|first3=John}}</ref> One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.<ref>Oracle and FSN, [http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity "Mastering Big Data: CFO Strategies to Transform Insight into Opportunity"] {{Webarchive|url=https://web.archive.org/web/20130804062518/http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity |date=4 August 2013 }}, December 2012</ref>

Line 31: Line 30:

随着移动设备、廉价且数量众多的信息感知物联网设备、航空(遥感)设备、软件日志、相机、麦克风、射频识别(RFID)读取器和无线传感器网络等设备收集数据，可用数据集的规模和数量迅速增长。自20世纪80年代以来，世界人均存储信息的技术容量大约每40个月翻一番; 截至2012年，每天产生2.5艾字节(2.5 × 10^18 字节)的数据。根据 IDC 的报告，全球数据量预计将在2013年到2020年间呈指数增长，从4.4泽字节增长到44泽字节。IDC 预测，到2025年，将有163泽字节的数据。对于大型企业来说，一个问题是确定谁应该负责影响整个组织的大数据计划。(Oracle 和 FSN，“Mastering Big Data: CFO Strategies to Transform Insight into Opportunity”，2012年12月)
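
The growth figures above can be compared on a common yearly scale. The following is a minimal sketch, assuming constant exponential growth over each period (an idealization the sources do not necessarily make); the helper names are hypothetical.

```python
def annual_factor_from_doubling(months_to_double: float) -> float:
    """Yearly growth factor implied by a fixed doubling time in months."""
    return 2 ** (12 / months_to_double)


def annual_factor_between(start: float, end: float, years: float) -> float:
    """Constant yearly growth factor that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years)


# Per-capita storage capacity doubling every 40 months.
storage_factor = annual_factor_from_doubling(40)

# IDC figure: 4.4 ZB in 2013 growing to 44 ZB in 2020.
idc_factor = annual_factor_between(4.4, 44, 2020 - 2013)

if __name__ == "__main__":
    print(f"Doubling every 40 months: x{storage_factor:.2f} per year")
    print(f"4.4 ZB to 44 ZB in 7 years: x{idc_factor:.2f} per year")
```

Under this assumption, doubling every 40 months corresponds to roughly 23% growth per year, while a tenfold increase over seven years corresponds to roughly 39% per year.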

+

'''''【终译版】'''''。

+

。

[[Relational database management system]]s and desktop statistical software packages used to visualize data often have difficulty processing and analyzing big data. The processing and analysis of big data may require "massively parallel software running on tens, hundreds, or even thousands of servers".<ref>{{cite web |author= Jacobs, A. |title= The Pathologies of Big Data |date= 6 July 2009 |work= ACMQueue |url= http://queue.acm.org/detail.cfm?id=1563874}}</ref> What qualifies as "big data" varies depending on the capabilities of those analyzing it and their tools. Furthermore, expanding capabilities make big data a moving target. "For some organizations, facing hundreds of [[gigabyte]]s of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."<ref>{{cite journal|last1=Magoulas|first1=Roger|last2=Lorica|first2=Ben|date=February 2009|title=Introduction to Big Data|url=https://academics.uccs.edu/~ooluwada/courses/datamining/ExtraReading/BigData|journal=Release 2.0|location=Sebastopol CA|publisher=O'Reilly Media|issue=11}}</ref>

Line 37: Line 41:

用于数据可视化的关系数据库管理系统和桌面统计软件包通常难以处理和分析大数据。大数据的处理和分析可能需要“运行在数十、数百甚至数千台服务器上的大规模并行软件”。什么算作“大数据”取决于分析者及其工具的能力。此外，不断扩展的能力使大数据成为一个移动的目标。“对于一些组织来说，第一次面对数百吉字节(GB)的数据就可能需要重新考虑数据管理方案。对于其他组织来说，可能要到数十或数百太字节(TB)时，数据规模才会成为重要的考虑因素。”
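
The structure of such parallel processing can be sketched on a single machine. The example below is an illustration only, not the software the quotation refers to: it splits the input into chunks, counts words in each chunk independently (the "map" step), and merges the partial counts (the "reduce" step). It uses a thread pool from the Python standard library as a stand-in for a cluster of servers; the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk: str) -> Counter:
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())


def parallel_word_count(chunks, max_workers: int = 4) -> Counter:
    """Reduce step: merge the per-chunk counts into a single total."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


if __name__ == "__main__":
    chunks = ["big data is big", "data about data"]
    print(parallel_word_count(chunks).most_common())
```

Because each chunk is processed without reference to the others, the same pattern can in principle be spread across many machines, which is the property that large-scale frameworks rely on.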

+

'''''【终译版】'''''。

+

。

==Definition==

Line 48: Line 57:

== 定义 ==

大数据这个术语从20世纪90年代就开始使用，有些人认为是约翰·马西(John Mashey)推广了这个术语。大数据通常包括规模超出常用软件工具在可容忍的时间内捕获、管理和处理能力的数据集。大数据理念涵盖非结构化、半结构化和结构化数据，但主要关注非结构化数据。大数据的“规模”是一个不断变化的目标，从几十太字节(TB)到许多泽字节(ZB)不等。大数据需要一系列技术和新的集成形式，以便从多样、复杂且规模庞大的数据集中揭示洞见。

+

'''''【终译版】'''''。

+

。

"Variety", "veracity", and various other "Vs" are added by some organizations to describe it, a revision challenged by some industry authorities.<ref>{{cite magazine|last=Grimes|first=Seth|title=Big Data: Avoid 'Wanna V' Confusion| url=http://www.informationweek.com/big-data/big-data-analytics/big-data-avoid-wanna-v-confusion/d/d-id/1111077|magazine=[[InformationWeek]]|access-date = 5 January 2016}}</ref> The Vs of big data were often referred to as the "three Vs", "four Vs", and "five Vs". They represented the qualities of big data in volume, variety, velocity, [[veracity (data)|veracity]], and value.<ref name=":0">{{Cite web|date=2016-09-17|title=The 5 V's of big data|url=https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/|access-date=2021-01-20|website=Watson Health Perspectives|language=en-US}}</ref> Variability is often included as an additional quality of big data.

Line 54: Line 68:

一些组织添加了“多样性”、“准确性”和其他各种“ v”来描述它，这个修订受到了一些行业权威的质疑。大数据 Vs 通常被称为“三个 Vs”、“四个 Vs”和“五个 Vs”。它们在数量、多样性、速度、准确性和价值等方面代表了大数据的特性。可变性通常作为大数据的附加质量被包括在内。

+

'''''【终译版】'''''。

+

。

A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by [[Relational database|Codd's relational model]]."<ref>{{Cite book|last=Fox|first=Charles|date=25 March 2018|title=Data Science for Transport| url=https://www.springer.com/us/book/9783319729527|publisher=Springer|isbn=9783319729527|series=Springer Textbooks in Earth Sciences, Geography and Environment}}</ref>

Line 60: Line 79:

2018年的一个定义指出“大数据是需要并行计算工具来处理数据的地方”，并指出，“这代表了通过并行编程理论使用的计算机科学发生了一个明显而清晰的变化，以及 Codd 的关系模型数据库所做出的一些保证和能力的丧失。”

+

'''''【终译版】'''''。

+

。

In a comparative study of big datasets, [[Rob Kitchin|Kitchin]] and McArdle found that none of the commonly considered characteristics of big data appear consistently across all of the analyzed cases.<ref>{{cite journal | last1 = Kitchin | first1 = Rob | last2 = McArdle | first2 = Gavin | year = 2016 | title = What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets | journal = Big Data & Society | volume = 3 | pages = 1–10 | doi = 10.1177/2053951716631130 | s2cid = 55539845 }}</ref> For this reason, other studies identified the redefinition of power dynamics in knowledge discovery as the defining trait.<ref>{{cite journal | last1 = Balazka | first1 = Dominik | last2 = Rodighiero | first2 = Dario | year = 2020 | title = Big Data and the Little Big Bang: An Epistemological (R)evolution | journal = Frontiers in Big Data | volume = 3 | page = 31 | doi = 10.3389/fdata.2020.00031 | pmid = 33693404 | pmc = 7931920 | hdl = 1721.1/128865 | hdl-access = free | doi-access = free }}</ref> Instead of focusing on intrinsic characteristics of big data, this alternative perspective pushes forward a relational understanding of the object claiming that what matters is the way in which data is collected, stored, made available and analyzed.

Line 66: Line 90:

在对大数据集的比较研究中，Kitchin 和 McArdle 发现，在所有分析的案例中，大数据通常被认为的特征没有一个是一致的。因此，其他研究将知识发现中权力动力学的重新定义确定为知识发现的定义特征。这种不同的视角不是关注大数据的内在特征，而是推动了对对象的关系理解，声称重要的是数据收集、存储、提供和分析的方式。

+

'''''【终译版】'''''。

+

。

=== Big data vs. business intelligence ===

Line 80: Line 109:

* 大数据使用数学分析、优化、归纳统计以及非线性系统辨识中的概念(Billings S.A.，“Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains”，Wiley，2013)，从信息密度较低的大量数据中推断规律(回归、非线性关系和因果效应)，以揭示关系和依赖性，或对结果和行为进行预测。

−

~~==Characteristics==~~

−

~~[[File: Big Data.png|thumb|Shows the growth of big data's primary characteristics of volume, velocity, and variety]]~~

−

~~Big data can be described by the following characteristics:~~

−

~~thumb|Shows the growth of big data~~'~~s primary characteristics of volume, velocity, and variety~~

+

'''''【终译版】'''''。

−

~~Big data can be described by the following characteristics:~~

−

~~显示大数据在数量、速度和变化方面的主要特征大数据可以用以下特征来描述:~~

+

。

−

~~; Volume~~: ~~The quantity of generated and stored data~~. ~~The size~~ of ~~the data determines the value and potential insight, and whether it can be considered~~ big data ~~or not. The size~~ of ~~big data is usually larger than terabytes~~ and ~~petabytes~~.~~<ref>{{cite journal~~ |~~last1=Sagiroglu |first1=Seref |title~~=Big data: ~~A review |journal=2013 International Conference on Collaboration Technologies and Systems (CTS) |date=2013 |pages=42–47 |doi=10.1109/CTS.2013.6567202|isbn=978-1-4673-6404-1 |s2cid=5724608 }}</ref>~~

+

==Characteristics==

+

[[File: Big Data.png|thumb|Shows the growth of big data's primary characteristics of volume, velocity, and variety. 显示大数据在数量、速度和多样性方面主要特征的增长。|链接=Special:FilePath/Big_Data.png]]

+

Big data can be described by the following characteristics:

−

; Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not. The size of big data is usually larger than terabytes and petabytes.

+

; Volume: The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all. The size of big data is usually on the order of terabytes and petabytes or larger.<ref>{{cite journal |last1=Sagiroglu |first1=Seref |title=Big data: A review |journal=2013 International Conference on Collaboration Technologies and Systems (CTS) |date=2013 |pages=42–47 |doi=10.1109/CTS.2013.6567202|isbn=978-1-4673-6404-1 |s2cid=5724608 }}</ref>

数量: 生成和存储数据的数量。数据的大小决定了数据的价值和潜在洞察力，以及它是否可以被视为大数据。大数据的大小通常大于 tb 和 pb。

−

; Variety: The type and nature of the data. The earlier technologies like RDBMSs were capable to handle structured data efficiently and effectively. However, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies. The big data technologies evolved with the prime intention to capture, store, and process the semi-structured and unstructured (variety) data generated with high speed (velocity), and huge in size (volume). Later, these tools and technologies were explored and used for handling structured data also but preferable for storage. Eventually, the processing of structured data was still kept as optional, either using big data or traditional RDBMSs. This helps in analyzing data towards effective usage of the hidden insights exposed from the data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, video; plus it completes missing pieces through [[data fusion]].

+

; Variety: The type and nature of the data. Earlier technologies such as RDBMSs could handle structured data efficiently and effectively. However, the shift in type and nature from structured to semi-structured or unstructured data challenged the existing tools and technologies. Big data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured (variety) data generated at high speed (velocity) and in huge size (volume). Later, these tools and technologies were also explored and used for handling structured data, though preferably for storage. The processing of structured data remained optional, using either big data tools or traditional RDBMSs. This helps in analyzing data to make effective use of the hidden insights in data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, and video; it also completes missing pieces through [[data fusion]].

−

~~; Variety~~: The type and nature of the data. The earlier technologies like RDBMSs were capable to handle structured data efficiently and effectively. However, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies. The big data technologies evolved with the prime intention to capture, store, and process the semi-structured and unstructured (variety) data generated with high speed (velocity), and huge in size (volume). Later, these tools and technologies were explored and used for handling structured data also but preferable for storage. Eventually, the processing of structured data was still kept as optional, either using big data or traditional RDBMSs. This helps in analyzing data towards effective usage of the hidden insights exposed from the data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion.

品种: 数据的类型和性质。早期的技术(如 rdbms)能够有效地处理结构化数据。然而，从结构化到半结构化或非结构化的类型和性质的变化对现有的工具和技术提出了挑战。大数据技术的发展主要是为了获取、存储和处理高速、大容量的半结构化和非结构化(变化)数据。后来，这些工具和技术也被用于处理结构化数据，但更适合存储。最终，结构化数据的处理仍然是可选的，要么使用大数据，要么使用传统的 rdbms。这有助于分析数据，从而有效利用通过社交媒体、日志文件、传感器等收集的数据中暴露出来的隐藏洞察力。大数据从文本、图像、音频、视频中提取，并通过数据融合完成缺失的部分。

−

; Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to [[small data]], big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.<ref>{{cite journal |last1=Kitchin |first1=Rob |last2=McArdle |first2=Gavin |title=What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets |journal=Big Data & Society |date=17 February 2016 |volume=3 |issue=1 |pages=205395171663113 |doi=10.1177/2053951716631130|doi-access=free }}</ref>

+

; Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to [[small data]], big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.<ref>{{cite journal |last1=Kitchin |first1=Rob |last2=McArdle |first2=Gavin |title=What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets |journal=Big Data & Society |date=17 February 2016 |volume=3 |issue=1 |pages=205395171663113 |doi=10.1177/2053951716631130|doi-access=free }}</ref>

−

~~; Velocity~~: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.

速度: 生成和处理数据以满足增长和发展道路上的需求和挑战的速度。大数据通常是实时的。与小数据相比，大数据的产生更加持续。与大数据相关的两种速度是生成频率和处理、记录和发布频率。

−

;Veracity: The truthfulness or reliability of the data, which refers to the data quality and the data value.<ref>{{Cite journal|last1=Onay|first1=Ceylan|last2=Öztürk|first2=Elif|date=2018|title=A review of credit scoring research in the age of Big Data|journal=Journal of Financial Regulation and Compliance|volume=26|issue=3|pages=382–405|doi=10.1108/JFRC-06-2017-0054|s2cid=158895306}}</ref> Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it. The [[data quality]] of captured data can vary greatly, affecting an accurate analysis.<ref>[https://web.archive.org/web/20180731105912/https://spotlessdata.com/blog/big-datas-fourth-v Big Data's Fourth V]</ref>

+

;Veracity: The truthfulness or reliability of the data, which refers to the data quality and the data value.<ref>{{Cite journal|last1=Onay|first1=Ceylan|last2=Öztürk|first2=Elif|date=2018|title=A review of credit scoring research in the age of Big Data|journal=Journal of Financial Regulation and Compliance|volume=26|issue=3|pages=382–405|doi=10.1108/JFRC-06-2017-0054|s2cid=158895306}}</ref> Big data must not only be large in size, but must also be reliable in order to achieve value in its analysis. The [[data quality]] of captured data can vary greatly, affecting the accuracy of the analysis.<ref>[https://web.archive.org/web/20180731105912/https://spotlessdata.com/blog/big-datas-fourth-v Big Data's Fourth V]</ref>

−

~~;Veracity~~: The truthfulness or reliability of the data, which refers to the data quality and the data value. Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it. The data quality of captured data can vary greatly, affecting an accurate analysis.Big Data's Fourth V

准确性: 数据的真实性或可靠性，是指数据的质量和数据的价值。大数据不仅要大，而且要可靠，才能在分析中获得价值。捕获的数据的数据质量会有很大的差异，影响准确的分析

−

; Value: The worth in information that can be achieved by the processing and analysis of large datasets. Value also can be measured by an assessment of the other qualities of big data.<ref>{{Cite web|title=Measuring the Business Value of Big Data {{!}} IBM Big Data & Analytics Hub|url=https://www.ibmbigdatahub.com/blog/measuring-business-value-big-data|access-date=2021-01-20|website=www.ibmbigdatahub.com}}</ref> Value may also represent the profitability of information that is retrieved from the analysis of big data.

+

; Value: The worth in information that can be achieved by the processing and analysis of large datasets. Value can also be measured by an assessment of the other qualities of big data.<ref>{{Cite web|title=Measuring the Business Value of Big Data {{!}} IBM Big Data & Analytics Hub|url=https://www.ibmbigdatahub.com/blog/measuring-business-value-big-data|access-date=2021-01-20|website=www.ibmbigdatahub.com}}</ref> Value may also represent the profitability of information that is retrieved from the analysis of big data.

−

~~; Value~~: The worth in information that can be achieved by the processing and analysis of large datasets. Value also can be measured by an assessment of the other qualities of big data. Value may also represent the profitability of information that is retrieved from the analysis of big data.

价值: 通过处理和分析大型数据集所能获得的信息价值。价值也可以通过评估大数据的其他特性来衡量。价值还可以表示从大数据分析中检索到的信息的利润率。

−

; Variability: The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data.

+

; Variability: The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data.

−

~~; Variability~~: The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data.

可变性: 大数据的格式、结构或来源不断变化的特征。大数据可以包括结构化、非结构化，或结构化和非结构化数据的组合。大数据分析可以整合来自多个来源的原始数据。对原始数据的处理也可能涉及到非结构化数据到结构化数据的转换。
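
The transformation of unstructured data to structured data mentioned under Variability can be sketched as follows. This is a minimal illustration, not a method described in the article: the log format, the pattern `LOG_PATTERN`, and the function `parse_log_line` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical log format, chosen only for illustration:
# "<timestamp> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)")


def parse_log_line(line: str) -> Optional[dict]:
    """Turn one raw text line into a structured record, or None if it does not fit."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


raw_lines = [
    "2022-01-25T10:36:00Z INFO user=alice action=login",
    "not a log line",
    "2022-01-25T10:36:05Z ERROR disk quota exceeded",
]

# Keep only the lines that could be given a structure.
records = [rec for rec in map(parse_log_line, raw_lines) if rec is not None]

if __name__ == "__main__":
    for rec in records:
        print(rec)
```

Lines that do not match the assumed format are dropped here; a real pipeline would more likely route them to a separate store for inspection, since the veracity of the result depends on knowing what was discarded.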

Line 131: Line 148:

大数据的其他可能特征是:

−

;Exhaustive: Whether the entire system (i.e., <math display="inline">n</math>=all) is captured or recorded or not. Big data may or may not include all the available data from sources.

+

;Exhaustive: Whether the entire system (i.e., <math display="inline">n</math>=all) is captured or recorded. Big data may or may not include all the available data from sources.

−

~~;Exhaustive~~: Whether the entire system (i.e., n=all) is captured or recorded or not. Big data may or may not include all the available data from sources.

详尽: 是否捕获或记录整个系统(即 n = all)。大数据可能包括也可能不包括所有来源的可用数据。

; Fine-grained and uniquely lexical: Respectively, the proportion of specific data collected for each element, and whether the element and its characteristics are properly indexed or identified.

细粒度和唯一词法: 分别指收集的每个元素的特定数据与每个元素的比例，以及元素及其特征是否正确编制了索引或标识。

; Relational: If the data collected contains common fields that would enable a conjoining, or meta-analysis, of different data sets.
: 如果收集的数据包含公共字段，则可以对不同的数据集进行连接或元分析。

; Extensional: If new fields in each element of the data collected can be added or changed easily.

外延: 如果可以轻松地添加或更改收集的数据的每个元素中的新字段。

; Scalability: If the size of the big data storage system can expand rapidly.

可扩展性: 如果大数据存储系统的规模能够迅速扩大。
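The relational characteristic listed above, common fields that allow different data sets to be conjoined, can be illustrated with a minimal sketch. The data sets `customers` and `orders`, the shared field `customer_id`, and the helper `join_on` are all hypothetical names invented for this example; production systems would normally rely on a database or data-frame library instead.

```python
from collections import defaultdict

def join_on(left, right, key):
    """Inner-join two lists of dict records on a shared field name."""
    # Index the right-hand data set by the common field for fast lookup.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        # Rows without a partner in the other data set are dropped.
        for match in index.get(row[key], []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined

# Hypothetical data sets that share the field "customer_id".
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
]
orders = [
    {"customer_id": 1, "amount": 30},
    {"customer_id": 1, "amount": 12},
    {"customer_id": 3, "amount": 99},
]
combined = join_on(customers, orders, "customer_id")
```

Only records whose `customer_id` appears in both data sets survive the join, which is why the presence of a common field matters for any combined analysis.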


==Architecture==


海量数据存储库以多种形式存在，通常由有特殊需求的企业构建。从20世纪90年代开始，商业供应商一直提供大数据的并行数据库管理系统。多年来，温特公司发布了最大的数据库报告。


[[Teradata]] Corporation in 1984 marketed the parallel processing [[DBC 1012]] system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991 so the definition of big data continuously evolves. Teradata installed the first petabyte class RDBMS based system in 2007. {{as of|2017}}, there are a few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added unstructured data types including [[XML]], [[JSON]], and Avro.


天睿在1984年推出了并行处理 DBC 1012系统。1992年，Teradata 系统首次存储和分析了1tb 的数据。1991年硬盘驱动器是2.5 GB，所以大数据的定义在不断发展。Teradata 在2007年安装了第一个 petabyte 类 RDBMS 为基础的系统。，安装了几十个 petabyte 类 Teradata 关系数据库，其中最大的超过50pb。直到2008年，系统都是100% 的结构化关系数据。从那时起，Teradata 增加了包括 XML、 JSON 和 Avro 在内的非结构化数据类型。


In 2000, Seisint Inc. (now [[LexisNexis Risk Solutions]]) developed a [[C++]]-based distributed platform for data processing and querying known as the [[HPCC Systems]] platform. This system automatically partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across multiple commodity servers. Users can write data processing pipelines and queries in a declarative dataflow programming language called ECL. Data analysts working in ECL are not required to define data schemas upfront and can rather focus on the particular problem at hand, reshaping data in the best possible manner as they develop the solution. In 2004, LexisNexis acquired Seisint Inc.<ref>{{cite news| url=https://www.washingtonpost.com/wp-dyn/articles/A50577-2004Jul14.html|title=LexisNexis To Buy Seisint For $775 Million|newspaper=[[The Washington Post]]|access-date=15 July 2004}}</ref> and their high-speed parallel processing platform and successfully used this platform to integrate the data systems of Choicepoint Inc. when they acquired that company in 2008.<ref>[https://www.washingtonpost.com/wp-dyn/content/article/2008/02/21/AR2008022100809.html The Washington Post]</ref> In 2011, the HPCC systems platform was open-sourced under the Apache v2.0 License.


2000年，Seisint 公司(现在的 LexisNexis 风险解决方案)开发了一个基于 c + + 的分布式数据处理和查询平台，称为 HPCC 系统平台。这个系统自动分区、分发、存储和交付结构化、半结构化和跨多个商品服务器的非结构化数据。用户可以使用称为 ECL 的声明性数据流编程语言编写数据处理管道和查询。在 ECL 中工作的数据分析师不需要事先定义数据模式，而是可以专注于手头的特定问题，在开发解决方案时以尽可能好的方式重新构造数据。2004年，LexisNexis 收购了 Seisint 公司及其高速并行处理平台，并在2008年收购 Choicepoint 公司时，成功地利用该平台集成了该公司的数据系统。华盛顿邮报2011年，HPCC 系统平台根据 Apache v2.0许可证开源。


[[CERN]] and other physics experiments have collected big data sets for many decades, usually analyzed via [[high-throughput computing]] rather than the map-reduce architectures usually meant by the current "big data" movement.


CERN 和其他物理实验已经收集大数据集数十年了，通常是通过高吞吐量计算进行分析，而不是通常意味着当前“大数据”运动的地图缩减架构。


In 2004, [[Google]] published a paper on a process called [[MapReduce]] that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the "map" step). The results are then gathered and delivered (the "reduce" step). The framework was very successful,<ref>Bertolucci, Jeff [http://www.informationweek.com/software/hadoop-from-experiment-to-leading-big-data-platform/d/d-id/1110491? "Hadoop: From Experiment To Leading Big Data Platform"], "Information Week", 2013. Retrieved on 14 November 2013.</ref> so others wanted to replicate the algorithm. Therefore, an [[implementation]] of the MapReduce framework was adopted by an Apache open-source project named "[[Apache Hadoop|Hadoop]]".<ref>Webster, John. [http://research.google.com/archive/mapreduce-osdi04.pdf "MapReduce: Simplified Data Processing on Large Clusters"], "Search Storage", 2004. Retrieved on 25 March 2013.</ref> [[Apache Spark]] was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reducing).


2004年，谷歌发表了一篇名为 MapReduce 的论文，该论文使用了类似的架构。MapReduce 概念提供了一个并行处理模型，并发布了一个相关的实现来处理大量的数据。使用 MapReduce，查询被拆分并分布在并行节点上，并且被并行处理(“映射”步骤)。然后收集和交付结果(“ reduce”步骤)。这个框架非常成功，Bertolucci，Jeff“ Hadoop: 从实验到领导大数据平台”，“信息周”，2013。检索于2013年11月14日，所以其他人希望复制该算法。因此，MapReduce 框架的实现被一个名为“ Hadoop”的 Apache 开源项目所采用。“ MapReduce: 大型集群上的简化数据处理”，“ Search Storage”，2004年。2013年3月25日。Apache Spark 是在2012年针对 MapReduce 范例的限制而开发的，因为它增加了设置许多操作的能力(不仅仅是映射后的减少)。
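The "map" and "reduce" steps described above can be sketched in a few lines. This is a single-process illustration only, not Google's or Hadoop's implementation: the word-count task, the function names `map_step`, `shuffle` and `reduce_step`, and the sample chunks are assumptions chosen for the example, and a real framework would run the map calls on separate nodes.

```python
from collections import defaultdict
from itertools import chain

def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: combine all values gathered for one key into a single result."""
    return key, sum(values)

# Stand-ins for input splits that would live on separate nodes.
chunks = ["big data big", "data lake", "big lake"]
mapped = chain.from_iterable(map_step(c) for c in chunks)
counts = dict(reduce_step(k, v) for k, v in shuffle(mapped).items())
```

Because each call to `map_step` depends only on its own chunk, the calls can run in parallel; only the grouping step needs to see output from every chunk.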


[[MIKE2.0 Methodology|MIKE2.0]] is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled "Big Data Solution Offering".<ref>{{cite web| url=http://mike2.openmethodology.org/wiki/Big_Data_Solution_Offering| title=Big Data Solution Offering|publisher=MIKE2.0|access-date=8 December 2013}}</ref> The methodology addresses handling big data in terms of useful [[permutation]]s of data sources, [[complexity]] in interrelationships, and difficulty in deleting (or modifying) individual records.<ref>{{cite web|url=http://mike2.openmethodology.org/wiki/Big_Data_Definition|title=Big Data Definition|publisher=MIKE2.0|access-date=9 March 2013}}</ref>


MIKE2.0是一个开放的信息管理方法，它承认由于《大数据解决方案提供》一文中确定的大数据影响，需要进行修订。这种方法论通过数据源的有用排列、相互关系的复杂性以及删除(或修改)单个记录的困难来处理大数据。


Studies in 2012 showed that a multiple-layer architecture was one option to address the issues that big data presents. A [[List of file systems#Distributed parallel file systems|distributed parallel]] architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end-user by using a front-end application server.<ref>{{cite journal|last=Boja|first=C|author2=Pocovnicu, A |author3=Bătăgan, L. |title=Distributed Parallel Architecture for Big Data|journal=Informatica Economica|year=2012 |volume=16|issue=2| pages=116–127}}</ref>


2012年的研究表明，多层架构是解决大数据带来的问题的一种选择。分布式并行体系结构将数据分布在多个服务器上; 这些并行执行环境可以显著提高数据处理速度。这种架构将数据插入到并行 DBMS 中，实现了 MapReduce 和 Hadoop 框架的使用。这种类型的框架通过使用前端应用程序服务器来使处理能力对最终用户透明。
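The idea of distributing data across multiple servers and hiding the parallelism behind a front end can be sketched as follows. This is a toy model under stated assumptions: threads stand in for servers, the `sensor-N` records are invented, and `partition` and `local_total` are hypothetical helper names rather than part of any framework mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

def partition(records, num_servers):
    """Assign each (key, value) record to a server by a stable hash of its key."""
    shards = [[] for _ in range(num_servers)]
    for key, value in records:
        shards[crc32(key.encode("utf-8")) % num_servers].append((key, value))
    return shards

def local_total(shard):
    """Work done independently on one server: sum the values it holds."""
    return sum(value for _, value in shard)

# Hypothetical readings to be spread over four servers.
records = [(f"sensor-{i}", i) for i in range(100)]
shards = partition(records, num_servers=4)

# Threads stand in for separate servers; the front end combines partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(local_total, shards))
grand_total = sum(partials)
```

The caller sees only `grand_total`; how many shards exist and where they run stays hidden, which is the sense in which the processing power is transparent to the end user.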


The [[data lake]] allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.<ref>{{cite web|url= http://www.hcltech.com/sites/default/files/solving_key_businesschallenges_with_big_data_lake_0.pdf|title=Solving Key Business Challenges With a Big Data Lake|date=August 2014| website=Hcltech.com|access-date=8 October 2017}}</ref><ref>{{ cite web| url= https://secplab.ppgia.pucpr.br/files/papers/2015-0.pdf | title= Method for testing the fault tolerance of MapReduce frameworks | publisher=Computer Networks | year=2015}}</ref>


数据库允许组织将其重点从集中控制转移到共享模型，以响应不断变化的信息管理动态。这样可以将数据快速隔离到数据湖中，从而减少开销时间。


==Technologies==


* 大数据技术，如商业智能、云计算和数据库

* 可视化，如图表、图形和其他数据显示


Multidimensional big data can also be represented as [[OLAP]] data cubes or, mathematically, [[tensor]]s. [[Array DBMS|Array database systems]] have set out to provide storage and high-level query support on this data type.


多维大数据也可以表示为 OLAP 数据立方体或者数学上的张量。阵列数据库系统已经着手为这种数据类型提供存储和高级查询支持。其他应用于大数据的技术包括高效的基于张量的计算，如多线性子空间学习、大规模并行处理(MPP)数据库、基于搜索的应用程序、数据挖掘、分布式文件系统、分布式缓存(如突发缓冲区和 Memcached)、分布式数据库、基于云和 hpc 的基础设施(应用程序、存储和计算资源) ，以及互联网。虽然已经开发了许多方法和技术，但是仍然很难实现大数据的机器学习。
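To make the data cube idea concrete, the sketch below stores a small three-dimensional cube as a sparse mapping from coordinates to values and rolls it up along chosen axes. The fact records, the dimension order (product, region, quarter) and the helper names `build_cube` and `roll_up` are assumptions made for illustration; array database systems provide comparable operations through their own query languages.

```python
from collections import defaultdict

# Hypothetical fact records: (product, region, quarter, units_sold)
facts = [
    ("laptop", "EU", "Q1", 5),
    ("laptop", "EU", "Q2", 7),
    ("laptop", "US", "Q1", 3),
    ("phone", "EU", "Q1", 10),
    ("phone", "US", "Q2", 4),
]

def build_cube(records):
    """Aggregate records into a sparse 3-D cube keyed by (product, region, quarter)."""
    cube = defaultdict(int)
    for product, region, quarter, units in records:
        cube[(product, region, quarter)] += units
    return cube

def roll_up(cube, keep):
    """Sum the cube over every axis whose index is not listed in `keep`."""
    result = defaultdict(int)
    for coords, value in cube.items():
        result[tuple(coords[i] for i in keep)] += value
    return dict(result)

cube = build_cube(facts)
by_product = roll_up(cube, keep=(0,))            # collapse region and quarter
by_region_quarter = roll_up(cube, keep=(1, 2))   # collapse product
```

Each key of `cube` is a coordinate in a three-way array, so the structure is a sparse order-3 tensor, and a roll-up is a sum over one or more of its axes.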


Some [[Massive parallel processing|MPP]] relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the [[RDBMS]].<ref>{{cite web |author=Monash, Curt |title=eBay's two enormous data warehouses |date=30 April 2009 |url=http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/}}<br />{{cite web |author=Monash, Curt |title=eBay followup – Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more |date=6 October 2010 |url =http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/}}</ref>{{promotional source|date=December 2018}}


一些 MPP 关系数据库具有存储和管理 pb 级数据的能力。隐式是加载、监视、备份和优化 RDBMS 中大型数据表的使用的能力。< br/>


[[DARPA]]'s [[Topological Data Analysis]] program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called "Ayasdi".<ref>{{cite web|url=http://www.ayasdi.com/resources/|title=Resources on how Topological Data Analysis is used to analyze big data|publisher=Ayasdi}}</ref>{{thirdpartyinline|date=December 2018}}


美国国防部高级研究计划局的拓扑数据分析计划寻找海量数据集的基本结构。2008年，随着一家名为“ Ayasdi”的公司的成立，这项技术公之于众。


The practitioners of big data analytics processes are generally hostile to slower shared storage,<ref>{{cite web |title=Storage area networks need not apply |author=CNET News |date=1 April 2011 |url=http://news.cnet.com/8301-21546_3-20049693-10253464.html}}</ref> preferring direct-attached storage ([[Direct-attached storage|DAS]]) in its various forms from solid state drive ([[SSD]]) to high capacity [[Serial ATA|SATA]] disk buried inside parallel processing nodes. The perception of shared storage architectures—[[storage area network]] (SAN) and [[network-attached storage]] (NAS)— is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.


大数据分析处理的从业者通常不喜欢缓慢的共享存储，他们更喜欢各种形式的直接连接的存储设备，从固态硬盘(SSD)到埋藏在并行处理节点中的大容量 SATA 磁盘。对于共享存储架构ーー存储区域网络(SAN)和存储网络附加存储(NAS)ーー的看法是，它们相对缓慢、复杂和昂贵。这些特性与依赖于系统性能、商品基础设施和低成本的大数据分析系统不一致。


Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in direct-attached memory or disk is good; data on memory or disk at the other end of an [[Fiber connector|FC]] [[Storage area network|SAN]] connection is not. The cost of a [[Storage area network|SAN]] at the scale needed for analytics applications is much higher than that of other storage techniques.


实时或接近实时的信息传递是大数据分析的定义特征之一。因此，无论何时何地，只要有可能，就可以避免延迟。直接连接的存储器或磁盘中的数据是好的ーー FC SAN 连接另一端的存储器或磁盘上的数据是坏的。在分析应用程序所需的规模上，SAN 的成本要比其他存储技术高得多。


==Applications==

[[File:2013-09-11 Bus wrapped with SAP Big Data parked outside IDF13 (9730051783).jpg|thumb|Bus wrapped with [[SAP AG|SAP]] big data parked outside [[Intel Developer Forum|IDF13]].|链接=Special:FilePath/2013-09-11_Bus_wrapped_with_SAP_Big_Data_parked_outside_IDF13_(9730051783).jpg]]

Big data has increased the demand for information management specialists so much that [[Software AG]], [[Oracle Corporation]], [[IBM]], [[Microsoft]], [[SAP AG|SAP]], [[EMC Corporation|EMC]], [[Hewlett-Packard|HP]], and [[Dell]] have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.{{r|Economist}}

大数据极大地增加了信息管理专家的需求，以至于 Software AG、甲骨文公司、 IBM、微软、 SAP、 EMC、惠普和戴尔已经在数据管理和分析软件公司上花费了超过150亿美元。在2010年，这个行业价值超过1000亿美元，并且以每年近10% 的速度增长: 大约是整个软件行业的两倍。


Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet.{{r|Economist}} Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 [[petabytes]] in 1986, 471 [[petabytes]] in 1993, 2.2 exabytes in 2000, 65 [[exabytes]] in 2007<ref name="martinhilbert.net"/> and predictions put the amount of internet traffic at 667 exabytes annually by 2014.{{r|Economist}} According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data,<ref name="HilbertContent">{{cite journal|title= What is the Content of the World's Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?| doi= 10.1080/01972243.2013.873748 | volume=30| issue=2 |journal=The Information Society|pages=127–143|year = 2014|last1 = Hilbert|first1 = Martin| s2cid= 45759014 | url= https://escholarship.org/uc/item/87w5f6wb }}</ref> which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e. in the form of video and audio content).


发达经济体越来越多地使用数据密集型技术。全世界有46亿移动电话用户，10亿到20亿人使用互联网。从1990年到2005年，全世界有超过10亿人进入中产阶级，这意味着更多的人变得更有文化，进而导致信息增长。世界通过电信网络交换信息的有效容量在1986年为281千兆字节，1993年为471千兆字节，2000年为2.2千兆字节，2007年为65千兆字节，预计到2014年每年的互联网流量将达到667千兆字节。据估计，全球储存的信息有三分之一是字母数字文本和静止图像数据，这是大多数大数据应用程序最有用的格式。这也显示了尚未使用的数据的潜力(即。以视频和音频内容的形式)。


While many vendors offer off-the-shelf products for big data, experts promote the development of in-house custom-tailored systems if the company has sufficient technical capabilities.<ref>{{cite web |url=http://www.kdnuggets.com/2014/07/interview-amy-gershkoff-ebay-in-house-BI-tools.html |title=Interview: Amy Gershkoff, Director of Customer Analytics & Insights, eBay on How to Design Custom In-House BI Tools |last1=Rajpurohit |first1=Anmol |date=11 July 2014 |website= KDnuggets|access-date=14 July 2014|quote=Generally, I find that off-the-shelf business intelligence tools do not meet the needs of clients who want to derive custom insights from their data. Therefore, for medium-to-large organizations with access to strong technical talent, I usually recommend building custom, in-house solutions.}}</ref>


虽然许多供应商提供现成的大数据产品，但如果公司拥有足够的技术能力，专家则推动开发内部定制系统。


===Government===


在政府流程中使用和采用大数据可以在成本、生产力和创新方面提高效率，但也存在缺陷。数据分析往往需要多个政府部门(中央和地方)协同工作，创建新的创新流程，以实现预期成果。利用大数据的一个常见政府组织是国家安全局，该局不断监测互联网的活动，以搜索其系统可能发现的可疑或非法活动的潜在模式。


[[Civil registration and vital statistics]] (CRVS) collects the status of all certificates from birth to death. CRVS is a source of big data for governments.


民事登记和人口动态统计收集从出生到死亡的所有证明状态。民事登记和人口动态统计系统是政府大数据的一个来源。


===International development===


= = = 国际发展 = = 关于有效利用信息和通信技术促进发展的研究(又称“ ICT4D”)表明，大数据技术可以作出重要贡献，但也对国际发展提出独特的挑战。海量数据分析的进步为改善关键发展领域的决策提供了成本效益高的机会，这些领域包括保健、就业、经济生产力、犯罪、安全、自然灾害和资源管理。此外，用户生成的数据提供了新的机会，给未听到的声音。然而，发展中地区面临的长期挑战，如技术基础设施不足、经济和人力资源稀缺，加剧了人们对大数据的现有担忧，如隐私、方法不完善以及互操作性问题。“大数据促进发展”的挑战目前正朝着通过机器学习(被称为“人工智能促进发展(AI4D)”)应用这些数据的方向发展。希尔伯特 · 曼(2020)。AI4D: 人工智能促进发展。国际通信杂志，14(0) ，21. https://www.martinhilbert.net/ai4d-artificial-intelligence-for-development/


====Benefits====


* 详细程度: 提供具有许多相关变量的细粒度数据，以及新方面，例如网络连接

* 及时性和时间: 图表可以在收集后的几天内生成


====Challenges====


* 协调。数字跟踪数据仍然需要指标的国际协调。它增加了所谓的“数据融合”的挑战，不同来源的协调。

* 资料过载。分析师和机构不习惯于有效地处理大量的变量，而这是通过交互式仪表板有效地完成的。从业人员仍然缺乏一个标准的工作流程，使研究人员、用户和决策者能够高效和有效地工作。


===Healthcare===


通过提供个体化医学和规范性分析，临床风险干预和预测分析，减少浪费和护理变异性，病人数据的自动化外部和内部报告，标准化的医学术语和病人登记，大数据分析在医疗保健中得到了应用。一些需要改进的领域比实际执行的更具雄心壮志。在医疗保健系统中生成的数据级别并不是微不足道的。随着移动健康、电子健康和可穿戴技术的广泛应用，数据量将继续增长。这包括电子健康记录数据、成像数据、患者生成数据、传感器数据以及其他难以处理的数据形式。现在更加需要这种环境更加重视数据和信息质量。“大数据往往意味着‘脏数据’，数据不准确的比例随着数据量的增长而增加。”在大数据规模的人类检查是不可能的，在卫生服务中迫切需要智能工具，以实现准确性和可信度控制，并处理遗漏的信息。虽然现在医疗保健领域的大量信息都是电子化的，但是由于大多数信息都是非结构化的，难以使用，因此它们都被归入了大数据的范畴。在医疗保健中使用大数据引发了重大的道德挑战，从个人权利、隐私和自主权的风险，到透明度和信任度。


Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research.<ref>{{Cite journal|last=Copeland|first=CS|date=Jul–Aug 2017|title=Data Driving Discovery|url=http://claudiacopeland.com/uploads/3/5/5/6/35560346/_hjno_data_driving_discovery_2pv.pdf|journal=Healthcare Journal of New Orleans|pages=22–27}}</ref> Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow up biological research and eventually clinical research.


健康研究中的大数据在探索性生物医学研究方面特别有前途，因为数据驱动的分析可以比假设驱动的研究更快地向前推进。然后，数据分析的趋势可以在传统的、假设驱动的后续生物学研究和最终的临床研究中得到验证。


A related application sub-area within healthcare that relies heavily on big data is [[computer-aided diagnosis]] in medicine.

<ref name="CAD7challenges">{{cite journal | vauthors = Yanase J, Triantaphyllou E| title = A Systematic Survey of Computer-Aided Diagnosis in Medicine: Past and Present Developments. | journal = Expert Systems with Applications | volume = 138 | pages = 112821 | date = 2019 | doi = 10.1016/j.eswa.2019.112821 | s2cid = 199019309 }}</ref> For instance, for [[epilepsy]] monitoring it is customary to create 5 to 10 GB of data daily.

<ref name=":1">{{cite journal | vauthors = Dong X, Bahroos N, Sadhu E, Jackson T, Chukhman M, Johnson R, Boyd A, Hynes D| title = Leverage Hadoop framework for large scale clinical informatics applications | journal = AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science | pages = 53 | date = 2013 | volume = 2013 | pmid = 24303235 }}</ref> Similarly, a single uncompressed image of breast [[tomosynthesis]] averages 450 MB of data.

<ref name=":2">{{cite journal | vauthors = Clunie D| title = Breast tomosynthesis challenges digital imaging infrastructure | url = http://www.auntminnie.com/index.aspx?sec=prtf&sub=def&pag=dis&itemId=102872&printpage=true&fsec=ser&fsub=def | date = 2013 }}</ref>

These are just a few of the many examples where [[computer-aided diagnosis]] uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome in order to reach the next level of performance.

<ref>

第359行：第485行：

</ref>


在医疗保健领域，一个相关的应用子领域，严重依赖于大数据，那就是医药电脑辅助诊断。例如，对于癫痫监测，通常每天创建5到10gb 的数据。同样，一张未压缩的乳房断层合成图像平均有450mb 的数据。这些只是电脑辅助诊断使用大数据的众多例子中的一小部分。基于这个原因，大数据已经被认为是电脑辅助诊断系统需要克服的7个关键挑战之一，以达到下一个性能水平。


===Education===


麦肯锡全球研究所的一项研究发现，受过高等培训的数据专业人员和管理人员短缺150万人，包括田纳西大学和加州大学伯克利分校在内的一些大学已经开设了硕士课程来满足这一需求。私营新兵训练营也开发了一些项目来满足这种需求，包括免费的数据孵化器项目或者付费的大会项目。在特定的营销领域，Wedel 和 Kannan 强调的问题之一是，营销有几个子领域(例如，广告、促销、产品开发、品牌) ，它们都使用不同类型的数据。


===Media===


* 数据捕捉

* 数据新闻: 出版商和记者使用大数据工具提供独特和创新的见解和信息图表。


[[Channel 4]], the British [[Public service broadcasting in the United Kingdom|public-service]] television broadcaster, is a leader in the field of big data and [[data analysis]].<ref>{{cite web|url=https://www.ibc.org/tech-advances/big-data-and-analytics-c4-and-genius-digital/1076.article |title=Big data and analytics: C4 and Genius Digital|website=Ibc.org |access-date=8 October 2017}}</ref>


英国公共服务电视广播公司第四频道是大数据和数据分析领域的领导者。


===Insurance===


= = = = 医疗保险提供者正在收集关于诸如食物和电视消费、婚姻状况、衣服尺寸和购买习惯等社会”健康决定因素”的数据，从而对医疗费用进行预测，以便发现客户的健康问题。这些预测目前是否被用于定价还存在争议。


===Internet of things (IoT)===


Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device inter-connectivity. Such mappings have been used by the media industry, companies, and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical,<ref>{{cite web|url=http://www.businesswire.com/news/home/20170109006500/en/QuiO-Named-Innovation-Champion-Accenture-HealthTech-Innovation|title=QuiO Named Innovation Champion of the Accenture HealthTech Innovation Challenge|website=Businesswire.com|access-date=8 October 2017| date=10 January 2017}}</ref> manufacturing<ref>{{cite web|url= https://www.predix.com/sites/default/files/IDC_OT_Final_whitepaper_249120.pdf |title=A Software Platform for Operational Technology Innovation|website=Predix.com|access-date=8 October 2017}}</ref> and transportation<ref name="BigDataIoT16">{{cite web|url =http://www.wiomax.com/big-data-driven-smart-transportation-the-underlying-big-story-of-smart-iot-transformed-mobility/| title=Big Data Driven Smart Transportation: the Underlying Story of IoT Transformed Mobility| author=Z. Jenipher Wang|date=March 2017}}</ref> contexts.


= = = 物联网(IoT) = = = 大数据与物联网协同工作。从物联网设备中提取的数据提供了设备间连接的映射。这样的映射已经被媒体行业、公司和政府用来更精确地定位他们的受众并提高媒体效率。物联网也越来越多地被用作收集感官数据的手段，这些感官数据已经被用于医疗、制造和运输领域。


[[Kevin Ashton]], the digital innovation expert who is credited with coining the term,<ref>{{cite web|url=http://www.rfidjournal.com/articles/view?4986|title=That Internet Of Things Thing.}}</ref> defines the Internet of things in this quote: "If we had computers that knew everything there was to know about things—using data they gathered without any help from us—we would be able to track and count everything, and greatly reduce waste, loss, and cost. We would know when things needed replacing, repairing, or recalling, and whether they were fresh or past their best."


数字创新专家凯文 · 阿什顿(Kevin Ashton)被誉为“物联网”(Internet of things)的创始人，他在这句话中给物联网下了这样的定义: “如果我们有一台了解一切的计算机——在没有我们帮助的情况下使用它们收集的数据——我们就能够跟踪和计算一切，大大减少浪费、损失和成本。”。我们会知道什么时候需要更换、修理或回收，以及这些东西是新的还是过时的。”


===Information technology===


= = = 信息技术 = = = 特别是自2015年以来，大数据作为帮助雇员提高工作效率和简化信息技术的收集和分发的一种工具，在企业运作中日益受到重视。使用大数据来解决企业内部的 IT 和数据收集问题被称为 IT 操作分析(ITOA)。通过将大数据原理应用到机器智能和深度计算的概念中，IT 部门可以预测潜在的问题并预防它们。ITOA 企业提供系统管理平台，将数据竖井集中在一起，从整个系统而不是从孤立的数据块中产生见解。


==Case studies==

===Government===

====China====

* The Integrated Joint Operations Platform (IJOP, 一体化联合作战平台) is used by the government to monitor the population, particularly [[Uyghurs]].<ref name="WP8218">{{cite news| url=https://www.washingtonpost.com/opinions/global-opinions/ethnic-cleansing-makes-a-comeback--in-china/2018/08/02/| archive-url=https://web.archive.org/web/20190331161843/https://www.washingtonpost.com/opinions/global-opinions/ethnic-cleansing-makes-a-comeback--in-china/2018/08/02/| url-status=dead| archive-date=31 March 2019|title=Ethnic cleansing makes a comeback – in China|author1=Josh Rogin|date=2 August 2018|access-date=4 August 2018|issue=Washington Post|quote=Add to that the unprecedented security and surveillance state in Xinjiang, which includes all-encompassing monitoring based on identity cards, checkpoints, facial recognition and the collection of DNA from millions of individuals. The authorities feed all this data into an artificial-intelligence machine that rates people's loyalty to the Communist Party in order to control every aspect of their lives.}}</ref> [[Biometrics]], including DNA samples, are gathered through a program of free physicals.<ref name="how022618">{{cite web|url= https://www.hrw.org/news/2018/02/26/china-big-data-fuels-crackdown-minority-region |title=China: Big Data Fuels Crackdown in Minority Region: Predictive Policing Program Flags Individuals for Investigations, Detentions|date=26 February 2018|website=hrw.org|publisher=Human Rights Watch|access-date=4 August 2018}}</ref>

*By 2020, China plans to give all its citizens a personal "social credit" score based on how they behave.<ref>{{cite news |title=Discipline and Punish: The Birth of China's Social-Credit System |url=https://www.thenation.com/article/china-social-credit-system/ |work=The Nation |date=23 January 2019}}</ref> The [[Social Credit System]], now being piloted in a number of Chinese cities, is considered a form of [[Mass surveillance in China|mass surveillance]] which uses big data analysis technology.<ref>{{cite news |title=China's behavior monitoring system bars some from travel, purchasing property |url=https://www.cbsnews.com/news/china-social-credit-system-surveillance-cameras/ |work=CBS News |date=24 April 2018}}</ref>{{Dubious|date=December 2021}}<ref>{{cite magazine |title=The complicated truth about China's social credit system |url=https://www.wired.co.uk/article/china-social-credit-system-explained |magazine=WIRED |date=21 January 2019}}</ref>


* FICO Card Detection System protects accounts worldwide.


* 沃尔玛每小时处理超过100万笔客户交易，这些交易被导入数据库，估计包含超过2.5拍字节(2560太字节)的数据，相当于美国国会图书馆所有书籍所含信息的167倍。

* 文德米尔不动产利用接近一亿名司机的位置资料，帮助置业人士计算每天不同时段往返工作地点的典型驾驶时间。

* FICO 卡检测系统保护全球账户。


===Science===


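* 23andMe's DNA database contains the genetic information of over 1,000,000 people worldwide. If patients give their consent, the company will sell the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes. Ahmad Hariri, professor of psychology and neuroscience at Duke University, has been using 23andMe in his research since 2009. A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demand for access to the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.

* Computational fluid dynamics and hydrodynamic turbulence research generate massive data sets. The Johns Hopkins Turbulence Databases (JHTDB) contain over 350 terabytes of spatiotemporal fields from direct numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using "virtual sensors" with various access modes, ranging from direct web-browser queries, through access from Matlab, Python, Fortran, and C programs executing on clients' platforms, to cut-out services for downloading raw data. The data have been used in over 150 scientific publications.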

+

'''''【终译版】'''''。

===Sports===

第568行：第707行：

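Using sport sensors, big data can be used to improve training and to understand competitors. It is also possible to predict the winners of a match using big data analytics. The future performance of players can be predicted as well. Thus, players' value and salary are determined by data collected throughout the season.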

+

'''''【终译版】'''''。

In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency.<ref>{{cite web|url=https://www.huffingtonpost.com/dave-ryan/sports-where-big-data-fin_b_8553884.html|title= Sports: Where Big Data Finally Makes Sense |author=Dave Ryan| work=huffingtonpost.com |date= 13 November 2015 |access-date=12 December 2015}}</ref>

第576行：第717行：

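In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency. Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, using big data, race teams try to predict in advance the time at which they will finish the race, based on simulations using data collected over the season.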

+

'''''【终译版】'''''。

===Technology===

第592行：第736行：

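* Facebook handles 50 billion photos from its user base. Its monthly active users have reached 2 billion.

* Google handles roughly 100 billion searches per month.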

+

'''''【终译版】'''''。

===COVID-19===

第599行：第746行：

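During the COVID-19 pandemic, big data was raised as a way to minimise the impact of the disease. Significant applications of big data included minimising the spread of the virus, case identification, and the development of medical treatment.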

+

'''''【终译版】'''''。

Governments used big data to track infected people to minimise spread. Early adopters included China, Taiwan, South Korea, and Israel.<ref>{{cite news |last1=Manancourt |first1=Vincent |title=Coronavirus tests Europe's resolve on privacy |url=https://www.politico.eu/article/coronavirus-tests-europe-resolve-on-privacy-tracking-apps-germany-italy/ |access-date=30 October 2020 |work=Politico |date=10 March 2020}}</ref><ref>{{cite news |last1=Choudhury |first1=Amit Roy |title=Gov in the Time of Corona |url=https://govinsider.asia/innovation/gov-in-the-time-of-corona/ |access-date=30 October 2020 |work=Gov Insider |date=27 March 2020}}</ref><ref>{{cite news |last1=Cellan-Jones |first1=Rory |title=China launches coronavirus 'close contact detector' app |url=https://www.bbc.com/news/technology-51439401 |access-date=30 October 2020 |work=BBC |date=11 February 2020|archive-url=https://web.archive.org/web/20200228003957/https://www.bbc.com/news/technology-51439401 |archive-date=28 February 2020 }}</ref>

第605行：第754行：

各国政府利用大数据来追踪感染者，以最大限度地减少传播。早期的采用者包括中国、台湾、韩国和以色列。

+

'''''【终译版】'''''。

==Research activities==

第611行：第762行：

Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society for Engineering Education. Gautam Siwach, who took part in ''Tackling the Challenges of Big Data'' at the MIT Computer Science and Artificial Intelligence Laboratory, and Amir Esmailpour of the UNH Research Group investigated key features of big data, namely the formation of clusters and their interconnections. They focused on the security of big data and on the term's orientation towards the presence of different types of data in encrypted form at the cloud interface, providing basic definitions and real-time examples within the technology. They also proposed an approach for identifying the encoding technique, as a step towards faster search over encrypted text and, in turn, stronger security for big data.

−

~~= = 研究活动 = =~~ 2014年3月，美国工程教育学会演示了大数据中的加密搜索和集群形成。由麻省理工学院计算机科学和人工智能实验室和 UNH 研究小组的 Amir Esmailpour 共同致力于解决大数据的挑战，他们研究了大数据的关键特征，即集群的形成及其相互联系。他们重点讨论了大数据的安全性以及该术语的方向，即通过提供技术中的原始定义和实时示例，在云界面上以加密形式存在不同类型的数据。此外，他们还提出了一种识别编码技术的方法，以便对加密文本进行快速搜索，从而加强大数据的安全性。

+

2014年3月，美国工程教育学会演示了大数据中的加密搜索和集群形成。由麻省理工学院计算机科学和人工智能实验室和 UNH 研究小组的 Amir Esmailpour 共同致力于解决大数据的挑战，他们研究了大数据的关键特征，即集群的形成及其相互联系。他们重点讨论了大数据的安全性以及该术语的方向，即通过提供技术中的原始定义和实时示例，在云界面上以加密形式存在不同类型的数据。此外，他们还提出了一种识别编码技术的方法，以便对加密文本进行快速搜索，从而加强大数据的安全性。

+

'''''【终译版】'''''。

In March 2012, The White House announced a national "Big Data Initiative" that consisted of six federal departments and agencies committing more than $200 million to big data research projects.<ref>{{cite web|title=Obama Administration Unveils "Big Data" Initiative:Announces $200 Million in New R&D Investments| url=https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf |url-status =live| archive-url =https://web.archive.org/web/20170121233309/https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf |via=[[NARA|National Archives]]|work=[[Office of Science and Technology Policy]]|archive-date=21 January 2017}}</ref>

第618行：第771行：

2012年3月，白宫宣布了一项全国性的“大数据倡议”，由六个联邦部门和机构组成，向大数据研究项目投入了2亿多美元。

+

'''''【终译版】'''''。

The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over five years to the AMPLab<ref>{{cite web|url=http://amplab.cs.berkeley.edu |title=AMPLab at the University of California, Berkeley |publisher=Amplab.cs.berkeley.edu |access-date=5 March 2013}}</ref> at the University of California, Berkeley.<ref>{{cite web |title=NSF Leads Federal Efforts in Big Data|date=29 March 2012|publisher=National Science Foundation (NSF) |url= https://www.nsf.gov/news/news_summ.jsp?cntn_id=123607&org=NSF&from=news}}</ref> The AMPLab also received funds from [[DARPA]], and over a dozen industrial sponsors and uses big data to attack a wide range of problems from predicting traffic congestion<ref>{{cite conference| url=https://amplab.cs.berkeley.edu/publication/scaling-the-mobile-millennium-system-in-the-cloud-2/|author1=Timothy Hunter|date=October 2011|author2=Teodor Moldovan|author3=Matei Zaharia| author4 =Justin Ma|author5=Michael Franklin|author6-link=Pieter Abbeel|author6=Pieter Abbeel|author7=Alexandre Bayen |title=Scaling the Mobile Millennium System in the Cloud}}</ref> to fighting cancer.<ref>{{cite news|title=Computer Scientists May Have What It Takes to Help Cure Cancer|author=David Patterson|work=The New York Times| date=5 December 2011 |url=https://www.nytimes.com/2011/12/06/science/david-patterson-enlist-computer-scientists-in-cancer-fight.html}}</ref>

第624行：第779行：

这一举措包括美国国家科学基金会”计算机探险”项目，该项目将在五年内向加州大学伯克利分校的 AMPLab 提供1000万美元的资助。美国国防部高级研究计划局也从美国国防部高级研究计划局和十几个工业赞助商那里获得了资金，并利用大数据来解决从预测交通堵塞到抗击癌症的一系列问题。

+

'''''【终译版】'''''。

The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over five years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute,<ref>{{cite web|title=Secretary Chu Announces New Institute to Help Scientists Improve Massive Data Set Research on DOE Supercomputers |publisher=energy.gov |url=http://energy.gov/articles/secretary-chu-announces-new-institute-help-scientists-improve-massive-data-set-research-doe}}</ref> led by the Energy Department's [[Lawrence Berkeley National Laboratory]]. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the department's supercomputers.

第630行：第787行：

白宫大数据倡议还包括能源部承诺在未来五年内提供2500万美元的资金，用于建立可扩展的数据管理、分析和可视化研究所，由能源部下属的劳伦斯伯克利国家实验室数据中心领导。SDAV 研究所旨在汇集六个国家实验室和七所大学的专门知识，开发新的工具，以帮助科学家管理和可视化该部门超级计算机上的数据。

+

'''''【终译版】'''''。

The U.S. state of [[Massachusetts]] announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions.<ref>{{Cite news|last=Young|first=Shannon|date=2012-05-30|title=Mass. governor, MIT announce big data initiative|work=Boston.com|url=http://archive.boston.com/news/local/massachusetts/articles/2012/05/30/mass_gov_and_mit_to_announce_data_initiative/|access-date=2021-07-29}}</ref> The [[Massachusetts Institute of Technology]] hosts the Intel Science and Technology Center for Big Data in the [[MIT Computer Science and Artificial Intelligence Laboratory]], combining government, corporate, and institutional funding and research efforts.<ref>{{cite web|url=http://bigdata.csail.mit.edu/ |title=Big Data @ CSAIL |publisher= Bigdata.csail.mit.edu |date=22 February 2013 |access-date=5 March 2013}}</ref>

第636行：第795行：

美国马萨诸塞州在2012年5月宣布了马萨诸塞州大数据倡议，该倡议为各种研究机构提供来自州政府和私营公司的资金。麻省理工学院在麻省理工学院计算机科学和人工智能实验室中设有英特尔大数据科学技术中心，将政府、企业和机构的资金和研究成果结合在一起。

+

'''''【终译版】'''''。

The European Commission is funding the two-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for [[Horizon 2020]], their next [[Framework Programmes for Research and Technological Development|framework program]].<ref>{{cite web |url=https://cordis.europa.eu/project/id/318062 |title=Big Data Public Private Forum |publisher=cordis.europa.eu |date=1 September 2012 |access-date=16 March 2020 }}</ref>

第642行：第803行：

欧盟委员会正在通过其第七框架计划资助为期两年的大数据公私论坛，让公司、学术界和其他利益攸关方参与讨论大数据问题。该项目旨在确定研究和创新方面的战略，以指导欧洲委员会在成功实施大数据经济方面的支持行动。这个项目的成果将被用作地平线2020的投入，他们的下一个框架计划。

+

'''''【终译版】'''''。

The British government announced in March 2014 the founding of the [[Alan Turing Institute]], named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets.<ref>{{cite news|url=https://www.bbc.co.uk/news/technology-26651179|title=Alan Turing Institute to be set up to research big data|work=[[BBC News]]|access-date=19 March 2014|date=19 March 2014}}</ref>

第648行：第811行：

2014年3月，英国政府宣布成立艾伦图灵研究院数据中心，该中心以计算机先驱和密码破译者的名字命名，将致力于研究收集和分析大型数据集的新方法。

+

'''''【终译版】'''''。

At the [[University of Waterloo Stratford Campus]] Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world.<ref>{{cite web|url= http://www.betakit.com/event/inspiration-day-at-university-of-waterloo-stratford-campus/| title=Inspiration day at University of Waterloo, Stratford Campus |publisher=betakit.com/ |access-date=28 February 2014}}</ref>

第654行：第819行：

在滑铁卢大学斯特拉特福德校区加拿大开放数据体验(CODE)启发日上，与会者展示了如何使用数据可视化数据可以增加对大数据集的理解和吸引力，并向世界传达他们的故事。

+

'''''【终译版】'''''。

[[Computational social science|Computational social sciences]] – Anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences.<ref name=pigdata>{{cite journal|last=Reips|first=Ulf-Dietrich|author2=Matzat, Uwe |title=Mining "Big Data" using Big Data Services |journal=International Journal of Internet Science |year=2014|volume=1|issue=1|pages=1–8 | url=http://www.ijis.net/ijis9_1/ijis9_1_editorial_pre.html}}</ref> Often these APIs are provided for free.<ref name="pigdata" /> [[Tobias Preis]] et al. used [[Google Trends]] data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviors and real-world economic indicators.<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE, Bishop SR | title = Quantifying the advantage of looking forward | journal = Scientific Reports | volume = 2 | pages = 350 | year = 2012 | pmid = 22482034 | pmc = 3320057 | doi = 10.1038/srep00350 | bibcode = 2012NatSR...2E.350P }}</ref><ref>{{cite news | url=https://www.newscientist.com/article/dn21678-online-searches-for-future-linked-to-economic-success.html | title=Online searches for future linked to economic success |first=Paul |last=Marks |work=New Scientist | date=5 April 2012 | access-date=9 April 2012}}</ref><ref>{{cite news | url=https://arstechnica.com/gadgets/news/2012/04/google-trends-reveals-clues-about-the-mentality-of-richer-nations.ars | title=Google Trends reveals clues about the mentality of richer nations |first=Casey |last=Johnston |work=Ars Technica | date=6 April 2012 | access-date=9 April 2012}}</ref> The authors of the study examined Google query logs and computed the ratio of the volume of searches for the coming year (2011) to the volume of searches for the previous year (2009), which they call the "[[future orientation index]]".<ref>{{cite web | url = http://www.tobiaspreis.de/bigdata/future_orientation_index.pdf | title = Supplementary Information: The Future Orientation Index is available for download | author = Tobias Preis | date = 24 May 2012 | access-date = 24 May 2012}}</ref> They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher per capita GDP.
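
The ratio described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' code: the function names, the three countries, and every number below are invented for the example, and the plain Pearson correlation stands in for whatever statistical analysis the study actually used.

```python
import math


def future_orientation_index(next_year_searches, previous_year_searches):
    """Ratio of search volume for the coming year to search volume for the previous year."""
    if previous_year_searches <= 0:
        raise ValueError("previous-year search volume must be positive")
    return next_year_searches / previous_year_searches


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented figures for three hypothetical countries:
# (searches for "2011", searches for "2009", per capita GDP in US dollars)
countries = {
    "A": (120.0, 100.0, 45000.0),
    "B": (90.0, 100.0, 20000.0),
    "C": (60.0, 100.0, 8000.0),
}

indices = {
    name: future_orientation_index(nxt, prev)
    for name, (nxt, prev, _gdp) in countries.items()
}
gdp_values = [gdp for (_nxt, _prev, gdp) in countries.values()]

# Correlation between the index and per capita GDP across the toy countries
r = pearson(list(indices.values()), gdp_values)

print(indices)
print(round(r, 2))
```

An index above 1 means more searches for the coming year than for the previous one. With real data, a correlation across only a handful of countries would of course say very little.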

第660行：第827行：

计算社会科学——任何人都可以使用大数据持有者(如谷歌和 Twitter)提供的应用程序编程接口(api)进行社会和行为科学研究。这些 api 通常是免费提供的。托拜厄斯 · 普雷斯等。使用谷歌趋势数据证明，来自人均国内生产总值(gdp)较高国家的互联网用户更有可能搜索有关未来的信息，而不是有关过去的信息。研究结果表明，在线行为和现实世界的经济指标之间可能存在某种联系。这项研究的作者审查了谷歌的查询日志，这些日志是根据下一年(2011年)的搜索量与上一年(2009年)的搜索量之比制作的，他们称之为“未来方向索引”。他们将未来方向指数与每个国家的人均 GDP 进行了比较，发现谷歌用户询问更多关于未来的信息的国家有一个更高的 GDP 趋势。

+

'''''【终译版】'''''。

[[Tobias Preis]] and his colleagues Helen Susannah Moat and [[H. Eugene Stanley]] introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends.<ref>{{cite journal | url =http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball | journal=Nature | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013| author-link=Philip Ball }}</ref> Their analysis of [[Google]] search volume for 98 terms of varying financial relevance, published in ''[[Scientific Reports]]'',<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE | title = Quantifying trading behavior in financial markets using Google Trends | journal = Scientific Reports | volume = 3 | pages = 1684 | year = 2013 | pmid = 23619126 | pmc = 3635219 | doi = 10.1038/srep01684 | bibcode = 2013NatSR...3E1684P }}</ref> suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.<ref>{{cite news | url=http://bits.blogs.nytimes.com/2013/04/26/google-search-terms-can-predict-stock-market-study-finds/ | title= Google Search Terms Can Predict Stock Market, Study Finds | author=Nick Bilton | work=[[The New York Times]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite magazine | url=http://business.time.com/2013/04/26/trouble-with-your-investment-portfolio-google-it/ | title=Trouble With Your Investment Portfolio? Google It! 
| author=Christopher Matthews | magazine=[[Time (magazine)|Time]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite journal | url= http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball |journal=[[Nature (journal)|Nature]] | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013}}</ref><ref>{{cite news | url=http://www.businessweek.com/articles/2013-04-25/big-data-researchers-turn-to-google-to-beat-the-markets | title='Big Data' Researchers Turn to Google to Beat the Markets | author=Bernhard Warner | work=[[Bloomberg Businessweek]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url=https://www.independent.co.uk/news/business/comment/hamish-mcrae/hamish-mcrae-need-a-valuable-handle-on-investor-sentiment-google-it-8590991.html | title=Hamish McRae: Need a valuable handle on investor sentiment? Google it | author=Hamish McRae | work=[[The Independent]] | date=28 April 2013 | access-date=9 August 2013 | location=London}}</ref><ref>{{cite web | url=http://www.ft.com/intl/cms/s/0/e5d959b8-acf2-11e2-b27f-00144feabdc0.html | title= Google search proves to be new word in stock market prediction | author=Richard Waters | work=[[Financial Times]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url =https://www.bbc.co.uk/news/science-environment-22293693 | title=Google searches predict market moves | author=Jason Palmer | work=[[BBC]] | date=25 April 2013 | access-date=9 August 2013}}</ref>

第666行：第835行：

Tobias Preis 和他的同事 Helen Susannah Moat 和 h. Eugene Stanley 介绍了一种方法，利用基于 Google Trends 提供的搜索量数据的交易策略来识别股市走势的在线前兆。他们在《科学报告》(Scientific Reports)上发表了对谷歌(Google)98个财务相关性不同的词条的搜索量分析，结果表明，财务相关搜索词的搜索量增加往往先于金融市场的巨额亏损。

+

'''''【终译版】'''''。

Big data sets come with algorithmic challenges that previously did not exist. Hence, some see a need to fundamentally change the way data are processed.<ref>E. Sejdić (March 2014). "Adapt current tools for use with big data". ''Nature''. '''507''' (7492): 306.</ref>

第672行：第843行：

大数据集带来了以前不存在的算法挑战。因此，有些人认为有必要从根本上改变处理方式。Sejdi (2014年3月)。“调整现有工具，以便与大数据一起使用”。自然。507 (7492): 306.

+

'''''【终译版】'''''。

The Workshops on Algorithms for Modern Massive Data Sets (MMDS) bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to discuss algorithmic challenges of big data.<ref>Stanford. [https://web.stanford.edu/group/mmds/ "MMDS. Workshop on Algorithms for Modern Massive Data Sets"].</ref> With big data, such notions of magnitude are relative. As one source puts it, "If the past is of any guidance, then today's big data most likely will not be considered as such in the near future."<ref name=CAD7challenges/>

第678行：第851行：

现代海量数据集算法研讨会(MMDS)聚集了计算机科学家、统计学家、数学家和数据分析从业者，讨论大数据的算法挑战。斯坦福大学。“ MMDS。现代海量数据集算法研讨会”。对于大数据，这样的量级概念是相对的。正如文中所说: “如果说过去的数据有什么指导意义的话，那么今天的大数据在不久的将来很可能不会被认为是这样的。”

+

'''''【终译版】'''''。

===Sampling big data===

第685行：第860行：

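A research question that is asked about big data sets is whether it is necessary to look at the full data to draw certain conclusions about the properties of the data, or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. But sampling enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage, and controller data are available at short time intervals. To predict downtime it may not be necessary to look at all the data; a sample may be sufficient. Big data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data. With large sets of data points, marketers are able to create and use more customized segments of consumers for more strategic targeting.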

+

'''''【终译版】'''''。

There has been some work done in sampling algorithms for big data. A theoretical formulation for sampling Twitter data has been developed.<ref>{{cite conference |author1=Deepan Palguna |author2= Vikas Joshi |author3=Venkatesan Chakravarthy |author4=Ravi Kothari |author5=L. V. Subramaniam |name-list-style=amp | title=Analysis of Sampling Algorithms for Twitter | journal=[[International Joint Conference on Artificial Intelligence]] | year=2015 }}</ref>
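
As a generic illustration of sampling from a data set too large to hold in memory, the sketch below uses reservoir sampling (often called Algorithm R), which keeps a uniform random sample of `k` items from a stream of unknown length in a single pass. It is a standard textbook technique shown here for orientation only; it is not the formulation developed in the cited work on Twitter data, and the function name and parameters are chosen for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    If the stream holds fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# One pass over a million "records" while storing only 100 of them
sample = reservoir_sample(range(1_000_000), 100, random.Random(0))
print(len(sample))
```

Memory use depends only on `k`, not on the length of the stream, which is why this family of algorithms suits data that arrive continuously.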

第691行：第868行：

在大数据的抽样算法方面已经做了一些工作。已经开发了一个抽样 Twitter 数据的理论公式。

+

'''''【终译版】'''''。

==Critique==

第705行：第884行：

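===Critiques of the big data paradigm===

"A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the typical network characteristics of big data." In their critique, Snijders, Matzat, and Reips point out that very strong assumptions are often made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson's assertion that big data will spell the end of theory, focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts. Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analyzed, must be complemented by "big judgment", according to an article in the ''Harvard Business Review''.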

+

'''''【终译版】'''''。

In the same vein, it has been pointed out that decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is".<ref name="HilbertBigData2013">Hilbert, M. (2016). Big Data for Development: A Review of Promises and Challenges. Development Policy Review, 34(1), 135–174. https://doi.org/10.1111/dpr.12142 free access: https://www.martinhilbert.net/big-data-for-development/</ref> Fed by a large amount of data on past experiences, algorithms can predict future development if the future is similar to the past.<ref name="HilbertTEDx">[https://www.youtube.com/watch?v=UXef6yfJZAI Big Data requires Big Visions for Big Change.], Hilbert, M. (2014). London: TEDx UCL, x=independently organized TED talks</ref> If the system's dynamics change in the future (that is, if it is not a [[stationary process]]), the past can say little about the future. To make predictions in changing environments, a thorough understanding of the system's dynamics is necessary, which requires theory.<ref name="HilbertTEDx"/> In response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge".<ref>{{cite journal|last=Alemany Oliver|first=Mathieu |author2=Vayre, Jean-Sebastien |s2cid=111360835 |title= Big Data and the Future of Knowledge Production in Marketing Research: Ethics, Digital Traces, and Abductive Reasoning|journal=Journal of Marketing Analytics |year=2015|volume=3|issue=1|doi= 10.1057/jma.2015.1|pages=5–13}}</ref>
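
The point about non-stationary systems can be made concrete with a toy example. The sketch below fits a straight line to past observations and then scores it on two futures: one that follows the same rule and one where the trend reverses. Everything here (the linear rule, the reversal at `t = 50`, the function names) is invented for illustration and does not come from the cited sources.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit of y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
             / sum((t - mean_t) ** 2 for t in ts))
    intercept = mean_y - slope * mean_t
    return slope, intercept


def mean_abs_error(slope, intercept, ts, ys):
    """Average absolute gap between the fitted line and the observed values."""
    return sum(abs(slope * t + intercept - y) for t, y in zip(ts, ys)) / len(ts)


# "Past" data generated by a fixed rule: y = 2t + 1
past_t = list(range(50))
past_y = [2.0 * t + 1.0 for t in past_t]
slope, intercept = fit_line(past_t, past_y)

future_t = list(range(50, 100))
# Future 1: the same rule keeps holding (stationary dynamics)
same_dynamics = [2.0 * t + 1.0 for t in future_t]
# Future 2: the trend reverses at t = 50 (the dynamics have changed)
changed_dynamics = [101.0 - 1.0 * (t - 50) for t in future_t]

error_same = mean_abs_error(slope, intercept, future_t, same_dynamics)
error_changed = mean_abs_error(slope, intercept, future_t, changed_dynamics)

print(error_same, error_changed)
```

More past data of the same kind would not reduce the second error. Only knowledge of why the rule changed would, which is the role the paragraph above assigns to theory.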

第713行：第895行：

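In the same vein, it has been pointed out that decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is" (Hilbert, M. (2016). Big Data for Development: A Review of Promises and Challenges. Development Policy Review, 34(1), 135–174. https://doi.org/10.1111/dpr.12142; free access: https://www.martinhilbert.net/big-data-for-development/). Fed by a large amount of data on past experiences, algorithms can predict future development if the future is similar to the past (Hilbert, M. (2014). Big Data requires Big Visions for Big Change. London: TEDx UCL, x=independently organized TED talks). If the system's dynamics change in the future (that is, if it is not a stationary process), the past can say little about the future. To make predictions in changing environments, a thorough understanding of the system's dynamics is necessary, which requires theory. In response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge". Additionally, it has been suggested that big data approaches be combined with computer simulations, such as agent-based models and complex systems. Agent-based models are getting increasingly better at predicting the outcome of social complexities, even for unknown future scenarios, through computer simulations based on a collection of mutually interdependent algorithms (Epstein, J. M., & Axtell, R. L. (1996). Growing Artificial Societies: Social Science from the Bottom Up. A Bradford Book). Finally, multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, have proven useful as analytic approaches that go well beyond the bivariate approaches (e.g. contingency tables) typically employed with smaller data sets.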

+

'''''【终译版】'''''。

In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis.<ref>{{cite web|url=http://www.bigdataparis.com/documents/Pierre-Delort-INSERM.pdf#page=5| title=Delort P., Big data in Biosciences, Big Data Paris, 2012|website =Bigdataparis.com |access-date=8 October 2017}}</ref>

第721行：第905行：

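In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis. A new postulate is now accepted in the biosciences: the information provided by data in huge volumes (omics) without a prior hypothesis is complementary, and sometimes necessary, to conventional approaches based on experimentation. In the massive approaches, it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. The search logic is reversed, and the limits of induction ("Glory of Science and Philosophy scandal", C. D. Broad, 1926) have to be considered.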

+

'''''【终译版】'''''。

[[Consumer privacy|Privacy]] advocates are concerned about the threat to privacy represented by increasing storage and integration of [[personally identifiable information]]; expert panels have released various policy recommendations to conform practice to expectations of privacy.<ref>{{cite magazine |first=Paul |last=Ohm |title=Don't Build a Database of Ruin |magazine=Harvard Business Review |url=http://blogs.hbr.org/cs/2012/08/dont_build_a_database_of_ruin.html|date=23 August 2012 }}</ref> The misuse of big data in several cases by media, companies, and even the government has eroded trust in almost every fundamental institution holding up society.<ref>Bond-Graham, Darwin (2018). [https://www.theperspective.com/debates/the-perspective-on-big-data/ "The Perspective on Big Data"]. [[The Perspective]].</ref>

第727行：第913行：

隐私权倡导者担心隐私权受到威胁，这种威胁表现在个人身份信息的存储和整合不断增加; 专家小组已经发布了各种政策建议，使实践符合隐私权的期望。媒体、公司甚至政府在几个案例中滥用大数据，导致几乎所有支撑社会的基础机构都失去了信任。邦德-格雷厄姆，达尔文(2018)。“大数据透视”。透视法。

+

'''''【终译版】'''''。

Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in the context of big data and giant corporations that own vast amounts of information, and that the use of big data should be monitored and better regulated at the national and international levels.<ref>{{Cite news|url=http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|title=The Social Contract 2.0: Big Data and the Need to Guarantee Privacy and Civil Liberties – Harvard International Review|last=Al-Rodhan|first=Nayef|date=16 September 2014|work=Harvard International Review|access-date=3 April 2017|archive-url=https://web.archive.org/web/20170413090835/http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|archive-date=13 April 2017|url-status=dead}}</ref> Barocas and Nissenbaum argue that one way of protecting individual users is by being informed about the types of information being collected, with whom it is shared, under what constraints and for what purposes.<ref>{{Cite book|title=Big Data's End Run around Anonymity and Consent| last1 =Barocas |first1=Solon |last2=Nissenbaum |first2=Helen|last3=Lane|first3=Julia|last4=Stodden|first4=Victoria|last5=Bender|first5=Stefan|last6=Nissenbaum|first6=Helen| s2cid =152939392|date=June 2014| publisher =Cambridge University Press|isbn=9781107067356|pages=44–75|doi =10.1017/cbo9781107590205.004}}</ref>

第733行：第921行：

纳耶夫 · 阿尔罗德汉认为，在拥有大量信息的大数据和巨型公司的背景下，需要一种新型的社会契约来保护个人自由，大数据的使用应该在国家和国际层面受到监督和更好的管理。巴罗卡斯和尼森鲍姆认为，保护个人用户的一种方法是了解收集的信息类型、与谁共享、受到何种限制以及用于何种目的。

+

'''''【终译版】'''''。

===Critiques of the "V" model===

第752行：第942行：

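* Explainability and interpretability: humans desire to understand and accept what they understand, whereas algorithms do not cope with this

* Level of automated decision-making: algorithms that support automated decision-making and algorithmic self-learning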

+

'''''【终译版】'''''。

===Critiques of novelty===

第759行：第952行：

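Large data sets have been analyzed by computing machines for well over a century, including the US census analytics performed by IBM's punch-card machines, which computed statistics including means and variances of populations across the whole continent. In more recent decades, science experiments such as CERN have produced data on scales similar to current commercial "big data". However, science experiments have tended to analyze their data using specialized, custom-built high-performance computing (supercomputing) clusters and grids, rather than clouds of cheap commodity computers as in the current commercial wave, implying a difference in both culture and technology.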

+

'''''【终译版】'''''。

===Critiques of big data execution===

第768行：第963行：

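Ulf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research. Researcher danah boyd has raised concerns that the use of big data in science neglects principles such as choosing a representative sample, because of too great a concern with handling huge amounts of data. This approach may lead to results that are biased in one way or another. Integration across heterogeneous data resources (some that might be considered big data and others not) presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science. In the provocative article "Critical Questions for Big Data", the authors call big data a part of mythology: "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy". Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth". Recent developments in the BI domain, such as pro-active reporting, especially target improvements in the usability of big data through automated filtering of non-useful data and correlations (''Failure to Launch: From Big Data to Big Decisions'', Forte Wares). Big structures are full of spurious correlations, either because of non-causal coincidences (law of truly large numbers), the sheer nature of big randomness (Ramsey theory), or the existence of non-included factors, so the hope of early experimenters to make large databases of numbers "speak for themselves" and revolutionize the scientific method is questioned (Cristian S. Calude, Giuseppe Longo (2016), "The Deluge of Spurious Correlations in Big Data", ''Foundations of Science'').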

+

'''''【终译版】'''''。

Big data analysis is often shallow compared to analysis of smaller data sets.<ref name="kdnuggets-berchthold">{{cite web|url=http://www.kdnuggets.com/2014/08/interview-michael-berthold-knime-research-big-data-privacy-part2.html|title=Interview: Michael Berthold, KNIME Founder, on Research, Creativity, Big Data, and Privacy, Part 2|date=12 August 2014|author=Gregory Piatetsky| author-link= Gregory I. Piatetsky-Shapiro|publisher=KDnuggets|access-date=13 August 2014}}</ref> In many big data projects, no large-scale data analysis actually takes place; the challenge lies instead in the [[extract, transform, load]] part of data pre-processing.<ref name="kdnuggets-berchthold" />

第774行：第971行：

大数据分析与小数据集分析相比往往是肤浅的。在许多大数据项目中，没有大型的数据分析发生，但是挑战在于提取、转换和加载数据预处理数据的部分。

+

'''''【终译版】'''''。

Big data is a [[buzzword]] and a "vague term",<ref>{{cite news|last1=Pelt|first1=Mason|title="Big Data" is an over used buzzword and this Twitter bot proves it|url= http://siliconangle.com/blog/2015/10/26/big-data-is-an-over-used-buzzword-and-this-twitter-bot-proves-it/ |newspaper=Siliconangle|access-date=4 November 2015|date=26 October 2015}}</ref><ref name="ft-harford">{{cite web |url=http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html |title=Big data: are we making a big mistake? |last1=Harford |first1=Tim |date=28 March 2014 |website=[[Financial Times]] |access-date=7 April 2014}}</ref> but at the same time an "obsession"<ref name="ft-harford" /> with entrepreneurs, consultants, scientists, and the media. Big data showcases such as [[Google Flu Trends]] failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, [[Academy awards]] and election predictions solely based on Twitter were more often off than on target.

第790行：第989行：

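Big data is a buzzword and a "vague term", but at the same time an "obsession" with entrepreneurs, consultants, scientists, and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, Academy Awards and election predictions based solely on Twitter were more often off than on target. Big data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. In particular, data sources such as Twitter are not representative of the overall population, and results drawn from such sources may lead to wrong conclusions. Google Translate, which is based on big data statistical analysis of text, does a good job at translating web pages. However, results from specialized domains may be dramatically skewed. On the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant. Ioannidis argued that "most published research findings are false" due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a large amount of scientific data, although not with big data technology), the likelihood of a "significant" result being false grows fast, even more so when only positive results are published. Furthermore, big data analytics results are only as good as the model on which they are predicated. For example, big data took part in attempting to predict the results of the 2016 U.S. presidential election, with varying degrees of success.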
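
The multiple comparisons problem mentioned in this paragraph is easy to reproduce in a simulation. The sketch below assumes that every tested hypothesis is in fact null and relies on the standard result that, for a continuous test statistic, p-values under a true null hypothesis are uniformly distributed on [0, 1]; it therefore draws p-values directly instead of running real tests. The function names, the seed, and the choice of the Bonferroni correction as the remedy are assumptions made for this example.

```python
import random


def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(num_tests, alpha=0.05, seed=0):
    """Count 'significant' results when no real effect exists anywhere.

    Returns (count at the uncorrected threshold, count at the Bonferroni threshold).
    """
    rng = random.Random(seed)
    # Under a true null hypothesis, each p-value is uniform on [0, 1].
    p_values = [rng.random() for _ in range(num_tests)]
    uncorrected = sum(p < alpha for p in p_values)
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return uncorrected, bonferroni


uncorrected, bonferroni = simulate_false_positives(1000)
print(uncorrected, bonferroni)
print(round(family_wise_error_rate(20), 4))
```

With 1,000 tests at the 0.05 level, roughly 50 results are expected to look significant by chance alone, while the Bonferroni threshold of `alpha / num_tests` removes almost all of them at the cost of statistical power.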

+

'''''【终译版】'''''。

=== Critiques of big data policing and surveillance ===

第805行：第1,006行：

* Increasing the scope and number of people that are subject to law enforcement tracking and exacerbating existing racial overrepresentation in the criminal justice system

* Encouraging members of society to abandon interactions with institutions that would create a digital trace, thus creating obstacles to social inclusion

−

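* Placing suspected criminals under increased surveillance by using the justification of a mathematical and therefore unbiased algorithm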

第816行：第1,016行：

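If these potential problems are not corrected or regulated, the effects of big data policing may continue to shape societal hierarchies. Brayne also notes that conscientious use of big data policing could prevent individual-level biases from becoming institutional biases.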

+

'''''【终译版】'''''。

==In popular culture==

第823行：第1,025行：

*Moneyball is a non-fiction book that explores how the Oakland Athletics used statistical analysis to outperform teams with larger budgets. In 2011 a film adaptation starring Brad Pitt was released.

−

~~= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =~~ 《点球成金》是一本非小说类书籍，书中探讨了奥克兰运动家是如何利用统计分析来超越那些预算较大的团队的。2011年，由布拉德 · 皮特主演的改编电影上映。

+

《点球成金》是一本非小说类书籍，书中探讨了奥克兰运动家是如何利用统计分析来超越那些预算较大的团队的。2011年，由布拉德 · 皮特主演的改编电影上映。

+

'''''【终译版】'''''。

===Film===

第832行：第1,036行：

*In The Dark Knight, Batman uses a sonar device that can spy on all of Gotham City. The data is gathered from the mobile phones of people within the city.

−

~~= = = = =~~

* In Captain America: The Winter Soldier, H.Y.D.R.A (disguised as S.H.I.E.L.D) develops helicarriers that use data to determine and eliminate threats across the globe.

* 在《蝙蝠侠: 黑暗骑士》中，蝙蝠侠使用了一种可以监视整个哥谭市的声纳设备。这些数据是通过城市里人们的手机收集的。

+

'''''【终译版】'''''。

== See also ==

−

~~== See also ==~~

+

= 参见 =

−

= = 参见 = =

{{Category see also|LABEL=For a list of companies, and tools, see also|Big data}}

L（吕奥博）
