{{Short description|Information assets characterized by high volume, velocity, and variety}}
{{About|large collections of data|the band|Big Data (band)|the practice of buying and selling of personal and consumer data|Surveillance capitalism}}
{{Use dmy dates|date=January 2020}}
[[File:Hilbert InfoGrowth.png|thumb|right|400px|Non-linear growth of digital global information-storage capacity and the waning of analog storage<ref>{{cite journal|url= http://www.martinhilbert.net/WorldInfoCapacity.html|title= The World's Technological Capacity to Store, Communicate, and Compute Information|volume= 332|issue= 6025|pages= 60–65|journal=Science|access-date= 13 April 2016|bibcode= 2011Sci...332...60H|last1= Hilbert|first1= Martin|last2= López|first2= Priscila|year= 2011|doi= 10.1126/science.1200970|pmid= 21310967|s2cid= 206531385}}</ref>]]
'''Big data''' is a field that treats ways to analyze, systematically extract information from, or otherwise deal with [[data set]]s that are too large or complex to be dealt with by traditional [[data processing|data-processing]] [[application software]]. Data with many fields (columns) offer greater [[statistical power]], while data with higher complexity (more attributes or columns) may lead to a higher [[false discovery rate]].<ref>{{Cite journal|last=Breur|first=Tom|date=July 2016|title=Statistical Power Analysis and the contemporary "crisis" in social sciences|journal=Journal of Marketing Analytics |publisher=[[Palgrave Macmillan]]|location=London, England|volume=4 |issue=2–3 |pages=61–65 |doi=10.1057/s41270-016-0001-3 |issn=2050-3318|doi-access=free}}</ref> Big data analysis challenges include [[Automatic identification and data capture|capturing data]], [[Computer data storage|data storage]], [[data analysis]], search, [[Data sharing|sharing]], [[Data transmission|transfer]], [[Data visualization|visualization]], [[Query language|querying]], updating, [[information privacy]], and data source. Big data was originally associated with three key concepts: ''volume'', ''variety'', and ''velocity''.<ref name=":0" /> The analysis of big data presents sampling challenges, as traditional methods allowed only for limited observations and samples. Therefore, big data often includes data sets whose size exceeds the capacity of traditional software to process within an acceptable time while still extracting ''value''.
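The link between column count and the false discovery rate noted above can be illustrated with a short simulation (an illustrative sketch, not drawn from the cited sources): testing many unrelated noise columns against a target produces spuriously "significant" correlations by chance alone, and more columns mean more false discoveries.

```python
# Illustrative simulation: every column is pure noise, so any correlation
# that passes a significance threshold is a false discovery.
import math
import random

random.seed(0)
n_rows, n_cols = 100, 200
target = [random.gauss(0, 1) for _ in range(n_rows)]

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

false_discoveries = 0
for _ in range(n_cols):
    col = [random.gauss(0, 1) for _ in range(n_rows)]
    # |r| > 0.197 corresponds roughly to p < 0.05 for n = 100
    if abs(pearson_r(col, target)) > 0.197:
        false_discoveries += 1

print(false_discoveries)  # typically around 5% of the 200 noise columns
```

With a 5% significance threshold, roughly one in twenty unrelated columns "correlates" with the target, which is why wider data sets need multiple-testing corrections.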
Current usage of the term ''big data'' tends to refer to the use of [[predictive analytics]], [[user behavior analytics]], or certain other advanced data analytics methods that extract [[Data valuation|value]] from big data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem."<ref>{{cite journal |last1=boyd |first1=danah |last2=Crawford |first2=Kate |title=Six Provocations for Big Data |journal=Social Science Research Network: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society |date=21 September 2011 |doi= 10.2139/ssrn.1926431|s2cid=148610111 |url=http://osf.io/nrjhn/ }}</ref>
Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on".{{r|Economist}} Scientists, business executives, medical practitioners, advertising and [[Government database|governments]] alike regularly meet difficulties with large data-sets in areas including [[Web search engine|Internet searches]], [[fintech]], healthcare analytics, geographic information systems, urban informatics, and [[business informatics]]. Scientists encounter limitations in [[e-Science]] work, including [[meteorology]], [[genomics]],<ref>{{cite journal | title = Community cleverness required | journal = Nature | volume = 455 | issue = 7209 | pages = 1 | date = September 2008 | pmid = 18769385 | doi = 10.1038/455001a | bibcode = 2008Natur.455....1. | doi-access = free }}</ref> [[connectomics]], complex physics simulations, biology, and environmental research.<ref>{{cite journal | vauthors = Reichman OJ, Jones MB, Schildhauer MP | title = Challenges and opportunities of open data in ecology | journal = Science | volume = 331 | issue = 6018 | pages = 703–5 | date = February 2011 | pmid = 21311007 | doi = 10.1126/science.1197962 | bibcode = 2011Sci...331..703R | s2cid = 22686503 | url = https://escholarship.org/uc/item/7627s45z }}</ref>
The size and number of available data sets have grown rapidly as data is collected by devices such as [[mobile device]]s, cheap and numerous information-sensing [[Internet of things]] devices, aerial ([[remote sensing]]), software logs, [[Digital camera|cameras]], microphones, [[radio-frequency identification]] (RFID) readers and [[wireless sensor networks]].<ref>{{cite web |author= Hellerstein, Joe |title= Parallel Programming in the Age of Big Data |date= 9 November 2008 |work= Gigaom Blog |url= http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/}}</ref><ref>{{cite book |first1= Toby |last1= Segaran |first2= Jeff |last2= Hammerbacher |title= Beautiful Data: The Stories Behind Elegant Data Solutions |url= https://books.google.com/books?id=zxNglqU1FKgC |year= 2009 |publisher= O'Reilly Media |isbn= 978-0-596-15711-1 |page= 257}}</ref> The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;<ref name="martinhilbert.net">{{cite journal | vauthors = Hilbert M, López P | title = The world's technological capacity to store, communicate, and compute information | journal = Science | volume = 332 | issue = 6025 | pages = 60–5 | date = April 2011 | pmid = 21310967 | doi = 10.1126/science.1200970 | url = http://www.uvm.edu/pdodds/files/papers/others/2011/hilbert2011a.pdf | bibcode = 2011Sci...332...60H | s2cid = 206531385 }}</ref> {{As of|2012|lc=on}}, every day 2.5 [[exabyte]]s (2.5×2<sup>60</sup> bytes) of data are generated.<ref>{{cite web|url= http://www.ibm.com/big-data/us/en/ |title= IBM What is big data? – Bringing big data to the enterprise |publisher= ibm.com |access-date= 26 August 2013}}</ref> Based on an [[International Data Corporation|IDC]] report prediction, the global data volume was predicted to grow exponentially from 4.4 [[zettabyte]]s to 44 zettabytes between 2013 and 2020. 
By 2025, IDC predicts there will be 163 zettabytes of data.<ref>{{Cite web| url=https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf| title=Data Age 2025: The Evolution of Data to Life-Critical|last1=Reinsel|first1=David|last2=Gantz|first2=John|date=13 April 2017|website=seagate.com|publisher=[[International Data Corporation]]|location=Framingham, MA, US|access-date=2 November 2017|last3=Rydning|first3=John}}</ref> One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.<ref>Oracle and FSN, [http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity "Mastering Big Data: CFO Strategies to Transform Insight into Opportunity"] {{Webarchive|url=https://web.archive.org/web/20130804062518/http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity |date=4 August 2013 }}, December 2012</ref>
[[Relational database management system]]s and desktop statistical software packages used to visualize data often have difficulty processing and analyzing big data. The processing and analysis of big data may require "massively parallel software running on tens, hundreds, or even thousands of servers".<ref>{{cite web |author= Jacobs, A. |title= The Pathologies of Big Data |date= 6 July 2009 |work= ACMQueue |url= http://queue.acm.org/detail.cfm?id=1563874}}</ref> What qualifies as "big data" varies depending on the capabilities of those analyzing it and their tools. Furthermore, expanding capabilities make big data a moving target. "For some organizations, facing hundreds of [[gigabyte]]s of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."<ref>{{cite journal|last1=Magoulas|first1=Roger|last2=Lorica|first2=Ben|date=February 2009|title=Introduction to Big Data|url=https://academics.uccs.edu/~ooluwada/courses/datamining/ExtraReading/BigData|journal=Release 2.0|location=Sebastopol CA|publisher=O'Reilly Media|issue=11}}</ref>
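The "massively parallel software" quoted above can be sketched in miniature (a hypothetical word-count job; a local process pool stands in for a cluster of servers, and the data chunks are invented for illustration):

```python
# Minimal map/reduce-style sketch: split the data set into chunks,
# process each chunk on a separate worker, then merge the partial results.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count words in one chunk of the data set."""
    return Counter(chunk.split())

def merge(counters):
    """Reduce step: combine the partial counts from all workers."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    # In a real cluster, each chunk would live on a separate server.
    chunks = ["big data big servers", "data on many servers", "big clusters"]
    with Pool(processes=3) as pool:
        partial = pool.map(count_words, chunks)
    totals = merge(partial)
    print(totals["big"], totals["data"], totals["servers"])
```

Frameworks such as MapReduce generalize this pattern: the map and reduce steps stay the same while the runtime distributes chunks across thousands of machines.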
==Definition==
The term ''big data'' has been in use since the 1990s, with some giving credit to [[John Mashey]] for popularizing the term.<ref>{{Cite web |title= Big Data ... and the Next Wave of InfraStress |author= John R. Mashey |date= 25 April 1998 |publisher= Usenix |work= Slides from invited talk |url= http://static.usenix.org/event/usenix99/invited_talks/mashey.pdf |access-date= 28 September 2016 }}</ref><ref>{{cite news|title=The Origins of 'Big Data': An Etymological Detective Story |author=Steve Lohr |date= 1 February 2013 |url=http://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/ |work= [[The New York Times]] |access-date= 28 September 2016 }}</ref>
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to [[data acquisition|capture]], [[data curation|curate]], manage, and process data within a tolerable elapsed time.<ref name="Editorial">{{cite journal | last1 = Snijders | first1 = C. | last2 = Matzat | first2 = U. | last3 = Reips | first3 = U.-D. | year = 2012 | title = 'Big Data': Big gaps of knowledge in the field of Internet | url = http://www.ijis.net/ijis7_1/ijis7_1_editorial.html | journal = International Journal of Internet Science | volume = 7 | pages = 1–5 }}</ref> Big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data.<ref name="Springer 2017">{{cite book |chapter=Towards Differentiating Business Intelligence, Big Data, Data Analytics and Knowledge Discovery |last1=Dedić |first1=N. |title=Innovations in Enterprise Information Systems Management and Engineering |last2=Stanier |first2=C. |issn=1865-1356 |oclc=909580101 |publisher=Springer International Publishing |location=Berlin ; Heidelberg |year=2017 |volume= 285|pages=114–122 |doi=10.1007/978-3-319-58801-8_10 |series=Lecture Notes in Business Information Processing |isbn=978-3-319-58800-1 |chapter-url=http://eprints.staffs.ac.uk/3551/1/Towards%20Differentiating%20Business%20Intelligence%20Big%20Data%20Data%20Analytics%20and%20Knowldge%20Discovery.docx }}</ref> Big data "size" is a constantly moving target; {{As of|2012|lc=on}} ranging from a few dozen terabytes to many [[zettabyte]]s of data.<ref name="Everts">{{cite magazine|last1=Everts |first1=Sarah |title=Information Overload |magazine=[[Distillations (magazine)|Distillations]] |date=2016| volume=2|issue=2|pages=26–33|url =https://www.sciencehistory.org/distillations/magazine/information-overload| access-date=22 March 2018}}</ref>
Big data requires a set of techniques and technologies with new forms of [[data integration|integration]] to reveal insights from [[Data set|data-sets]] that are diverse, complex, and of a massive scale.<ref>{{cite journal | last1 = Ibrahim | last2 = Targio Hashem | first2 = Abaker | last3 = Yaqoob | first3 = Ibrar | last4 = Badrul Anuar | first4 = Nor | last5 = Mokhtar | first5 = Salimah | last6 = Gani | first6 = Abdullah | last7 = Ullah Khan | first7 = Samee | year = 2015 | title = The rise of "big data" on cloud computing: Review and open research issues | journal = Information Systems | volume = 47 | pages = 98–115 | doi = 10.1016/j.is.2014.07.006 }}</ref>
"Variety", "veracity", and various other "Vs" are added by some organizations to describe it, a revision challenged by some industry authorities.<ref>{{cite magazine|last=Grimes|first=Seth|title=Big Data: Avoid 'Wanna V' Confusion| url=http://www.informationweek.com/big-data/big-data-analytics/big-data-avoid-wanna-v-confusion/d/d-id/1111077|magazine=[[InformationWeek]]|access-date = 5 January 2016}}</ref> The Vs of big data were often referred to as the "three Vs", "four Vs", and "five Vs". They represented the qualities of big data in volume, variety, velocity, [[veracity (data)|veracity]], and value.<ref name=":0">{{Cite web|date=2016-09-17|title=The 5 V's of big data|url=https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/|access-date=2021-01-20|website=Watson Health Perspectives|language=en-US}}</ref> Variability is often included as an additional quality of big data.
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by [[Relational database|Codd's relational model]]."<ref>{{Cite book|last=Fox|first=Charles|date=25 March 2018|title=Data Science for Transport| url=https://www.springer.com/us/book/9783319729527|publisher=Springer|isbn=9783319729527|series=Springer Textbooks in Earth Sciences, Geography and Environment}}</ref>
In a comparative study of big datasets, [[Rob Kitchin|Kitchin]] and McArdle found that none of the commonly considered characteristics of big data appear consistently across all of the analyzed cases.<ref>{{cite journal | last1 = Kitchin | first1 = Rob | last2 = McArdle | first2 = Gavin | year = 2016 | title = What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets | journal = Big Data & Society | volume = 3 | pages = 1–10 | doi = 10.1177/2053951716631130 | s2cid = 55539845 }}</ref> For this reason, other studies identified the redefinition of power dynamics in knowledge discovery as the defining trait.<ref>{{cite journal | last1 = Balazka | first1 = Dominik | last2 = Rodighiero | first2 = Dario | year = 2020 | title = Big Data and the Little Big Bang: An Epistemological (R)evolution | journal = Frontiers in Big Data | volume = 3 | page = 31 | doi = 10.3389/fdata.2020.00031 | pmid = 33693404 | pmc = 7931920 | hdl = 1721.1/128865 | hdl-access = free | doi-access = free }}</ref> Instead of focusing on intrinsic characteristics of big data, this alternative perspective pushes forward a relational understanding of the object claiming that what matters is the way in which data is collected, stored, made available and analyzed.
=== Big data vs. business intelligence ===
The growing maturity of the concept more starkly delineates the difference between "big data" and "[[business intelligence]]":<ref>{{cite web| url =http://www.bigdataparis.com/presentation/mercredi/PDelort.pdf?PHPSESSID=tv7k70pcr3egpi2r6fi3qbjtj6#page=4 |format=PDF|title=avec focalisation sur Big Data & Analytique |website=Bigdataparis.com|access-date=8 October 2017}}</ref>
* Business intelligence uses applied mathematics tools and [[descriptive statistics]] with data with high information density to measure things, detect trends, etc.
* Big data uses mathematical analysis, optimization, [[inductive statistics]], and concepts from [[nonlinear system identification]]<ref name="SAB1">Billings S.A. "Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains". Wiley, 2013</ref> to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density<ref>{{cite web|url=http://www.andsi.fr/tag/dsi-big-data/|title=le Blog ANDSI » DSI Big Data| website=Andsi.fr |access-date=8 October 2017}}</ref> to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors.<ref name="SAB1" /><ref>{{cite web|url=http://lecercle.lesechos.fr/entrepreneur/tendances-innovation/221169222/big-data-low-density-data-faible-densite-information-com|title=Les Echos – Big Data car Low-Density Data ? La faible densité en information comme facteur discriminant – Archives|author=Les Echos|date=3 April 2013|website=Lesechos.fr|access-date=8 October 2017}}</ref>{{promotional source|date=December 2018}}
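The contrast drawn above can be made concrete with a minimal, hypothetical sketch of inductive statistics: inferring a law (here a linear regression) from a large, noisy, low-information-density data set, where any single record says little but the set as a whole reveals the relationship.

```python
# Illustrative sketch: each record is very noisy (low information density),
# but ordinary least squares over many records recovers the hidden law.
import random

random.seed(1)
n = 10_000
xs = [random.uniform(0, 10) for _ in range(n)]
# Hidden law y = 2x + 1, buried under heavy per-record noise.
ys = [2.0 * x + 1.0 + random.gauss(0, 5) for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
slope = num / den
intercept = mean_y - slope * mean_x
print(slope, intercept)  # close to the hidden 2.0 and 1.0
```

Descriptive statistics over the same records would only summarize the noise; the regression recovers the underlying relationship precisely because the data set is large.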
==Characteristics==
[[File: Big Data.png|thumb|Shows the growth of big data's primary characteristics of volume, velocity, and variety]]
Big data can be described by the following characteristics:
; Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not. The size of big data is usually larger than terabytes and petabytes.<ref>{{cite journal |last1=Sagiroglu |first1=Seref |title=Big data: A review |journal=2013 International Conference on Collaboration Technologies and Systems (CTS) |date=2013 |pages=42–47 |doi=10.1109/CTS.2013.6567202|isbn=978-1-4673-6404-1 |s2cid=5724608 }}</ref>
; Variety: The type and nature of the data. Earlier technologies such as RDBMSs were capable of handling structured data efficiently and effectively. However, the shift from structured to semi-structured or unstructured data challenged the existing tools and technologies. Big data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured (variety) data generated at high speed (velocity) and huge in size (volume). Later, these tools and technologies were also applied to structured data, primarily for storage. Processing of structured data remained optional, using either big data tools or traditional RDBMSs. This helps in making effective use of the hidden insights in data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, and video; it also completes missing pieces through [[data fusion]].
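The transformation from semi-structured to structured data described under "variety" can be sketched as follows (field names and records are illustrative, not from the article): uneven JSON records are normalized into fixed-width rows of the kind an RDBMS expects.

```python
# Illustrative sketch: semi-structured records with uneven fields are
# flattened into a fixed schema, with missing fields made explicit.
import json

raw = """
[{"user": "a", "likes": 3, "tags": ["news"]},
 {"user": "b", "tags": ["sport", "tv"]},
 {"user": "c", "likes": 7}]
"""

columns = ("user", "likes", "tags")
rows = []
for record in json.loads(raw):
    # Missing fields become explicit None values (NULL in SQL terms).
    rows.append(tuple(record.get(col) for col in columns))

for row in rows:
    print(row)
```

Once flattened to a fixed schema, the data can be loaded into a conventional table, while the original semi-structured form is typically retained in big data storage.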
; Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to [[small data]], big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.<ref>{{cite journal |last1=Kitchin |first1=Rob |last2=McArdle |first2=Gavin |title=What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets |journal=Big Data & Society |date=17 February 2016 |volume=3 |issue=1 |pages=205395171663113 |doi=10.1177/2053951716631130|doi-access=free }}</ref>
;Veracity: The truthfulness or reliability of the data, which refers to the data quality and the data value.<ref>{{Cite journal|last1=Onay|first1=Ceylan|last2=Öztürk|first2=Elif|date=2018|title=A review of credit scoring research in the age of Big Data|journal=Journal of Financial Regulation and Compliance|volume=26|issue=3|pages=382–405|doi=10.1108/JFRC-06-2017-0054|s2cid=158895306}}</ref> Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it. The [[data quality]] of captured data can vary greatly, affecting an accurate analysis.<ref>[https://web.archive.org/web/20180731105912/https://spotlessdata.com/blog/big-datas-fourth-v Big Data's Fourth V]</ref>
; Value: The worth in information that can be achieved by the processing and analysis of large datasets. Value also can be measured by an assessment of the other qualities of big data.<ref>{{Cite web|title=Measuring the Business Value of Big Data {{!}} IBM Big Data & Analytics Hub|url=https://www.ibmbigdatahub.com/blog/measuring-business-value-big-data|access-date=2021-01-20|website=www.ibmbigdatahub.com}}</ref> Value may also represent the profitability of information that is retrieved from the analysis of big data.
; Variability: The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data.
Other possible characteristics of big data are:<ref>{{Cite journal|last1=Kitchin|first1=Rob|last2=McArdle|first2=Gavin|date=5 January 2016|title=What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets|journal=Big Data & Society|language=en|volume=3|issue=1|pages=205395171663113|doi=10.1177/2053951716631130|issn=2053-9517|doi-access=free}}</ref>
;Exhaustive: Whether the entire system (i.e., <math display="inline">n</math>=all) is captured or recorded or not. Big data may or may not include all the available data from sources.
; Fine-grained and uniquely lexical: Respectively, the proportion of specific data collected for each element, and whether each element and its characteristics are properly indexed or identified.
; Relational: If the data collected contains common fields that would enable a conjoining, or meta-analysis, of different data sets.
; Extensional: If new fields in each element of the data collected can be added or changed easily.
; Scalability: If the size of the big data storage system can expand rapidly.
==Architecture==
Big data repositories have existed in many forms, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s. For many years, WinterCorp published the largest database report.<ref>{{cite web |url=http://www.eweek.com/database/survey-biggest-databases-approach-30-terabytes|title=Survey: Biggest Databases Approach 30 Terabytes|website=Eweek.com|date=8 November 2003|access-date=8 October 2017}}</ref>{{promotional source|date=December 2018}}
[[Teradata]] Corporation in 1984 marketed the parallel processing [[DBC 1012]] system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves. Teradata installed the first petabyte-class RDBMS-based system in 2007. {{as of|2017}}, there are a few dozen petabyte-class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added unstructured data types including [[XML]], [[JSON]], and Avro.
In 2000, Seisint Inc. (now [[LexisNexis Risk Solutions]]) developed a [[C++]]-based distributed platform for data processing and querying known as the [[HPCC Systems]] platform. This system automatically partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across multiple commodity servers. Users can write data processing pipelines and queries in a declarative dataflow programming language called ECL. Data analysts working in ECL are not required to define data schemas upfront and can rather focus on the particular problem at hand, reshaping data in the best possible manner as they develop the solution. In 2004, LexisNexis acquired Seisint Inc.<ref>{{cite news| url=https://www.washingtonpost.com/wp-dyn/articles/A50577-2004Jul14.html|title=LexisNexis To Buy Seisint For $775 Million|newspaper=[[The Washington Post]]|access-date=15 July 2004}}</ref> and their high-speed parallel processing platform and successfully used this platform to integrate the data systems of Choicepoint Inc. when they acquired that company in 2008.<ref>[https://www.washingtonpost.com/wp-dyn/content/article/2008/02/21/AR2008022100809.html The Washington Post]</ref> In 2011, the HPCC systems platform was open-sourced under the Apache v2.0 License.
[[CERN]] and other physics experiments have collected big data sets for many decades, usually analyzed via [[high-throughput computing]] rather than the map-reduce architectures usually meant by the current "big data" movement.
In 2004, [[Google]] published a paper on a process called [[MapReduce]] that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the "map" step). The results are then gathered and delivered (the "reduce" step). The framework was very successful,<ref>Bertolucci, Jeff [http://www.informationweek.com/software/hadoop-from-experiment-to-leading-big-data-platform/d/d-id/1110491? "Hadoop: From Experiment To Leading Big Data Platform"], "Information Week", 2013. Retrieved on 14 November 2013.</ref> so others wanted to replicate the algorithm. Therefore, an [[implementation]] of the MapReduce framework was adopted by an Apache open-source project named "[[Apache Hadoop|Hadoop]]".<ref>Webster, John. [http://research.google.com/archive/mapreduce-osdi04.pdf "MapReduce: Simplified Data Processing on Large Clusters"], "Search Storage", 2004. Retrieved on 25 March 2013.</ref> [[Apache Spark]] was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reducing).
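The split-map-gather-reduce flow described above can be sketched in miniature with plain Python. This is an illustrative single-process simulation, not Hadoop or Spark code; the word-count task and document list are hypothetical, and in a real cluster the map calls would run on separate parallel nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # "Map" step: emit a (key, value) pair for every word in one document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # "Reduce" step: gather all values for each key and combine them.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data big analysis", "data analysis"]

# The map step would be distributed across parallel nodes; here it is
# simply applied to each document in turn, then the results gathered.
mapped = chain.from_iterable(map_phase(d) for d in documents)
result = reduce_phase(mapped)
# result == {"big": 2, "data": 2, "analysis": 2}
```

Spark's later generalization amounts to allowing arbitrary chains of such operations rather than a single map followed by a single reduce.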
[[MIKE2.0 Methodology|MIKE2.0]] is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled "Big Data Solution Offering".<ref>{{cite web| url=http://mike2.openmethodology.org/wiki/Big_Data_Solution_Offering| title=Big Data Solution Offering|publisher=MIKE2.0|access-date=8 December 2013}}</ref> The methodology addresses handling big data in terms of useful [[permutation]]s of data sources, [[complexity]] in interrelationships, and difficulty in deleting (or modifying) individual records.<ref>{{cite web|url=http://mike2.openmethodology.org/wiki/Big_Data_Definition|title=Big Data Definition|publisher=MIKE2.0|access-date=9 March 2013}}</ref>
Studies in 2012 showed that a multiple-layer architecture was one option to address the issues that big data presents. A [[List of file systems#Distributed parallel file systems|distributed parallel]] architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end-user by using a front-end application server.<ref>{{cite journal|last=Boja|first=C|author2=Pocovnicu, A |author3=Bătăgan, L. |title=Distributed Parallel Architecture for Big Data|journal=Informatica Economica|year=2012 |volume=16|issue=2| pages=116–127}}</ref>
The [[data lake]] allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.<ref>{{cite web|url= http://www.hcltech.com/sites/default/files/solving_key_businesschallenges_with_big_data_lake_0.pdf|title=Solving Key Business Challenges With a Big Data Lake|date=August 2014| website=Hcltech.com|access-date=8 October 2017}}</ref><ref>{{ cite web| url= https://secplab.ppgia.pucpr.br/files/papers/2015-0.pdf | title= Method for testing the fault tolerance of MapReduce frameworks | publisher=Computer Networks | year=2015}}</ref>
==Technologies==
A 2011 [[McKinsey & Company|McKinsey Global Institute]] report characterizes the main components and ecosystem of big data as follows:<ref name="McKinsey">{{cite journal | last1 = Manyika | first1 = James | first2 = Michael | last2 = Chui | first3 = Jaques | last3 = Bughin | first4 = Brad | last4 = Brown | first5 = Richard | last5 = Dobbs | first6 = Charles | last6 = Roxburgh | first7 = Angela Hung | last7 = Byers | title = Big Data: The next frontier for innovation, competition, and productivity | publisher = McKinsey Global Institute | date = May 2011 | url = https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/big%20data%20the%20next%20frontier%20for%20innovation/mgi_big_data_full_report.pdf | access-date = 22 May 2021 }}</ref>
* Techniques for analyzing data, such as [[A/B testing]], [[machine learning]], and [[natural language processing]]
* Big data technologies, like [[business intelligence]], [[cloud computing]], and [[database]]s
* Visualization, such as charts, graphs, and other displays of the data
Multidimensional big data can also be represented as [[OLAP]] data cubes or, mathematically, [[tensor]]s. [[Array DBMS|Array database systems]] have set out to provide storage and high-level query support on this data type.
Additional technologies being applied to big data include efficient tensor-based computation,<ref>{{cite web |title=Future Directions in Tensor-Based Computation and Modeling |date=May 2009|url=http://www.cs.cornell.edu/cv/tenwork/finalreport.pdf}}</ref> such as [[multilinear subspace learning]],<ref name="MSLsurvey">{{cite journal | first1 = Haiping | last1 = Lu | first2 = K.N. | last2 = Plataniotis | first3 = A.N. | last3 = Venetsanopoulos | url = http://www.dsp.utoronto.ca/~haiping/Publication/SurveyMSL_PR2011.pdf | title = A Survey of Multilinear Subspace Learning for Tensor Data | journal = Pattern Recognition | volume = 44 | number = 7 | pages = 1540–1551 | year = 2011 | doi = 10.1016/j.patcog.2011.01.004 | bibcode = 2011PatRe..44.1540L }}</ref> massively parallel-processing ([[Massive parallel processing|MPP]]) databases, [[search-based application]]s, [[data mining]],<ref>{{cite book|last1=Pllana|first1=Sabri|title=2011 14th International Conference on Network-Based Information Systems|pages=341–348|last2=Janciak|first2=Ivan|last3=Brezany|first3=Peter|last4=Wöhrer|first4=Alexander|chapter=A Survey of the State of the Art in Data Mining and Integration Query Languages |website=2011 International Conference on Network-Based Information Systems (NBIS 2011)|publisher=IEEE Computer Society|bibcode=2016arXiv160301113P|year=2016|arxiv=1603.01113|doi=10.1109/NBiS.2011.58|isbn=978-1-4577-0789-6|s2cid=9285984}}</ref> [[distributed file system]]s, distributed cache (e.g., [[burst buffer]] and [[Memcached]]), [[distributed database]]s, [[cloud computing|cloud]] and [[supercomputer|HPC-based]] infrastructure (applications, storage and computing resources),<ref>{{cite book|chapter=Characterization and Optimization of Memory-Resident MapReduce on HPC Systems|publisher=IEEE|date=October 2014|doi=10.1109/IPDPS.2014.87|title=2014 IEEE 28th International Parallel and Distributed Processing 
Symposium|pages=799–808|last1=Wang|first1=Yandong|last2=Goldstone|first2=Robin|last3=Yu|first3=Weikuan|last4=Wang|first4=Teng|s2cid=11157612|isbn=978-1-4799-3800-1}}</ref> and the Internet.{{Citation needed|date=September 2011}} Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data.<ref>{{Cite journal|last1=L'Heureux|first1=A.|last2=Grolinger|first2=K.|last3=Elyamany|first3=H. F.|last4=Capretz|first4=M. A. M.|date=2017|title=Machine Learning With Big Data: Challenges and Approaches|journal=IEEE Access|volume=5|pages=7776–7797|doi=10.1109/ACCESS.2017.2696365|issn=2169-3536|doi-access=free}}</ref>
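The correspondence between OLAP data cubes and tensors can be illustrated with a small array; the sales cube and its axis labels below are hypothetical, and NumPy stands in for a full array database system:

```python
import numpy as np

# Hypothetical sales cube: axes are (region, product, month).
sales = np.arange(24).reshape(2, 3, 4)

# An OLAP "roll-up" over months corresponds to summing the tensor
# along the month axis, yielding a region-by-product summary table.
by_region_product = sales.sum(axis=2)   # shape (2, 3)

# Rolling up again over products gives per-region totals.
by_region = by_region_product.sum(axis=1)   # shape (2,)
```

An array DBMS would evaluate the same kind of axis-wise aggregation as a declarative query over data too large to hold in memory.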
Some [[Massive parallel processing|MPP]] relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the [[RDBMS]].<ref>{{cite web |author=Monash, Curt |title=eBay's two enormous data warehouses |date=30 April 2009 |url=http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/}}<br />{{cite web |author=Monash, Curt |title=eBay followup – Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more |date=6 October 2010 |url =http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/}}</ref>{{promotional source|date=December 2018}}
[[DARPA]]'s [[Topological Data Analysis]] program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called "Ayasdi".<ref>{{cite web|url=http://www.ayasdi.com/resources/|title=Resources on how Topological Data Analysis is used to analyze big data|publisher=Ayasdi}}</ref>{{thirdpartyinline|date=December 2018}}
The practitioners of big data analytics processes are generally hostile to slower shared storage,<ref>{{cite web |title=Storage area networks need not apply |author=CNET News |date=1 April 2011 |url=http://news.cnet.com/8301-21546_3-20049693-10253464.html}}</ref> preferring direct-attached storage ([[Direct-attached storage|DAS]]) in its various forms from solid state drive ([[SSD]]) to high capacity [[Serial ATA|SATA]] disk buried inside parallel processing nodes. The perception of shared storage architectures—[[storage area network]] (SAN) and [[network-attached storage]] (NAS)— is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in direct-attached memory or disk is good—data on memory or disk at the other end of an [[Fiber connector|FC]] [[Storage area network|SAN]] connection is not. The cost of an [[Storage area network|SAN]] at the scale needed for analytics applications is much higher than other storage techniques.
==Applications==
[[File:2013-09-11 Bus wrapped with SAP Big Data parked outside IDF13 (9730051783).jpg|thumb|Bus wrapped with [[SAP AG|SAP]] big data parked outside [[Intel Developer Forum|IDF13]].]]
Big data has increased the demand of information management specialists so much so that [[Software AG]], [[Oracle Corporation]], [[IBM]], [[Microsoft]], [[SAP AG|SAP]], [[EMC Corporation|EMC]], [[Hewlett-Packard|HP]], and [[Dell]] have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.{{r|Economist}}
Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet.{{r|Economist}} Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 [[petabytes]] in 1986, 471 [[petabytes]] in 1993, 2.2 exabytes in 2000, 65 [[exabytes]] in 2007<ref name="martinhilbert.net"/> and predictions put the amount of internet traffic at 667 exabytes annually by 2014.{{r|Economist}} According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data,<ref name="HilbertContent">{{cite journal|title= What is the Content of the World's Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?| doi= 10.1080/01972243.2013.873748 | volume=30| issue=2 |journal=The Information Society|pages=127–143|year = 2014|last1 = Hilbert|first1 = Martin| s2cid= 45759014 | url= https://escholarship.org/uc/item/87w5f6wb }}</ref> which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e. in the form of video and audio content).
While many vendors offer off-the-shelf products for big data, experts promote the development of in-house custom-tailored systems if the company has sufficient technical capabilities.<ref>{{cite web |url=http://www.kdnuggets.com/2014/07/interview-amy-gershkoff-ebay-in-house-BI-tools.html |title=Interview: Amy Gershkoff, Director of Customer Analytics & Insights, eBay on How to Design Custom In-House BI Tools |last1=Rajpurohit |first1=Anmol |date=11 July 2014 |website= KDnuggets|access-date=14 July 2014|quote=Generally, I find that off-the-shelf business intelligence tools do not meet the needs of clients who want to derive custom insights from their data. Therefore, for medium-to-large organizations with access to strong technical talent, I usually recommend building custom, in-house solutions.}}</ref>
===Government===
The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation,<ref>{{cite magazine|url =http://www.computerworld.com/article/2472667/government-it/the-government-and-big-data--use--problems-and-potential.html |title=The Government and big data: Use, problems and potential |date=21 March 2012 |magazine=[[Computerworld]] |access-date=12 September 2016}}</ref> but does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome. A common government organization that makes use of big data is the National Security Agency ([[National Security Agency|NSA]]), which constantly monitors Internet activity in search of potential patterns of suspicious or illegal activities its system may pick up.
[[Civil registration and vital statistics]] (CRVS) collects all certificate statuses from birth to death. CRVS is a source of big data for governments.
===International development===
Research on the effective usage of information and communication technologies for development (also known as "ICT4D") suggests that big data technology can make important contributions but also present unique challenges to [[international development]].<ref>{{cite web| url=http://www.unglobalpulse.org/projects/BigDataforDevelopment |title=White Paper: Big Data for Development: Opportunities & Challenges (2012) – United Nations Global Pulse| website=Unglobalpulse.org |access-date=13 April 2016}}</ref><ref>{{cite web| title=WEF (World Economic Forum), & Vital Wave Consulting. (2012). Big Data, Big Impact: New Possibilities for International Development|work= World Economic Forum|access-date=24 August 2012| url= http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development}}</ref> Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, [[economic productivity]], crime, security, and [[natural disaster]] and resource management.<ref name="HilbertBigData2013" /><ref>{{cite web|url=http://blogs.worldbank.org/ic4d/four-ways-to-talk-about-big-data/|title=Elena Kvochko, Four Ways To talk About Big Data (Information Communication Technologies for Development Series)|publisher=worldbank.org|access-date=30 May 2012|date=4 December 2012}}</ref><ref>{{cite web| title=Daniele Medri: Big Data & Business: An on-going revolution| url=http://www.statisticsviews.com/details/feature/5393251/Big-Data--Business-An-on-going-revolution.html| publisher=Statistics Views| date=21 October 2013| access-date=21 June 2015| archive-date=17 June 2015| archive-url=https://web.archive.org/web/20150617211645/http://www.statisticsviews.com/details/feature/5393251/Big-Data--Business-An-on-going-revolution.html| url-status=dead}}</ref> Additionally, user-generated data offers new opportunities to give the unheard a voice.<ref>{{cite web|title=Responsible use of 
data|author=Tobias Knobloch and Julia Manske|work= D+C, Development and Cooperation|date=11 January 2016|url= http://www.dandc.eu/en/article/opportunities-and-risks-user-generated-and-automatically-compiled-data}}</ref> However, longstanding challenges for developing regions such as inadequate technological infrastructure and economic and human resource scarcity exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.<ref name="HilbertBigData2013" /> The challenge of "big data for development"<ref name="HilbertBigData2013" /> is currently evolving toward the application of this data through machine learning, known as "artificial intelligence for development" (AI4D).<ref>Mann, S., & Hilbert, M. (2020). AI4D: Artificial Intelligence for Development. International Journal of Communication, 14(0), 21. https://www.martinhilbert.net/ai4d-artificial-intelligence-for-development/</ref>
====Benefits====
A major practical application of big data for development has been "fighting poverty with data".<ref>Blumenstock, J. E. (2016). Fighting poverty with data. Science, 353(6301), 753–754. https://doi.org/10.1126/science.aah5217</ref> In 2015, Blumenstock and colleagues predicted poverty and wealth from mobile phone metadata,<ref>Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264), 1073–1076. https://doi.org/10.1126/science.aac4420</ref> and in 2016 Jean and colleagues combined satellite imagery and machine learning to predict poverty.<ref>Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790–794. https://doi.org/10.1126/science.aaf7894</ref> Using digital trace data to study the labor market and the digital economy in Latin America, Hilbert and colleagues<ref name="HilbertJobMarket">Hilbert, M., & Lu, K. (2020). The online job market trace in Latin America and the Caribbean (UN ECLAC LC/TS.2020/83; p. 79). United Nations Economic Commission for Latin America and the Caribbean. https://www.cepal.org/en/publications/45892-online-job-market-trace-latin-america-and-caribbean</ref><ref>UN ECLAC, (United Nations Economic Commission for Latin America and the Caribbean). (2020). Tracking the digital footprint in Latin America and the Caribbean: Lessons learned from using big data to assess the digital economy (Productive Development, Gender Affairs LC/TS.2020/12; Documentos de Proyecto). United Nations ECLAC. https://repositorio.cepal.org/handle/11362/45484</ref> argue that digital trace data has several benefits such as:
* Thematic coverage: including areas that were previously difficult or impossible to measure
* Geographical coverage: our international sources provided sizable and comparable data for almost all countries, including many small countries that usually are not included in international inventories
* Level of detail: providing fine-grained data with many interrelated variables, and new aspects, like network connections
* Timeliness and timeseries: graphs can be produced within days of being collected
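The timeliness point can be illustrated with a minimal sketch: aggregating timestamped trace records (e.g. scraped job postings) into a daily time series using only the Python standard library. The data and names below are invented for illustration, not taken from the cited studies.

```python
from collections import Counter
from datetime import datetime

def daily_counts(records):
    """Aggregate ISO-8601 timestamped trace records into per-day counts,
    the kind of time series that can be charted within days of collection."""
    counts = Counter(datetime.fromisoformat(ts).date().isoformat() for ts in records)
    return dict(sorted(counts.items()))

# Hypothetical job-posting timestamps scraped from an online platform
records = [
    "2020-03-01T09:15:00", "2020-03-01T17:40:00",
    "2020-03-02T08:05:00", "2020-03-03T12:30:00", "2020-03-03T22:10:00",
]
series = daily_counts(records)
```

The same aggregation extends naturally to other trace sources; only the parsing step changes.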
====Challenges====
At the same time, working with digital trace data instead of traditional survey data does not eliminate the traditional challenges involved when working in the field of international quantitative analysis. Priorities change, but the basic discussions remain the same. Among the main challenges are:
* Representativeness. While traditional development statistics is mainly concerned with the representativeness of random survey samples, digital trace data is never a random sample.<ref>{{Cite journal|last1=Banerjee|first1=Amitav|last2=Chaudhury|first2=Suprakash|date=2010|title=Statistics without tears: Populations and samples|journal=Industrial Psychiatry Journal|volume=19|issue=1|pages=60–65|doi=10.4103/0972-6748.77642|issn=0972-6748|pmc=3105563|pmid=21694795}}</ref>
* Generalizability. While observational data represents its source well, it represents only that source, and nothing more. While it is tempting to generalize from specific observations of one platform to broader settings, this is often very deceptive.
* Harmonization. Digital trace data still requires international harmonization of indicators. It adds the challenge of so-called "data-fusion", the harmonization of different sources.
* Data overload. Analysts and institutions are not used to dealing effectively with a large number of variables, which is efficiently done with interactive dashboards. Practitioners still lack a standard workflow that would allow researchers, users and policymakers to work with such data efficiently and effectively.<ref name="HilbertJobMarket" />
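One standard partial remedy for the representativeness problem is post-stratification: reweighting a skewed trace sample toward known population shares. The sketch below uses invented group labels and numbers purely for illustration; it is not a method from the cited sources.

```python
def poststratify(sample_shares, population_shares, group_means):
    """Reweight group-level estimates from a non-random digital-trace
    sample toward known population shares (post-stratification).
    All inputs are dicts keyed by demographic group."""
    weights = {g: population_shares[g] / sample_shares[g] for g in sample_shares}
    return sum(sample_shares[g] * weights[g] * group_means[g] for g in sample_shares)

# Illustrative: mobile-phone traces skew urban relative to the census
sample_shares = {"urban": 0.8, "rural": 0.2}       # share of trace records
population_shares = {"urban": 0.5, "rural": 0.5}   # census shares
group_means = {"urban": 40.0, "rural": 10.0}       # e.g. an activity index

naive = sum(sample_shares[g] * group_means[g] for g in group_means)
adjusted = poststratify(sample_shares, population_shares, group_means)
```

The naive estimate overweights the over-sampled urban group; the adjusted figure equals the population-weighted mean. Reweighting only corrects for groups whose shares are known, so it mitigates rather than eliminates the bias.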
===Healthcare===
Big data analytics has been used in healthcare to provide personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, reduction of waste and care variability, automated external and internal reporting of patient data, standardized medical terms, and patient registries.<ref name="ref135">{{cite journal | vauthors = Huser V, Cimino JJ | title = Impending Challenges for the Use of Big Data | journal = International Journal of Radiation Oncology, Biology, Physics | volume = 95 | issue = 3 | pages = 890–894 | date = July 2016 | pmid = 26797535 | pmc = 4860172 | doi = 10.1016/j.ijrobp.2015.10.060 }}</ref><ref>{{Cite book|title=Signal Processing and Machine Learning for Biomedical Big Data.|others=Sejdić, Ervin, Falk, Tiago H.|isbn=9781351061216|location=[Place of publication not identified]|oclc=1044733829|last1 = Sejdic|first1 = Ervin|last2 = Falk|first2 = Tiago H.|date = 4 July 2018}}</ref><ref>{{cite journal | vauthors = Raghupathi W, Raghupathi V | title = Big data analytics in healthcare: promise and potential | journal = Health Information Science and Systems | volume = 2 | issue = 1 | pages = 3 | date = December 2014 | pmid = 25825667 | pmc = 4341817 | doi = 10.1186/2047-2501-2-3 }}</ref><ref>{{cite journal | vauthors = Viceconti M, Hunter P, Hose R | title = Big data, big knowledge: big data for personalized healthcare | journal = IEEE Journal of Biomedical and Health Informatics | volume = 19 | issue = 4 | pages = 1209–15 | date = July 2015 | pmid = 26218867 | doi = 10.1109/JBHI.2015.2406883 | s2cid = 14710821 | url = http://eprints.whiterose.ac.uk/89104/1/pap%20JBHI%20BigData%20in%20VPH%20revision%20v2.pdf | doi-access = free }}</ref> Some areas of improvement are more aspirational than actually implemented. The amount of data generated within [[Health system|healthcare systems]] is not trivial. With the added adoption of mHealth, eHealth and wearable technologies, the volume of data will continue to increase.
This includes [[electronic health record]] data, imaging data, patient-generated data, sensor data, and other forms of data that are difficult to process. There is now an even greater need for such environments to pay greater attention to data and information quality.<ref>{{cite journal|title=Data Management Within mHealth Environments: Patient Sensors, Mobile Devices, and Databases |first1=John| last1=O'Donoghue |first2=John|last2=Herbert|s2cid=2318649|date=1 October 2012|volume=4|issue=1|pages=5:1–5:20| doi=10.1145/2378016.2378021 |journal=Journal of Data and Information Quality}}</ref> "Big data very often means '[[dirty data]]' and the fraction of data inaccuracies increases with data volume growth." Human inspection at big-data scale is impossible, and health services urgently need intelligent tools to control accuracy and believability and to handle missed information.<ref name="Mirkes2016">{{cite journal | vauthors = Mirkes EM, Coats TJ, Levesley J, Gorban AN | title = Handling missing data in large healthcare dataset: A case study of unknown trauma outcomes | journal = Computers in Biology and Medicine | volume = 75 | pages = 203–16 | date = August 2016 | pmid = 27318570 | doi = 10.1016/j.compbiomed.2016.06.004 | arxiv = 1604.00627 | bibcode = 2016arXiv160400627M | s2cid = 5874067 }}</ref> While extensive information in healthcare is now electronic, it fits under the big data umbrella as most is unstructured and difficult to use.<ref>{{cite journal | vauthors = Murdoch TB, Detsky AS | title = The inevitable application of big data to health care | journal = JAMA | volume = 309 | issue = 13 | pages = 1351–2 | date = April 2013 | pmid = 23549579 | doi = 10.1001/jama.2013.393 }}</ref> The use of big data in healthcare has raised significant ethical challenges ranging from risks for individual rights, privacy and [[autonomy]], to transparency and trust.<ref>{{cite journal | vauthors = Vayena E, Salathé M, Madoff LC, Brownstein JS | title = 
Ethical challenges of big data in public health | journal = PLOS Computational Biology | volume = 11 | issue = 2 | pages = e1003904 | date = February 2015 | pmid = 25664461 | pmc = 4321985 | doi = 10.1371/journal.pcbi.1003904 | bibcode = 2015PLSCB..11E3904V }}</ref>
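A minimal sketch of the kind of automated accuracy check the passage calls for is to flag records with missing or out-of-range fields. The field names and plausibility ranges below are illustrative assumptions, not from any cited system.

```python
def screen_records(records, valid_ranges):
    """Flag records whose fields are missing or fall outside
    plausibility ranges — a first-pass automated quality check."""
    flagged = []
    for i, rec in enumerate(records):
        problems = [
            field for field, (lo, hi) in valid_ranges.items()
            if rec.get(field) is None or not (lo <= rec[field] <= hi)
        ]
        if problems:
            flagged.append((i, problems))
    return flagged

# Hypothetical vital-sign plausibility ranges
valid_ranges = {"heart_rate": (20, 250), "sys_bp": (50, 260)}
records = [
    {"heart_rate": 72, "sys_bp": 120},    # clean record
    {"heart_rate": 999, "sys_bp": 118},   # likely sensor glitch
    {"heart_rate": 80, "sys_bp": None},   # missing value
]
flagged = screen_records(records, valid_ranges)
```

Such rule-based screening only catches implausible values; the missing-data handling discussed in the cited work requires statistically principled imputation on top of it.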
Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research.<ref>{{Cite journal|last=Copeland|first=CS|date=Jul–Aug 2017|title=Data Driving Discovery|url=http://claudiacopeland.com/uploads/3/5/5/6/35560346/_hjno_data_driving_discovery_2pv.pdf|journal=Healthcare Journal of New Orleans|pages=22–27}}</ref> Then, trends seen in data analysis can be tested in traditional, hypothesis-driven follow-up biological research and eventually clinical research.
A related application sub-area, that heavily relies on big data, within the healthcare field is that of [[computer-aided diagnosis]] in medicine.
<ref name="CAD7challenges">{{cite journal | vauthors = Yanase J, Triantaphyllou E| title = A Systematic Survey of Computer-Aided Diagnosis in Medicine: Past and Present Developments. | journal = Expert Systems with Applications | volume = 138 | pages = 112821 | date = 2019 | doi = 10.1016/j.eswa.2019.112821 | s2cid = 199019309 }}</ref> For instance, for [[epilepsy]] monitoring it is customary to create 5 to 10 GB of data daily.
<ref>{{cite journal | vauthors = Dong X, Bahroos N, Sadhu E, Jackson T, Chukhman M, Johnson R, Boyd A, Hynes D| title = Leverage Hadoop framework for large scale clinical informatics applications | journal = AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science | pages = 53 | date = 2013 | volume = 2013 | pmid = 24303235 }}</ref> Similarly, a single uncompressed image of breast [[tomosynthesis]] averages 450 MB of data.
<ref>{{cite journal | vauthors = Clunie D| title = Breast tomosynthesis challenges digital imaging infrastructure | url = http://www.auntminnie.com/index.aspx?sec=prtf&sub=def&pag=dis&itemId=102872&printpage=true&fsec=ser&fsub=def | date = 2013 }}</ref>
These are just a few of the many examples where [[computer-aided diagnosis]] uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome in order to reach the next level of performance.
<ref>
{{cite journal | vauthors = Yanase J, Triantaphyllou E | title = The Seven Key Challenges for the Future of Computer-Aided Diagnosis in Medicine | journal = International Journal of Medical Informatics| volume = 129 | pages = 413–422 | year = 2019 | doi = 10.1016/j.ijmedinf.2019.06.017 | pmid = 31445285 | s2cid = 198287435 }}
</ref>
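Back-of-the-envelope arithmetic makes these volumes concrete (binary units, 1 TB = 1024 GB, assumed):

```python
# Storage arithmetic from the figures above.
GB_PER_DAY = 10            # upper end of the epilepsy-monitoring range
DAYS = 365
annual_gb = GB_PER_DAY * DAYS   # GB per monitored patient per year
annual_tb = annual_gb / 1024    # roughly 3.6 TB

MB_PER_IMAGE = 450         # one uncompressed breast-tomosynthesis image
images_per_tb = (1024 * 1024) // MB_PER_IMAGE   # images that fit in 1 TB
```

So a single continuously monitored epilepsy patient can generate several terabytes per year, and a terabyte holds only a couple of thousand uncompressed tomosynthesis images, which is why such workloads strain conventional imaging infrastructure.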
===Education===
A [[McKinsey & Company|McKinsey Global Institute]] study found a shortage of 1.5 million highly trained data professionals and managers<ref name="McKinsey"/> and a number of universities<ref>{{cite web
| url=https://www.forbes.com/sites/jmaureenhenderson/2013/07/30/degrees-in-big-data-fad-or-fast-track-to-career-success/
|access-date=21 February 2016
|website=[[Forbes]]
|title=Degrees in Big Data: Fad or Fast Track to Career Success}}</ref>{{better source needed|date=November 2018|reason=www.forbes.com/sites by contributors rather than staff are blogs, not reliable sources for facts.}} including [[University of Tennessee]] and [[UC Berkeley]], have created masters programs to meet this demand. Private boot camps have also developed programs to meet that demand, including free programs like [[The Data Incubator]] or paid programs like [[General Assembly]].<ref>{{cite news
|title=NY gets new boot camp for data scientists: It's free but harder to get into than Harvard
|newspaper=Venture Beat
|access-date=21 February 2016
|url=https://venturebeat.com/2014/04/15/ny-gets-new-bootcamp-for-data-scientists-its-free-but-harder-to-get-into-than-harvard/
}}</ref> In the specific field of marketing, one of the problems stressed by Wedel and Kannan<ref>{{cite journal|last=Wedel|first=Michel|author2=Kannan, PK|title= Marketing Analytics for Data-Rich Environments|journal=Journal of Marketing|year=2016|volume=80|issue=6|doi= 10.1509/jm.15.0413|pages=97–121|s2cid=168410284}}</ref> is that marketing has several subdomains (e.g., advertising, promotions,
product development, branding) that all use different types of data.
===Media===
To understand how the media uses big data, it is first necessary to provide some context on the mechanisms used in the media process. It has been suggested by Nick Couldry and Joseph Turow that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers that have been exclusively gleaned through various [[data-mining]] activities.<ref>{{cite journal|last1=Couldry|first1=Nick|last2=Turow|first2=Joseph|title=Advertising, Big Data, and the Clearance of the Public Realm: Marketers' New Approaches to the Content Subsidy| journal=International Journal of Communication|date=2014|volume=8|pages=1710–1726}}</ref>
* Targeting of consumers (for advertising by marketers)<ref>{{cite web|url=https://ishti.org/2018/04/15/why-digital-advertising-agencies-suck-at-acquisition-and-are-in-dire-need-of-an-ai-assisted-upgrade/|title=Why Digital Advertising Agencies Suck at Acquisition and are in Dire Need of an AI Assisted Upgrade|website=Ishti.org|access-date=15 April 2018|date=15 April 2018|archive-date=12 February 2019|archive-url=https://web.archive.org/web/20190212174722/https://ishti.org/2018/04/15/why-digital-advertising-agencies-suck-at-acquisition-and-are-in-dire-need-of-an-ai-assisted-upgrade/|url-status=dead}}</ref>
* Data capture
* [[Data journalism]]: publishers and journalists use big data tools to provide unique and innovative insights and [[infographic]]s.
[[Channel 4]], the British [[Public service broadcasting in the United Kingdom|public-service]] television broadcaster, is a leader in the field of big data and [[data analysis]].<ref>{{cite web|url=https://www.ibc.org/tech-advances/big-data-and-analytics-c4-and-genius-digital/1076.article |title=Big data and analytics: C4 and Genius Digital|website=Ibc.org |access-date=8 October 2017}}</ref>
===Insurance===
Health insurance providers are collecting data on social "determinants of health" such as food and [[Television consumption|TV consumption]], marital status, clothing size, and purchasing habits, from which they make predictions on health costs, in order to spot health issues in their clients. It is controversial whether these predictions are currently being used for pricing.<ref>{{Cite web|author=Marshall Allen|url=https://www.propublica.org/article/health-insurers-are-vacuuming-up-details-about-you-and-it-could-raise-your-rates| title=Health Insurers Are Vacuuming Up Details About You – And It Could Raise Your Rates|website=www.propublica.org|date=17 July 2018|access-date=21 July 2018}}</ref>
===Internet of things (IoT)===
{{Main|Internet of things}}
{{See|Edge computing}}
Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device inter-connectivity. Such mappings have been used by the media industry, companies, and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical,<ref>{{cite web|url=http://www.businesswire.com/news/home/20170109006500/en/QuiO-Named-Innovation-Champion-Accenture-HealthTech-Innovation|title=QuiO Named Innovation Champion of the Accenture HealthTech Innovation Challenge|website=Businesswire.com|access-date=8 October 2017| date=10 January 2017}}</ref> manufacturing<ref>{{cite web|url= https://www.predix.com/sites/default/files/IDC_OT_Final_whitepaper_249120.pdf |title=A Software Platform for Operational Technology Innovation|website=Predix.com|access-date=8 October 2017}}</ref> and transportation<ref name="BigDataIoT16">{{cite web|url =http://www.wiomax.com/big-data-driven-smart-transportation-the-underlying-big-story-of-smart-iot-transformed-mobility/| title=Big Data Driven Smart Transportation: the Underlying Story of IoT Transformed Mobility| author=Z. Jenipher Wang|date=March 2017}}</ref> contexts.
[[Kevin Ashton]], the digital innovation expert who is credited with coining the term,<ref>{{cite web|url=http://www.rfidjournal.com/articles/view?4986|title=That Internet Of Things Thing.}}</ref> defines the Internet of things in this quote: "If we had computers that knew everything there was to know about things—using data they gathered without any help from us—we would be able to track and count everything, and greatly reduce waste, loss, and cost. We would know when things needed replacing, repairing, or recalling, and whether they were fresh or past their best."
===Information technology===
Especially since 2015, big data has come to prominence within [[business operations]] as a tool to help employees work more efficiently and streamline the collection and distribution of [[information technology]] (IT). The use of big data to resolve IT and data collection issues within an enterprise is called [[IT operations analytics]] (ITOA).<ref name="ITOA1">{{cite web|last1=Solnik|first1=Ray |title=The Time Has Come: Analytics Delivers for IT Operations |url =http://www.datacenterjournal.com/time-analytics-delivers-operations/|website=Data Center Journal| access-date=21 June 2016}}</ref> By applying big data principles into the concepts of [[machine intelligence]] and deep computing, IT departments can predict potential issues and prevent them.<ref name="ITOA1" /> ITOA businesses offer platforms for [[systems management]] that bring [[data silos]] together and generate insights from the whole of the system rather than from isolated pockets of data.
==Case studies==
===Government===
====China====
* The Integrated Joint Operations Platform (IJOP, 一体化联合作战平台) is used by the government to monitor the population, particularly [[Uyghurs]].<ref name="WP8218">{{cite news| url=https://www.washingtonpost.com/opinions/global-opinions/ethnic-cleansing-makes-a-comeback--in-china/2018/08/02/| archive-url=https://web.archive.org/web/20190331161843/https://www.washingtonpost.com/opinions/global-opinions/ethnic-cleansing-makes-a-comeback--in-china/2018/08/02/| url-status=dead| archive-date=31 March 2019|title=Ethnic cleansing makes a comeback – in China|author1=Josh Rogin|date=2 August 2018|access-date=4 August 2018|issue=Washington Post|quote=Add to that the unprecedented security and surveillance state in Xinjiang, which includes all-encompassing monitoring based on identity cards, checkpoints, facial recognition and the collection of DNA from millions of individuals. The authorities feed all this data into an artificial-intelligence machine that rates people's loyalty to the Communist Party in order to control every aspect of their lives.}}</ref> [[Biometrics]], including DNA samples, are gathered through a program of free physicals.<ref name="how022618">{{cite web|url= https://www.hrw.org/news/2018/02/26/china-big-data-fuels-crackdown-minority-region |title=China: Big Data Fuels Crackdown in Minority Region: Predictive Policing Program Flags Individuals for Investigations, Detentions|date=26 February 2018|website=hrw.org|publisher=Human Rights Watch|access-date=4 August 2018}}</ref>
*By 2020, China plans to give all its citizens a personal "social credit" score based on how they behave.<ref>{{cite news |title=Discipline and Punish: The Birth of China's Social-Credit System |url=https://www.thenation.com/article/china-social-credit-system/ |work=The Nation |date=23 January 2019}}</ref> The [[Social Credit System]], now being piloted in a number of Chinese cities, is considered a form of [[Mass surveillance in China|mass surveillance]] which uses big data analysis technology.<ref>{{cite news |title=China's behavior monitoring system bars some from travel, purchasing property |url=https://www.cbsnews.com/news/china-social-credit-system-surveillance-cameras/ |work=CBS News |date=24 April 2018}}</ref>{{Dubious|date=December 2021}}<ref>{{cite magazine |title=The complicated truth about China's social credit system |url=https://www.wired.co.uk/article/china-social-credit-system-explained |magazine=WIRED |date=21 January 2019}}</ref>
====India====
* Big data analysis was tried out for the [[Bharatiya Janata Party|BJP]] to win the 2014 Indian General Election.<ref>{{cite web|url = http://www.livemint.com/Industry/bUQo8xQ3gStSAy5II9lxoK/Are-Indian-companies-making-enough-sense-of-Big-Data.html|title = News: Live Mint|date = 23 June 2014|access-date = 22 November 2014|website = Are Indian companies making enough sense of Big Data?|publisher = Live Mint}}</ref>
* The [[Government of India|Indian government]] uses numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.
====Israel====
* Personalized diabetic treatments can be created through GlucoMe's big data solution.<ref>{{Cite news|url=https://www.timesofisrael.com/israeli-startup-uses-big-data-minimal-hardware-to-treat-diabetes/|title=Israeli startup uses big data, minimal hardware to treat diabetes |work=[[The Times of Israel]] |access-date=28 February 2018}}</ref>
====United Kingdom====
Examples of uses of big data in public services:
* Data on prescription drugs: by connecting the origin, location, and time of each prescription, a research unit was able to demonstrate and examine the considerable delay between the release of any given drug and a UK-wide adaptation of the [[National Institute for Health and Care Excellence]] guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient.{{citation needed|date=January 2021}}<ref>{{Cite journal|last=Singh, Gurparkash, Duane Schulthess, Nigel Hughes, Bart Vannieuwenhuyse, and Dipak Kalra|title=Real world big data for clinical research and drug development|journal=Drug Discovery Today|year=2018|volume=23|issue=3|pages=652–660|doi=10.1016/j.drudis.2017.12.002|pmid=29294362}}</ref>
* Joining up data: a local authority [[Data blending|blended data]] about services, such as road gritting rotas, with services for people at risk, such as [[Meals on Wheels]]. The connection of data allowed the local authority to avoid any weather-related delay.<ref>{{cite web|url= https://www.researchgate.net/publication/297762848|title=Recent advances delivered by Mobile Cloud Computing and Internet of Things for Big Data applications: a survey|date=11 March 2016 |publisher=International Journal of Network Management|access-date=14 September 2016}}</ref>
====United States====
* In 2012, the [[Presidency of Barack Obama|Obama administration]] announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government.<ref name="WH_Big_Data">{{cite web|url=https://obamawhitehouse.archives.gov/blog/2012/03/29/big-data-big-deal |title=Big Data is a Big Deal|last=Kalil| first=Tom|access-date=26 September 2012|via=[[NARA|National Archives]] |work=[[whitehouse.gov]] |date=29 March 2012}}</ref> The initiative is composed of 84 different big data programs spread across six departments.<ref>{{cite web|url =https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_final_1.pdf|archive-url =https://web.archive.org/web/20170121233257/https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_final_1.pdf|url-status =live|archive-date =21 January 2017|title=Big Data Across the Federal Government|last=Executive Office of the President|date=March 2012 |via =[[NARA|National Archives]]|work =[[Office of Science and Technology Policy]]|access-date=26 September 2012}}</ref>
* Big data analysis played a large role in [[Barack Obama]]'s successful [[Barack Obama presidential campaign, 2012|2012 re-election campaign]].<ref name="infoworld_bigdata">{{cite web| url=http://www.infoworld.com/d/big-data/the-real-story-of-how-big-data-analytics-helped-obama-win-212862|title=The real story of how big data analytics helped Obama win|last=Lampitt| first=Andrew |work=[[InfoWorld]]|access-date=31 May 2014|date=14 February 2013}}</ref>
* The [[United States Federal Government]] owns five of the ten most powerful [[supercomputer]]s in the world.<ref>{{Cite web | url=https://www.top500.org/lists/2018/11/ |title = November 2018 | TOP500 Supercomputer Sites}}</ref><ref>{{cite web|url= http://www.informationweek.com/government/enterprise-applications/image-gallery-governments-10-most-powerf/224700271|title=Government's 10 Most Powerful Supercomputers|last=Hoover|first=J. Nicholas |work=Information Week|publisher=UBM|access-date=26 September 2012}}</ref>
* The [[Utah Data Center]] has been constructed by the United States [[National Security Agency]]. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few [[exabyte]]s.<ref>{{cite magazine |url=https://www.wired.com/threatlevel/2012/03/ff_nsadatacenter/all/1| title=The NSA Is Building the Country's Biggest Spy Center (Watch What You Say)|last=Bamford|first=James|date=15 March 2012| magazine=[[Wired (magazine)|Wired]] |access-date=18 March 2013}}</ref><ref>{{cite web|url=http://www.nsa.gov/public_info/press_room/2011/utah_groundbreaking_ceremony.shtml|title=Groundbreaking Ceremony Held for $1.2 Billion Utah Data Center|publisher=National Security Agency Central Security Service|access-date=18 March 2013|archive-url=https://web.archive.org/web/20130905055004/http://www.nsa.gov/public_info/press_room/2011/utah_groundbreaking_ceremony.shtml|archive-date=5 September 2013|url-status=dead}}</ref><ref>{{cite magazine|url= https://www.forbes.com/sites/kashmirhill/2013/07/24/blueprints-of-nsa-data-center-in-utah-suggest-its-storage-capacity-is-less-impressive-than-thought/|title=Blueprints of NSA's Ridiculously Expensive Data Center in Utah Suggest It Holds Less Info Than Thought|last=Hill| first=Kashmir| magazine=Forbes| access-date=31 October 2013}}</ref> This has posed security concerns regarding the anonymity of the data collected.<ref>{{Cite news|url=https://www.huffingtonpost.com/2013/06/12/nsa-big-data_n_3423482.html|title=NSA Spying Controversy Highlights Embrace of Big Data|last1=Smith| first1=Gerry|date=12 June 2013|work=Huffington Post|access-date=7 May 2018|last2=Hallman| first2=Ben}}</ref>
===Retail===
* [[Walmart]] handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data—the equivalent of 167 times the information contained in all the books in the US [[Library of Congress]].{{r|Economist}}
* [[Windermere Real Estate]] uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.<ref>{{cite news| last=Wingfield |first=Nick |url= http://bits.blogs.nytimes.com/2013/03/12/predicting-commutes-more-accurately-for-would-be-home-buyers/ |title=Predicting Commutes More Accurately for Would-Be Home Buyers |work=The New York Times |date=12 March 2013 |access-date=21 July 2013}}</ref>
* FICO Card Detection System protects accounts worldwide.<ref name="fico.com">{{cite web| url=http://www.fico.com/en/Products/DMApps/Pages/FICO-Falcon-Fraud-Manager.aspx |title=FICO® Falcon® Fraud Manager |publisher=Fico.com |access-date=21 July 2013}}</ref>
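The Walmart comparison above can be sanity-checked with simple arithmetic. The sketch below is illustrative only, using the article's own figures (2,560 terabytes and the factor of 167) to recover the implied size of the Library of Congress book data:

```python
# Sanity check of the Walmart storage comparison quoted above,
# using the article's own figures (illustrative arithmetic only).

TERABYTE = 10**12  # bytes, decimal convention

walmart_bytes = 2560 * TERABYTE            # "2.5 petabytes (2560 terabytes)"
library_of_congress = walmart_bytes / 167  # implied size of the LoC book collection

# 2560 / 167 ≈ 15.3 TB, consistent with common estimates of the
# digitized text of the Library of Congress being on the order of tens of TB.
print(f"Implied Library of Congress size: {library_of_congress / TERABYTE:.1f} TB")
```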
===Science===
* The [[Large Hadron Collider]] experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995%<ref>{{cite web|last1=Alexandru|first1=Dan|title=Prof|url=https://cds.cern.ch/record/1504817/files/CERN-THESIS-2013-004.pdf|website=cds.cern.ch|publisher=CERN|access-date=24 March 2015}}</ref> of these streams, there are 1,000 collisions of interest per second.<ref>{{cite web |title=LHC Brochure, English version. A presentation of the largest and the most powerful particle accelerator in the world, the Large Hadron Collider (LHC), which started up in 2008. Its role, characteristics, technologies, etc. are explained for the general public. |url=http://cds.cern.ch/record/1278169?ln=en |work=CERN-Brochure-2010-006-Eng. LHC Brochure, English version. |publisher=CERN |access-date=20 January 2013}}</ref><ref>{{cite web |title=LHC Guide, English version. A collection of facts and figures about the Large Hadron Collider (LHC) in the form of questions and answers. |url=http://cds.cern.ch/record/1092437?ln=en |work=CERN-Brochure-2008-001-Eng. LHC Guide, English version. |publisher=CERN |access-date=20 January 2013}}</ref><ref name="nature">{{cite news |title=High-energy physics: Down the petabyte highway |work= Nature |date= 19 January 2011 |first=Geoff |last=Brumfiel |doi= 10.1038/469282a |volume= 469 |pages= 282–83 |url= http://www.nature.com/news/2011/110119/full/469282a.html |bibcode=2011Natur.469..282B }}</ref>
** As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25-petabyte annual rate before replication ({{as of|2012|lc=y}}), and nearly 200 petabytes after replication.
** If all sensor data were recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 [[exabyte]]s per day, before replication. To put the number in perspective, this is equivalent to 500 [[quintillion]] (5×10<sup>20</sup>) bytes per day, almost 200 times more than all the other sources combined in the world.
* The [[Square Kilometre Array]] is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day.<ref>{{cite web|url= http://www.zurich.ibm.com/pdf/astron/CeBIT+2013+Background+DOME.pdf|title=IBM Research – Zurich| website=Zurich.ibm.com|access-date=8 October 2017}}</ref><ref>{{cite web|url =https://arstechnica.com/science/2012/04/future-telescope-array-drives-development-of-exabyte-processing/|title=Future telescope array drives development of Exabyte processing|work=Ars Technica |date=2 April 2012|access-date=15 April 2015}}</ref> It is considered one of the most ambitious scientific projects ever undertaken.<ref>{{cite web|url=http://theconversation.com/australias-bid-for-the-square-kilometre-array-an-insiders-perspective-4891|title=Australia's bid for the Square Kilometre Array – an insider's perspective|date=1 February 2012|publisher=[[The Conversation (website)|The Conversation]]|access-date=27 September 2016}}</ref>
* When the [[Sloan Digital Sky Survey]] (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information.<ref name="Economist">{{cite news |title=Data, data everywhere |url=http://www.economist.com/node/15557443 |newspaper=The Economist |date=25 February 2010 |access-date=9 December 2012}}</ref> When the [[Large Synoptic Survey Telescope]], successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days.{{r|Economist}}
*[[Human Genome Project|Decoding the human genome]] originally took 10 years to process; now it can be achieved in less than a day. DNA sequencers have reduced the sequencing cost by a factor of 10,000 over the last ten years, 100 times more than the cost reduction predicted by [[Moore's law]].<ref>{{cite web|url=http://www.oecd.org/sti/ieconomy/Session_3_Delort.pdf#page=6|title=Delort P., OECD ICCP Technology Foresight Forum, 2012.|website=Oecd.org|access-date=8 October 2017}}</ref>
* The [[NASA]] Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.<ref>{{cite web|url=http://www.nasa.gov/centers/goddard/news/releases/2010/10-051.html|title=NASA – NASA Goddard Introduces the NASA Center for Climate Simulation|website=Nasa.gov|access-date=13 April 2016}}</ref><ref>{{cite web|last=Webster |first=Phil|title=Supercomputing the Climate: NASA's Big Data Mission| url=http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission |work=CSC World|publisher=Computer Sciences Corporation|access-date=18 January 2013|url-status=dead| archive-url =https://web.archive.org/web/20130104220150/http://www.csc.com/cscworld/publications/81769/81773-supercomputing_the_climate_nasa_s_big_data_mission|archive-date=4 January 2013}}</ref>
* Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any "friction points", or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to scale social experiments that would usually take years, instantly.<ref>{{cite news| url=https://www.theglobeandmail.com/life/health-and-fitness/health/these-six-great-neuroscience-ideas-could-make-the-leap-from-lab-to-market/article21681731/|title=These six great neuroscience ideas could make the leap from lab to market|date=20 November 2014|work=[[The Globe and Mail]]|access-date=1 October 2016}}</ref><ref>{{cite web|url=https://cloud.google.com/customers/dnastack/|title=DNAstack tackles massive, complex DNA datasets with Google Genomics|publisher=Google Cloud Platform |access-date=1 October 2016}}</ref>
* [[23andme]]'s [[DNA database]] contains the genetic information of over 1,000,000 people worldwide.<ref>{{cite web|title=23andMe – Ancestry|url=https://www.23andme.com/en-int/ancestry/| website=23andme.com| access-date=29 December 2016}}</ref> The company explores selling the "anonymous aggregated genetic data" to other researchers and pharmaceutical companies for research purposes if patients give their consent.<ref name=verge1>{{cite web|last1=Potenza|first1=Alessandra| title=23andMe wants researchers to use its kits, in a bid to expand its collection of genetic data|url=https://www.theverge.com/2016/7/13/12166960/23andme-genetic-testing-database-genotyping-research|website=The Verge|access-date=29 December 2016|date=13 July 2016}}</ref><ref>{{cite magazine| title=This Startup Will Sequence Your DNA, So You Can Contribute To Medical Research |url= https://www.fastcompany.com/3066775/innovation-agents/this-startup-will-sequence-your-dna-so-you-can-contribute-to-medical-resea|magazine=[[Fast Company]]|access-date=29 December 2016|date=23 December 2016}}</ref><ref>{{cite magazine|last1=Seife|first1=Charles|title=23andMe Is Terrifying, but Not for the Reasons the FDA Thinks|url=https://www.scientificamerican.com/article/23andme-is-terrifying-but-not-for-the-reasons-the-fda-thinks/|magazine=[[Scientific American]]|access-date=29 December 2016}}</ref><ref>{{cite web|last1=Zaleski|first1=Andrew|title=This biotech start-up is betting your genes will yield the next wonder drug|url=https://www.cnbc.com/2016/06/22/23andme-thinks-your-genes-are-the-key-to-blockbuster-drugs.html|publisher=CNBC|access-date=29 December 2016|date=22 June 2016}}</ref><ref>{{cite magazine|last1=Regalado|first1=Antonio|title=How 23andMe turned your DNA into a $1 billion drug discovery machine|url=https://www.technologyreview.com/s/601506/23andme-sells-data-for-drug-search/|magazine=[[MIT Technology Review]]|access-date=29 December 2016}}</ref> Ahmad Hariri, professor of psychology and 
neuroscience at [[Duke University]] who has been using 23andMe in his research since 2009, states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists.<ref name=verge1/> A study that identified 15 genome sites linked to depression in 23andMe's database led to a surge in demands to access the repository, with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.<ref>{{cite web|title=23andMe reports jump in requests for data in wake of Pfizer depression study {{!}} FierceBiotech |url =http://www.fiercebiotech.com/it/23andme-reports-jump-requests-for-data-wake-pfizer-depression-study| website=fiercebiotech.com|access-date=29 December 2016}}</ref>
*Computational fluid dynamics ([[Computational fluid dynamics|CFD]]) and hydrodynamic [[turbulence]] research generate massive data sets. The Johns Hopkins Turbulence Databases ([http://turbulence.pha.jhu.edu JHTDB]) contains over 350 terabytes of spatiotemporal fields from Direct Numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using "virtual sensors" with various access modes ranging from direct web-browser queries, access through Matlab, Python, Fortran and C programs executing on clients' platforms, to cut out services to download raw data. The data have been used in over 150 scientific publications.
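The LHC data-rate claims in the list above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only, converting the article's "150 million petabytes annual rate" into the daily figures quoted ("nearly 500 exabytes per day", on the order of 5×10<sup>20</sup> bytes):

```python
# Back-of-the-envelope check of the unfiltered LHC data-flow figures
# quoted above (illustrative arithmetic based on the article's numbers,
# not official CERN measurements).

PETABYTE = 10**15  # bytes
EXABYTE = 10**18   # bytes

# "150 million petabytes annual rate" if every sensor stream were recorded
annual_bytes = 150_000_000 * PETABYTE
daily_bytes = annual_bytes / 365

# ~4.1e20 bytes/day, i.e. on the order of 5×10^20 ("500 quintillion")
print(f"{daily_bytes:.1e} bytes/day")
# ~411 exabytes/day, matching the rounded "nearly 500 exabytes per day"
print(f"{daily_bytes / EXABYTE:.0f} exabytes/day")
```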
===Sports===
Big data can be used to improve training and to understand competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics.<ref>{{cite web|url=http://www.itweb.co.za/index.php?option=com_content&view=article&id=147241|title=Data scientists predict Springbok defeat |author=Admire Moyo| work=itweb.co.za|date=23 October 2015 |access-date=12 December 2015}}</ref>
Future performance of players can be predicted as well; players' value and salary are determined by data collected throughout the season.<ref>{{cite web|url=http://www.itweb.co.za/index.php?option=com_content&view=article&id=147852|title=Predictive analytics, big data transform sports|author=Regina Pazvakavambwa|work=itweb.co.za|date=17 November 2015|access-date=12 December 2015}}</ref>
In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency.<ref>{{cite web|url=https://www.huffingtonpost.com/dave-ryan/sports-where-big-data-fin_b_8553884.html|title= Sports: Where Big Data Finally Makes Sense |author=Dave Ryan| work=huffingtonpost.com |date= 13 November 2015 |access-date=12 December 2015}}</ref>
Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, race teams use big data to predict their finishing time in advance, based on simulations using data collected over the season.<ref>{{cite magazine|url=https://www.forbes.com/sites/frankbi/2014/11/13/how-formula-one-teams-are-using-big-data-to-get-the-inside-edge/|title= How Formula One Teams Are Using Big Data To Get The Inside Edge|author=Frank Bi|magazine=Forbes|access-date=12 December 2015}}</ref>
===Technology===
* [[eBay.com]] uses two [[data warehouse]]s at 7.5 [[petabytes]] and 40PB as well as a 40PB [[Hadoop]] cluster for search, consumer recommendations, and merchandising.<ref>{{cite web | last=Tay | first=Liz |url=http://www.itnews.com.au/news/inside-ebay8217s-90pb-data-warehouse-342615 | title=Inside eBay's 90PB data warehouse | publisher=ITNews | access-date=12 February 2016}}</ref>
* [[Amazon.com]] handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and {{as of|2005|lc=on}} they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.<ref>{{cite web|last=Layton |first=Julia |url= http://money.howstuffworks.com/amazon1.htm | title=Amazon Technology |date=25 January 2006 |publisher= Money.howstuffworks.com |access-date=5 March 2013}}</ref>
* [[Facebook]] handles 50 billion photos from its user base.<ref>{{cite web| url=https://www.facebook.com/notes/facebook-engineering/scaling-facebook-to-500-million-users-and-beyond/409881258919 |title=Scaling Facebook to 500 Million Users and Beyond |publisher =Facebook.com |access-date=21 July 2013}}</ref> {{as of|2017|June}}, Facebook reached 2 billion [[monthly active users]].<ref>{{Cite news|url=https://techcrunch.com/2017/06/27/facebook-2-billion-users/| title=Facebook now has 2 billion monthly users… and responsibility| last=Constine| first=Josh |date=27 June 2017|work=TechCrunch|access-date=3 September 2018}}</ref>
* [[Google]] was handling roughly 100 billion searches per month {{as of|2012|08|lc=on}}.<ref>{{cite web|url=http://searchengineland.com/google-1-trillion-searches-per-year-212940|title=Google Still Doing at Least 1 Trillion Searches Per Year|date=16 January 2015|work=Search Engine Land|access-date=15 April 2015}}</ref>
===COVID-19===
During the [[COVID-19 pandemic]], big data was raised as a way to minimise the impact of the disease. Significant applications of big data included minimising the spread of the virus, case identification and development of medical treatment.<ref>{{cite journal |last1=Haleem |first1=Abid |last2=Javaid |first2=Mohd |last3=Khan |first3=Ibrahim |last4=Vaishya |first4=Raju |title=Significant Applications of Big Data in COVID-19 Pandemic |journal=Indian Journal of Orthopaedics |date=2020 |volume=54 |issue=4 |pages=526–528 |doi=10.1007/s43465-020-00129-z |pmid=32382166 |pmc=7204193 }}</ref>
Governments used big data to track infected people to minimise spread. Early adopters included China, Taiwan, South Korea, and Israel.<ref>{{cite news |last1=Manancourt |first1=Vincent |title=Coronavirus tests Europe's resolve on privacy |url=https://www.politico.eu/article/coronavirus-tests-europe-resolve-on-privacy-tracking-apps-germany-italy/ |access-date=30 October 2020 |work=Politico |date=10 March 2020}}</ref><ref>{{cite news |last1=Choudhury |first1=Amit Roy |title=Gov in the Time of Corona |url=https://govinsider.asia/innovation/gov-in-the-time-of-corona/ |access-date=30 October 2020 |work=Gov Insider |date=27 March 2020}}</ref><ref>{{cite news |last1=Cellan-Jones |first1=Rory |title=China launches coronavirus 'close contact detector' app |url=https://www.bbc.com/news/technology-51439401 |access-date=30 October 2020 |work=BBC |date=11 February 2020|archive-url=https://web.archive.org/web/20200228003957/https://www.bbc.com/news/technology-51439401 |archive-date=28 February 2020 }}</ref>
==Research activities==
Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach of the [[MIT Computer Science and Artificial Intelligence Laboratory]] and Amir Esmailpour of the UNH Research Group investigated the key features of big data, namely the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in encrypted form at the cloud interface, providing raw definitions and real-time examples within the technology. They also proposed an approach for identifying encoding techniques to enable expedited search over encrypted text, leading to security enhancements in big data.<ref>{{cite conference |url=http://asee-ne.org/proceedings/2014/Student%20Papers/210.pdf |title=Encrypted Search & Cluster Formation in Big Data |last1=Siwach |first1=Gautam |last2=Esmailpour |first2=Amir |date=March 2014 |conference=ASEE 2014 Zone I Conference |conference-url=http://ubconferences.org/ |location=[[University of Bridgeport]], [[Bridgeport, Connecticut|Bridgeport]], Connecticut, US |access-date=26 July 2014 |archive-url=https://web.archive.org/web/20140809045242/http://asee-ne.org/proceedings/2014/Student%20Papers/210.pdf |archive-date=9 August 2014 |url-status=dead }}</ref>
In March 2012, The White House announced a national "Big Data Initiative" that consisted of six federal departments and agencies committing more than $200 million to big data research projects.<ref>{{cite web|title=Obama Administration Unveils "Big Data" Initiative:Announces $200 Million in New R&D Investments| url=https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf |url-status =live| archive-url =https://web.archive.org/web/20170121233309/https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf |via=[[NARA|National Archives]]|work=[[Office of Science and Technology Policy]]|archive-date=21 January 2017}}</ref>
The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over five years to the AMPLab<ref>{{cite web|url=http://amplab.cs.berkeley.edu |title=AMPLab at the University of California, Berkeley |publisher=Amplab.cs.berkeley.edu |access-date=5 March 2013}}</ref> at the University of California, Berkeley.<ref>{{cite web |title=NSF Leads Federal Efforts in Big Data|date=29 March 2012|publisher=National Science Foundation (NSF) |url= https://www.nsf.gov/news/news_summ.jsp?cntn_id=123607&org=NSF&from=news}}</ref> The AMPLab also received funds from [[DARPA]] and over a dozen industrial sponsors, and uses big data to attack a wide range of problems, from predicting traffic congestion<ref>{{cite conference| url=https://amplab.cs.berkeley.edu/publication/scaling-the-mobile-millennium-system-in-the-cloud-2/|author1=Timothy Hunter|date=October 2011|author2=Teodor Moldovan|author3=Matei Zaharia| author4 =Justin Ma|author5=Michael Franklin|author6-link=Pieter Abbeel|author6=Pieter Abbeel|author7=Alexandre Bayen |title=Scaling the Mobile Millennium System in the Cloud}}</ref> to fighting cancer.<ref>{{cite news|title=Computer Scientists May Have What It Takes to Help Cure Cancer|author=David Patterson|work=The New York Times| date=5 December 2011 |url=https://www.nytimes.com/2011/12/06/science/david-patterson-enlist-computer-scientists-in-cancer-fight.html}}</ref>
The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over five years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute,<ref>{{cite web|title=Secretary Chu Announces New Institute to Help Scientists Improve Massive Data Set Research on DOE Supercomputers |publisher=energy.gov |url=http://energy.gov/articles/secretary-chu-announces-new-institute-help-scientists-improve-massive-data-set-research-doe}}</ref> led by the Energy Department's [[Lawrence Berkeley National Laboratory]]. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the department's supercomputers.
The U.S. state of [[Massachusetts]] announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions.<ref>{{Cite news|last=Young|first=Shannon|date=2012-05-30|title=Mass. governor, MIT announce big data initiative|work=Boston.com|url=http://archive.boston.com/news/local/massachusetts/articles/2012/05/30/mass_gov_and_mit_to_announce_data_initiative/|access-date=2021-07-29}}</ref> The [[Massachusetts Institute of Technology]] hosts the Intel Science and Technology Center for Big Data in the [[MIT Computer Science and Artificial Intelligence Laboratory]], combining government, corporate, and institutional funding and research efforts.<ref>{{cite web|url=http://bigdata.csail.mit.edu/ |title=Big Data @ CSAIL |publisher= Bigdata.csail.mit.edu |date=22 February 2013 |access-date=5 March 2013}}</ref>
The European Commission is funding the two-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for [[Horizon 2020]], their next [[Framework Programmes for Research and Technological Development|framework program]].<ref>{{cite web |url=https://cordis.europa.eu/project/id/318062 |title=Big Data Public Private Forum |publisher=cordis.europa.eu |date=1 September 2012 |access-date=16 March 2020 }}</ref>
The British government announced in March 2014 the founding of the [[Alan Turing Institute]], named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets.<ref>{{cite news|url=https://www.bbc.co.uk/news/technology-26651179|title=Alan Turing Institute to be set up to research big data|work=[[BBC News]]|access-date=19 March 2014|date=19 March 2014}}</ref>
At the [[University of Waterloo Stratford Campus]] Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world.<ref>{{cite web|url= http://www.betakit.com/event/inspiration-day-at-university-of-waterloo-stratford-campus/| title=Inspiration day at University of Waterloo, Stratford Campus |publisher=betakit.com/ |access-date=28 February 2014}}</ref>
[[Computational social science|Computational social sciences]] – Anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences.<ref name=pigdata>{{cite journal|last=Reips|first=Ulf-Dietrich|author2=Matzat, Uwe |title=Mining "Big Data" using Big Data Services |journal=International Journal of Internet Science |year=2014|volume=1|issue=1|pages=1–8 | url=http://www.ijis.net/ijis9_1/ijis9_1_editorial_pre.html}}</ref> Often these APIs are provided for free.<ref name="pigdata" /> [[Tobias Preis]] et al. used [[Google Trends]] data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviors and real-world economic indicators.<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE, Bishop SR | title = Quantifying the advantage of looking forward | journal = Scientific Reports | volume = 2 | pages = 350 | year = 2012 | pmid = 22482034 | pmc = 3320057 | doi = 10.1038/srep00350 | bibcode = 2012NatSR...2E.350P }}</ref><ref>{{cite news | url=https://www.newscientist.com/article/dn21678-online-searches-for-future-linked-to-economic-success.html | title=Online searches for future linked to economic success |first=Paul |last=Marks |work=New Scientist | date=5 April 2012 | access-date=9 April 2012}}</ref><ref>{{cite news | url=https://arstechnica.com/gadgets/news/2012/04/google-trends-reveals-clues-about-the-mentality-of-richer-nations.ars | title=Google Trends reveals clues about the mentality of richer nations |first=Casey |last=Johnston |work=Ars Technica | date=6 April 2012 | access-date=9 April 2012}}</ref> The authors of the study examined Google query logs, computing the ratio of the volume of searches for the coming year (2011) to the volume of searches for the previous year (2009), which they call the "[[future orientation index]]".<ref>{{cite web | url = http://www.tobiaspreis.de/bigdata/future_orientation_index.pdf | title = Supplementary Information: The Future Orientation Index is available for download | author = Tobias Preis | date = 24 May 2012 | access-date = 24 May 2012}}</ref> They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP.
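The future orientation index is just a ratio of search volumes per country, which makes the computation itself trivial to sketch. The country labels and search counts below are invented; the real study used Google Trends data for 45 countries.

```python
# Hypothetical sketch of the "future orientation index": for each country,
# the ratio of search volume for the coming year ("2011") to search volume
# for the previous year ("2009"). All counts are invented.

searches = {
    # country: (searches for "2011", searches for "2009")
    "A": (120_000, 60_000),
    "B": (80_000, 100_000),
    "C": (150_000, 50_000),
}

def future_orientation_index(forward, backward):
    """Ratio > 1 means users search more about the future than the past."""
    return forward / backward

index = {c: future_orientation_index(f, b) for c, (f, b) in searches.items()}
for country, foi in sorted(index.items(), key=lambda kv: -kv[1]):
    print(country, round(foi, 2))
```

The study's substantive step, correlating this index with per capita GDP across countries, would follow the same pattern with a standard correlation coefficient over the two per-country series.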
[[Tobias Preis]] and his colleagues Helen Susannah Moat and [[H. Eugene Stanley]] introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends.<ref>{{cite journal | url =http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball | journal=Nature | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013| author-link=Philip Ball }}</ref> Their analysis of [[Google]] search volume for 98 terms of varying financial relevance, published in ''[[Scientific Reports]]'',<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE | title = Quantifying trading behavior in financial markets using Google Trends | journal = Scientific Reports | volume = 3 | pages = 1684 | year = 2013 | pmid = 23619126 | pmc = 3635219 | doi = 10.1038/srep01684 | bibcode = 2013NatSR...3E1684P }}</ref> suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.<ref>{{cite news | url=http://bits.blogs.nytimes.com/2013/04/26/google-search-terms-can-predict-stock-market-study-finds/ | title= Google Search Terms Can Predict Stock Market, Study Finds | author=Nick Bilton | work=[[The New York Times]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite magazine | url=http://business.time.com/2013/04/26/trouble-with-your-investment-portfolio-google-it/ | title=Trouble With Your Investment Portfolio? Google It! 
| author=Christopher Matthews | magazine=[[Time (magazine)|Time]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite journal | url= http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball |journal=[[Nature (journal)|Nature]] | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013}}</ref><ref>{{cite news | url=http://www.businessweek.com/articles/2013-04-25/big-data-researchers-turn-to-google-to-beat-the-markets | title='Big Data' Researchers Turn to Google to Beat the Markets | author=Bernhard Warner | work=[[Bloomberg Businessweek]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url=https://www.independent.co.uk/news/business/comment/hamish-mcrae/hamish-mcrae-need-a-valuable-handle-on-investor-sentiment-google-it-8590991.html | title=Hamish McRae: Need a valuable handle on investor sentiment? Google it | author=Hamish McRae | work=[[The Independent]] | date=28 April 2013 | access-date=9 August 2013 | location=London}}</ref><ref>{{cite web | url=http://www.ft.com/intl/cms/s/0/e5d959b8-acf2-11e2-b27f-00144feabdc0.html | title= Google search proves to be new word in stock market prediction | author=Richard Waters | work=[[Financial Times]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url =https://www.bbc.co.uk/news/science-environment-22293693 | title=Google searches predict market moves | author=Jason Palmer | work=[[BBC]] | date=25 April 2013 | access-date=9 August 2013}}</ref>
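A trading rule in the spirit of the one described above can be sketched as: compare this week's search volume to a short moving average of the preceding weeks, and go short when interest is rising, long when it is falling. This is a simplified, hedged illustration with invented numbers, not the authors' exact backtest.

```python
# Hedged sketch of a Google-Trends-style trading rule: if this week's
# search volume exceeds the average of the previous k weeks, take a short
# position (-1) for the next week; otherwise go long (+1). Data invented.

def trend_positions(volumes, k=3):
    """Return +1 (long) / -1 (short) signals for weeks k..len(volumes)-1."""
    signals = []
    for t in range(k, len(volumes)):
        moving_avg = sum(volumes[t - k:t]) / k
        signals.append(-1 if volumes[t] > moving_avg else +1)
    return signals

volumes = [10, 12, 11, 15, 9, 8, 14]   # weekly search volume (invented)
print(trend_positions(volumes))
```

Evaluating such a rule would then multiply each signal by the following week's market return and accumulate the result; the choice of `k` and of the search terms drives most of the outcome.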
Big data sets come with algorithmic challenges that previously did not exist. Hence, some see a need to fundamentally change the way such data are processed.<ref>E. Sejdić (March 2014). "Adapt current tools for use with big data". ''Nature''. '''507''' (7492): 306.</ref>
The Workshops on Algorithms for Modern Massive Data Sets (MMDS) bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to discuss the algorithmic challenges of big data.<ref>Stanford. [https://web.stanford.edu/group/mmds/ "MMDS. Workshop on Algorithms for Modern Massive Data Sets"].</ref> Where big data is concerned, such notions of magnitude are relative. As one review puts it, "If the past is of any guidance, then today's big data most likely will not be considered as such in the near future."<ref name=CAD7challenges/>
===Sampling big data===
A research question asked about big data sets is whether it is necessary to look at the full data to draw certain conclusions about the properties of the data, or whether a sample is good enough. The name big data itself contains a term related to size, and this is an important characteristic of big data. But [[Sampling (statistics)|sampling]] enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. In manufacturing, different types of sensory data, such as acoustics, vibration, pressure, current, voltage, and controller data, are available at short time intervals. To predict downtime it may not be necessary to look at all the data; a sample may be sufficient. Big data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data. With large sets of data points, marketers are able to create and use more customized segments of consumers for more strategic targeting.
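One standard way to draw a fixed-size uniform sample from a data set too large to hold in memory, such as a high-frequency sensor stream, is reservoir sampling. A minimal sketch, with the stream stood in for by a simple range of readings:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # each item survives with prob. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. sample 5 readings from a million-point sensor stream (simulated here)
sample = reservoir_sample(range(1_000_000), k=5)
print(sample)
```

The appeal for big data is that the algorithm makes a single pass, uses O(k) memory, and never needs to know the stream's length in advance.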
There has been some work done in sampling algorithms for big data. A theoretical formulation for sampling Twitter data has been developed.<ref>{{cite conference |author1=Deepan Palguna |author2= Vikas Joshi |author3=Venkatesan Chakravarthy |author4=Ravi Kothari |author5=L. V. Subramaniam |name-list-style=amp | title=Analysis of Sampling Algorithms for Twitter | journal=[[International Joint Conference on Artificial Intelligence]] | year=2015 }}</ref>
==Critique==
Critiques of the big data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done.<ref name="Kimble and Milolidakis (2015)">{{Cite Q|Q56532925}}</ref> One approach to this criticism is the field of [[critical data studies]].
===Critiques of the big data paradigm===
"A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data."<ref name="Editorial" /> In their critique, Snijders, Matzat, and [[Ulf-Dietrich Reips|Reips]] point out that often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at [[Chris Anderson (writer)|Chris Anderson]]'s assertion that big data will spell the end of theory:<ref>{{Cite magazine|url=https://www.wired.com/science/discoveries/magazine/16-07/pb_theory|title=The End of Theory: The Data Deluge Makes the Scientific Method Obsolete|author=Chris Anderson|date=23 June 2008|magazine=Wired}}</ref> focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts.<ref>{{cite news |author=Graham M. |title=Big data and the end of theory? |newspaper=The Guardian |url= https://www.theguardian.com/news/datablog/2012/mar/09/big-data-theory |location=London |date=9 March 2012}}</ref> Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analyzed, must be complemented by "big judgment", according to an article in the ''[[Harvard Business Review]]''.<ref>{{cite journal|title=Good Data Won't Guarantee Good Decisions |journal=[[Harvard Business Review]]|url=http://hbr.org/2012/04/good-data-wont-guarantee-good-decisions/ar/1|author=Shah, Shvetank|author2=Horne, Andrew|author3=Capellá, Jaime |access-date=8 September 2012|date=April 2012}}</ref>
Much in the same line, it has been pointed out that the decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is".<ref name="HilbertBigData2013">Hilbert, M. (2016). Big Data for Development: A Review of Promises and Challenges. Development Policy Review, 34(1), 135–174. https://doi.org/10.1111/dpr.12142 free access: https://www.martinhilbert.net/big-data-for-development/</ref> Fed by a large amount of data on past experiences, algorithms can predict future development if the future is similar to the past.<ref name="HilbertTEDx">[https://www.youtube.com/watch?v=UXef6yfJZAI Big Data requires Big Visions for Big Change.], Hilbert, M. (2014). London: TEDx UCL, x=independently organized TED talks</ref> If the system's dynamics change in the future (if it is not a [[stationary process]]), the past can say little about the future. In order to make predictions in changing environments, it would be necessary to have a thorough understanding of the system's dynamics, which requires theory.<ref name="HilbertTEDx"/> In response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge".<ref>{{cite journal|last=Alemany Oliver|first=Mathieu |author2=Vayre, Jean-Sebastien |s2cid=111360835 |title= Big Data and the Future of Knowledge Production in Marketing Research: Ethics, Digital Traces, and Abductive Reasoning|journal=Journal of Marketing Analytics |year=2015|volume=3|issue=1|doi= 10.1057/jma.2015.1|pages=5–13}}</ref>
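The stationarity point can be made concrete with a toy simulation (not from the article; all numbers are illustrative): a predictor fitted to abundant past data performs well only while the process that generated the data stays the same.

```python
import random

random.seed(42)

# Simulated "past": a stationary series fluctuating around 10.
past = [10 + random.gauss(0, 1) for _ in range(500)]

# A naive data-driven predictor: forecast the historical mean.
forecast = sum(past) / len(past)

# Two possible futures: same dynamics, or a regime shift to a mean of 15.
future_same = [10 + random.gauss(0, 1) for _ in range(500)]
future_shifted = [15 + random.gauss(0, 1) for _ in range(500)]

def mean_abs_error(series, prediction):
    return sum(abs(x - prediction) for x in series) / len(series)

err_same = mean_abs_error(future_same, forecast)
err_shifted = mean_abs_error(future_shifted, forecast)

# More past data sharpens the estimate of the old mean, but no amount
# of it rescues the forecast once the dynamics have changed.
print(f"error (stationary future): {err_same:.2f}")
print(f"error (after regime shift): {err_shifted:.2f}")
```

However large `past` is made, the error after the shift stays roughly the size of the shift itself, which is the point of the critique.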
Additionally, it has been suggested to combine big data approaches with computer simulations, such as [[agent-based model]]s<ref name="HilbertBigData2013" /> and [[complex systems]]. Agent-based models are becoming increasingly better at predicting the outcome of social complexities of even unknown future scenarios through computer simulations that are based on a collection of mutually interdependent algorithms.<ref>{{cite web|url= https://www.theatlantic.com/magazine/archive/2002/04/seeing-around-corners/302471/| title=Seeing Around Corners|author=Jonathan Rauch|date=1 April 2002|work=[[The Atlantic]]}}</ref><ref>Epstein, J. M., & Axtell, R. L. (1996). Growing Artificial Societies: Social Science from the Bottom Up. A Bradford Book.</ref> Finally, the use of multivariate methods that probe for the latent structure of the data, such as [[factor analysis]] and [[cluster analysis]], has proven useful as an analytic approach that goes well beyond the bi-variate approaches (e.g. [[Contingency table|contingency tables]]) typically employed with smaller data sets.
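As a minimal illustration of what "probing for latent structure" means (a sketch with made-up data, not a method endorsed by the cited authors), a bare-bones k-means cluster analysis in Python can recover two hidden groups that no single pair of variables would reveal in a cross-tabulation:

```python
import math
import random

random.seed(0)

# Two latent groups in 2-D: structure a bivariate contingency table
# would miss, but that cluster analysis can recover.
group_a = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)]
group_b = [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(50)]
points = group_a + group_b

def kmeans(data, k, iterations=20):
    """A deliberately minimal k-means: assign, re-center, repeat."""
    # Deterministic start for this k=2 toy: first and last data point.
    centers = [data[0], data[-1]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print([len(cl) for cl in clusters])
```

Production work would use a tested library implementation; the sketch only shows the idea of letting the data's latent grouping emerge from a multivariate view.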
In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis.<ref>{{cite web|url=http://www.bigdataparis.com/documents/Pierre-Delort-INSERM.pdf#page=5| title=Delort P., Big data in Biosciences, Big Data Paris, 2012|website =Bigdataparis.com |access-date=8 October 2017}}</ref>
A new postulate is accepted now in biosciences: the information provided by the data in huge volumes ([[omics]]) without prior hypothesis is complementary and sometimes necessary to conventional approaches based on experimentation.<ref>{{cite web|url=https://www.cs.cmu.edu/~durand/03-711/2011/Literature/Next-Gen-Genomics-NRG-2010.pdf|title=Next-generation genomics: an integrative approach|date=July 2010|publisher=nature|access-date=18 October 2016}}</ref><ref>{{cite web|url= https://www.researchgate.net/publication/283298499|title=Big Data in Biosciences| date=October 2015|access-date=18 October 2016}}</ref> In the massive approaches it is the formulation of a relevant hypothesis to explain the data that is the limiting factor.<ref>{{cite news|url=https://next.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0|title=Big data: are we making a big mistake?|date=28 March 2014|work=Financial Times|access-date=20 October 2016}}</ref> The search logic is reversed and the limits of induction ("Glory of Science and Philosophy scandal", [[C. D. Broad]], 1926) are to be considered.{{Citation needed|date=April 2015}}
[[Consumer privacy|Privacy]] advocates are concerned about the threat to privacy represented by increasing storage and integration of [[personally identifiable information]]; expert panels have released various policy recommendations to conform practice to expectations of privacy.<ref>{{cite magazine |first=Paul |last=Ohm |title=Don't Build a Database of Ruin |magazine=Harvard Business Review |url=http://blogs.hbr.org/cs/2012/08/dont_build_a_database_of_ruin.html|date=23 August 2012 }}</ref> The misuse of big data in several cases by media, companies, and even the government has allowed for abolition of trust in almost every fundamental institution holding up society.<ref>Bond-Graham, Darwin (2018). [https://www.theperspective.com/debates/the-perspective-on-big-data/ "The Perspective on Big Data"]. [[The Perspective]].</ref>
Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in the context of big data and giant corporations that own vast amounts of information, and that the use of big data should be monitored and better regulated at the national and international levels.<ref>{{Cite news|url=http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|title=The Social Contract 2.0: Big Data and the Need to Guarantee Privacy and Civil Liberties – Harvard International Review|last=Al-Rodhan|first=Nayef|date=16 September 2014|work=Harvard International Review|access-date=3 April 2017|archive-url=https://web.archive.org/web/20170413090835/http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|archive-date=13 April 2017|url-status=dead}}</ref> Barocas and Nissenbaum argue that one way of protecting individual users is by being informed about the types of information being collected, with whom it is shared, under what constraints and for what purposes.<ref>{{Cite book|title=Big Data's End Run around Anonymity and Consent| last1 =Barocas |first1=Solon |last2=Nissenbaum |first2=Helen|last3=Lane|first3=Julia|last4=Stodden|first4=Victoria|last5=Bender|first5=Stefan|last6=Nissenbaum|first6=Helen| s2cid =152939392|date=June 2014| publisher =Cambridge University Press|isbn=9781107067356|pages=44–75|doi =10.1017/cbo9781107590205.004}}</ref>
===Critiques of the "V" model===
The "V" model of big data is concerning as it centers around computational scalability and lacks in a loss around the perceptibility and understandability of information. This led to the framework of [[cognitive big data]], which characterizes big data applications according to:<ref>{{Cite journal|last1=Lugmayr|first1=Artur|last2=Stockleben|first2=Bjoern|last3=Scheib|first3=Christoph|last4=Mailaparampil|first4=Mathew|last5=Mesia|first5=Noora|last6=Ranta|first6=Hannu|last7=Lab|first7=Emmi|date=1 June 2016|title=A Comprehensive Survey On Big-Data Research and Its Implications – What is Really 'New' in Big Data? – It's Cognitive Big Data! |url=https://www.researchgate.net/publication/304784955}}</ref>
* Data completeness: understanding of the non-obvious from data
* Data correlation, causation, and predictability: causality as not an essential requirement to achieve predictability
* Explainability and interpretability: humans desire to understand and accept what they understand, which algorithms do not address
* Level of automated decision making: algorithms that support automated decision making and algorithmic self-learning
The "V" model of big data is concerning as it centers around computational scalability and lacks in a loss around the perceptibility and understandability of information. This led to the framework of cognitive big data, which characterizes big data applications according to:
* Data completeness: understanding of the non-obvious from data
* Data correlation, causation, and predictability: causality as not essential requirement to achieve predictability
* Explainability and interpretability: humans desire to understand and accept what they understand, where algorithms do not cope with this
* Level of automated decision making: algorithms that support automated decision making and algorithmic self-learning
= = = 对“ v”模型的批评 = = = 大数据的“ v”模型关注的是它围绕着计算的可扩展性,缺乏围绕信息的可感知性和可理解性的损失。这导致了认知大数据的框架,它描述了大数据应用的特征:
* 数据的完整性: 从数据中理解不明显的东西
* 数据的相关性、因果关系和可预测性: 因果关系不是实现可预测性和可解释性的必要条件
* 解释性和可解释性: 人类渴望理解和接受他们所理解的东西,而算法不能处理这个
* 自动决策层: 支持自动决策和自我学习的算法
===Critiques of novelty===
Large data sets have been analyzed by computing machines for well over a century, including the US census analytics performed by [[IBM]]'s punch-card machines which computed statistics including means and variances of populations across the whole continent. In more recent decades, science experiments such as [[CERN]] have produced data on similar scales to current commercial "big data". However, science experiments have tended to analyze their data using specialized custom-built [[high-performance computing]] (super-computing) clusters and grids, rather than clouds of cheap commodity computers as in the current commercial wave, implying a difference in both culture and technology stack.
===Critiques of big data execution===
[[Ulf-Dietrich Reips]] and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research.<ref name="pigdata" /> Researcher [[danah boyd]] has raised concerns about the use of big data in science neglecting principles such as choosing a [[Sampling (statistics)|representative sample]] by being too concerned about handling the huge amounts of data.<ref name="danah">{{cite web | url=http://www.danah.org/papers/talks/2010/WWW2010.html | title=Privacy and Publicity in the Context of Big Data | author=danah boyd | work=[[World Wide Web Conference|WWW 2010 conference]] | date=29 April 2010 | access-date = 18 April 2011| author-link=danah boyd }}</ref> This approach may lead to results that have a [[Bias (statistics)|bias]] in one way or another.<ref>{{Cite journal|last=Katyal|first=Sonia K.|date=2019|title=Artificial Intelligence, Advertising, and Disinformation|url=https://muse.jhu.edu/article/745987|journal=Advertising & Society Quarterly|language=en|volume=20|issue=4|doi=10.1353/asr.2019.0026|s2cid=213397212|issn=2475-1790}}</ref> Integration across heterogeneous data resources—some that might be considered big data and others not—presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.<ref>{{cite journal |last1=Jones |first1=MB |last2=Schildhauer |first2=MP |last3=Reichman |first3=OJ |last4=Bowers | first4=S |title=The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere | journal=Annual Review of Ecology, Evolution, and Systematics |volume=37 |issue=1 |pages=519–544 |year=2006 |doi=10.1146/annurev.ecolsys.37.091305.110031 |url= http://www.pnamp.org/sites/default/files/Jones2006_AREES.pdf }}</ref>
In the provocative article "Critical Questions for Big Data",<ref name="danah2">{{cite journal | doi = 10.1080/1369118X.2012.678878| title = Critical Questions for Big Data| journal = Information, Communication & Society| volume = 15| issue = 5| pages = 662–679| year = 2012| last1 = Boyd | first1 = D. | last2 = Crawford | first2 = K. | s2cid = 51843165| hdl = 10983/1320| hdl-access = free}}</ref> the authors title big data a part of [[mythology]]: "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy". Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth".<ref name="danah2" /> Recent developments in the BI domain, such as pro-active reporting, especially target improvements in the usability of big data, through automated [[Filter (software)|filtering]] of [[spurious relationship|non-useful data and correlations]].<ref name="Big Decisions White Paper">[http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares--pro-active-reporting_EN.pdf Failure to Launch: From Big Data to Big Decisions] {{Webarchive|url=https://web.archive.org/web/20161206145026/http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares--pro-active-reporting_EN.pdf |date=6 December 2016 }}, Forte Wares.</ref> Big structures are full of spurious correlations,<ref>{{Cite web | url=https://www.tylervigen.com/spurious-correlations | title=15 Insane Things That Correlate with Each Other}}</ref> whether because of non-causal coincidences (the [[law of truly large numbers]]), the sheer nature of big randomness<ref>[https://onlinelibrary.wiley.com/loi/10982418 Random structures & algorithms]</ref> ([[Ramsey theory]]), or the existence of [[confounding factor|non-included factors]], so the hope of early experimenters that large databases of numbers would "speak for themselves" and revolutionize the scientific method is questioned.<ref>Cristian S. Calude, Giuseppe Longo, (2016), The Deluge of Spurious Correlations in Big Data, [[Foundations of Science]]</ref>
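How easily "big structures" yield spurious correlations can be shown with a few lines of Python (an illustrative simulation, not from the cited sources): scanning all pairs of many series of pure noise, one reliably finds a pair that looks strongly correlated.

```python
import math
import random
from itertools import combinations

random.seed(1)

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 200 completely unrelated random series of 20 observations each.
series = [[random.random() for _ in range(20)] for _ in range(200)]

# Scan all ~19,900 pairs for the strongest "relationship".
best = max(abs(pearson(a, b)) for a, b in combinations(series, 2))
print(f"strongest correlation found in pure noise: {best:.2f}")
```

With this many pairs, the maximum correlation among unrelated variables is typically well above 0.5, which is exactly the law-of-truly-large-numbers effect the critique describes.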
Big data analysis is often shallow compared to analysis of smaller data sets.<ref name="kdnuggets-berchthold">{{cite web|url=http://www.kdnuggets.com/2014/08/interview-michael-berthold-knime-research-big-data-privacy-part2.html|title=Interview: Michael Berthold, KNIME Founder, on Research, Creativity, Big Data, and Privacy, Part 2|date=12 August 2014|author=Gregory Piatetsky| author-link= Gregory I. Piatetsky-Shapiro|publisher=KDnuggets|access-date=13 August 2014}}</ref> In many big data projects, there is no large data analysis happening, but the challenge is the [[extract, transform, load]] part of data pre-processing.<ref name="kdnuggets-berchthold" />
Big data is a [[buzzword]] and a "vague term",<ref>{{cite news|last1=Pelt|first1=Mason|title="Big Data" is an over used buzzword and this Twitter bot proves it|url= http://siliconangle.com/blog/2015/10/26/big-data-is-an-over-used-buzzword-and-this-twitter-bot-proves-it/ |newspaper=Siliconangle|access-date=4 November 2015|date=26 October 2015}}</ref><ref name="ft-harford">{{cite web |url=http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html |title=Big data: are we making a big mistake? |last1=Harford |first1=Tim |date=28 March 2014 |website=[[Financial Times]] |access-date=7 April 2014}}</ref> but at the same time an "obsession"<ref name="ft-harford" /> among entrepreneurs, consultants, scientists, and the media. Big data showcases such as [[Google Flu Trends]] failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, [[Academy Awards]] and election predictions based solely on Twitter were more often off than on target.
Big data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. In particular data sources such as Twitter are not representative of the overall population, and results drawn from such sources may then lead to wrong conclusions. [[Google Translate]]—which is based on big data statistical analysis of text—does a good job at translating web pages. However, results from specialized domains may be dramatically skewed.
On the other hand, big data may also introduce new problems, such as the [[multiple comparisons problem]]: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant.
Ioannidis argued that "most published research findings are false"<ref name="Ioannidis">{{cite journal | vauthors = Ioannidis JP | title = Why most published research findings are false | journal = PLOS Medicine | volume = 2 | issue = 8 | pages = e124 | date = August 2005 | pmid = 16060722 | pmc = 1182327 | doi = 10.1371/journal.pmed.0020124 | author-link1 = John P. A. Ioannidis }}</ref> due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a big amount of scientific data; although not with big data technology), the likelihood of a "significant" result being false grows fast – even more so, when only positive results are published.
<!-- sorry, this started overlapping with above section more and more... merging is welcome; I already dropped the intended subheadline "Hype cycle and inflated expectations". -->
Furthermore, big data analytics results are only as good as the model on which they are predicated. In an example, big data took part in attempting to predict the results of the 2016 U.S. Presidential Election<ref>{{Cite news|url=https://www.nytimes.com/2016/11/10/technology/the-data-said-clinton-would-win-why-you-shouldnt-have-believed-it.html|title=How Data Failed Us in Calling an Election |last1=Lohr|first1=Steve|date=10 November 2016|last2=Singer|first2=Natasha|newspaper=The New York Times|issn=0362-4331|access-date=27 November 2016}}</ref> with varying degrees of success.
=== Critiques of big data policing and surveillance ===
Big data has been used in policing and surveillance by institutions like [[Law enforcement in the United States|law enforcement]] and [[Corporate surveillance|corporations]].<ref>{{Cite news|url=https://www.economist.com/open-future/2018/06/04/how-data-driven-policing-threatens-human-freedom|title=How data-driven policing threatens human freedom|date=4 June 2018|newspaper=The Economist|access-date=27 October 2019|issn=0013-0613}}</ref> Due to the less visible nature of data-based surveillance as compared to traditional methods of policing, objections to big data policing are less likely to arise. According to Sarah Brayne's ''Big Data Surveillance: The Case of Policing'',<ref>{{Cite journal|last=Brayne|first=Sarah|s2cid=3609838|date=29 August 2017|title=Big Data Surveillance: The Case of Policing|journal=American Sociological Review |volume=82|issue=5|pages=977–1008|language=en|doi=10.1177/0003122417725865}}</ref> big data policing can reproduce existing [[Social inequality|societal inequalities]] in three ways:
* Placing suspected criminals under increased surveillance by using the justification of a mathematical and therefore unbiased algorithm
* Increasing the scope and number of people that are subject to law enforcement tracking and exacerbating existing [[Race in the United States criminal justice system#Racial inequality in incarceration|racial overrepresentation]] in the criminal justice system
* Encouraging members of society to abandon interactions with institutions that would create a digital trace, thus creating obstacles to social inclusion
If these potential problems are not corrected or regulated, the effects of big data policing may continue to shape societal hierarchies. Brayne also notes that conscientious usage of big data policing could prevent individual-level biases from becoming institutional biases.
==In popular culture==
===Books===
*''[[Moneyball]]'' is a non-fiction book that explores how the Oakland Athletics used statistical analysis to outperform teams with larger budgets. In 2011 a [[Moneyball (film)|film adaptation]] starring [[Brad Pitt]] was released.
===Film===
*In ''[[Captain America: The Winter Soldier]]'', H.Y.D.R.A. (disguised as [[S.H.I.E.L.D]]) develops helicarriers that use data to determine and eliminate threats around the globe.
*In ''[[The Dark Knight (film)|The Dark Knight]]'', [[Batman]] uses a sonar device that can spy on all of [[Gotham City]]. The data is gathered from the mobile phones of people within the city.
== See also ==
{{Category see also|LABEL=For a list of companies, and tools, see also|Big data}}
<!-- NO COMPANIES OR TOOL SPAM HERE. That would be an endless list! "See also" concepts, not linked above. -->
{{columns-list|colwidth=15em|
*[[Big data ethics]]
*[[Big Data Maturity Model]]
*[[Big memory]]
*[[Data curation]]
*[[Data defined storage]]
*[[Data lineage]]
*[[Data philanthropy]]
*[[Data science]]
*[[Datafication]]
*[[Document-oriented database]]
*[[In-memory processing]]
*[[List of big data companies]]
*[[Urban informatics]]
*[[Very large database]]
*[[XLDB]]}}
== References ==
{{Reflist
|refs =
<!-- unused<ref name="2017-07-18_Gartner">{{cite web
| url = https://research.gartner.com/definition-whatis-big-data
| title = Gartner IT Glossary > Big Data – From the Gartner IT Glossary: What is Big Data?
| publisher = [[Gartner]]
| access-date = 18 July 2017
| archive-url = https://web.archive.org/web/20170718161704/https://research.gartner.com/definition-whatis-big-data
| archive-date = 18 July 2017
| quote = Gartner IT Glossary > Big Data From the Gartner IT Glossary: What is Big Data? Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
}}</ref> -->
}}
== Further reading ==
{{Library resources box}}
* {{cite magazine|editor1=Peter Kinnaird |editor2=Inbal Talgam-Cohen|magazine=[[XRDS (magazine)|XRDS: Crossroads, The ACM Magazine for Students]]|title=Big Data|volume=19 |issue=1|date=2012|publisher=[[Association for Computing Machinery]]|issn=1528-4980 |oclc=779657714 |url=http://dl.acm.org/citation.cfm?id=2331042}}
* {{cite book|title=Mining of massive datasets|author1=Jure Leskovec|author2-link=Anand Rajaraman|author2=Anand Rajaraman|author3-link=Jeffrey D. Ullman|author3=Jeffrey D. Ullman|year=2014|publisher=Cambridge University Press|url=http://mmds.org/|isbn=9781107077232 |oclc=888463433|author1-link=Jure Leskovec}}
* {{cite book|author1=Viktor Mayer-Schönberger|author2-link=Kenneth Cukier|author2=Kenneth Cukier|title=Big Data: A Revolution that Will Transform how We Live, Work, and Think|date=2013|publisher=Houghton Mifflin Harcourt|isbn=9781299903029 |oclc=828620988|author1-link=Viktor Mayer-Schönberger}}
* {{cite news |url=https://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data |title=A Very Short History of Big Data |first=Gil |last=Press |work=forbes.com |date=9 May 2013 |access-date=17 September 2016 |location=Jersey City, NJ}}
* {{cite book |title=Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are |year=2017 |first=Seth |last=Stephens-Davidowitz |publisher=Dey Street Books |isbn=978-0062390851}}
* {{cite magazine |url=https://hbr.org/2012/10/big-data-the-management-revolution |title=Big Data: The Management Revolution|magazine=Harvard Business Review |date=October 2012|work=}}
* {{cite book |title=Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy |first=Cathy |last= O'Neil |year=2017 |publisher=Broadway Books |isbn=978-0553418835}}
== External links ==
*{{Commonsinline}}
* {{Wiktionary-inline|big data}}
{{Authority control}}
[[Category:Big data| ]]
[[Category:Data management]]
[[Category:Distributed computing problems]]
[[Category:Transaction processing]]
[[Category:Technology forecasting]]
[[Category:Data analysis]]
[[Category:Databases]]
<noinclude>
<small>This page was moved from [[wikipedia:en:Big data]]. Its edit history can be viewed at [[大数据/edithistory]]</small></noinclude>