Changes

367 bytes removed, 19:17, 6 March 2022 (Sunday)
No edit summary
Line 1: Line 1:
This article is about large collections of data. For the band, see Big Data (band). For the practice of buying and selling of personal and consumer data, see Surveillance capitalism.
 
[[File:Hilbert InfoGrowth.png|thumb|right|400px|Non-linear growth of global digital information-storage capacity and the decline of analog storage.<ref>{{cite journal|url= http://www.martinhilbert.net/WorldInfoCapacity.html|title= The World's Technological Capacity to Store, Communicate, and Compute Information|volume= 332|issue= 6025|pages= 60–65|journal=Science|access-date= 13 April 2016|bibcode= 2011Sci...332...60H|last1= Hilbert|first1= Martin|last2= López|first2= Priscila|year= 2011|doi= 10.1126/science.1200970|pmid= 21310967|s2cid= 206531385}}</ref>]]
+
[[File:Hilbert InfoGrowth.png|thumb|right|400px|Non-linear growth of global digital information-storage capacity and the decline of analog storage.<ref>{{cite journal|url= http://www.martinhilbert.net/WorldInfoCapacity.html|title= The World's Technological Capacity to Store, Communicate, and Compute Information|volume= 332|issue= 6025|pages= 60–65|journal=Science|access-date= 13 April 2016|bibcode= 2011Sci...332...60H|last1= Hilbert|first1= Martin|last2= López|first2= Priscila|year= 2011|doi= 10.1126/science.1200970|pmid= 21310967}}</ref>]]
Line 6: Line 6:       −
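Current usage of the term "big data" tends to refer to [[预测分析|predictive analytics]], [[用户行为分析|user behavior analytics]], or other advanced data analytics methods that extract value from big data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem."<ref>{{cite journal |last1=boyd |first1=dana |last2=Crawford |first2=Kate |title=Six Provocations for Big Data |journal=Social Science Research Network: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society |date=21 September 2011 |doi= 10.2139/ssrn.1926431|s2cid=148610111 |url=http://osf.io/nrjhn/ }}</ref> Analysis of data sets can reveal new correlations to "spot business trends, prevent diseases, combat crime and so on". Scientists, business executives, medical practitioners, advertisers, and governments alike regularly meet difficulties with large data sets in areas including Internet search, fintech, healthcare analytics, geographic information systems, urban informatics, and business informatics. Scientists also encounter limitations in e-Science work, including meteorology, genomics,<ref>{{cite journal | title = Community cleverness required | journal = Nature | volume = 455 | issue = 7209 | pages = 1 | date = September 2008 | pmid = 18769385 | doi = 10.1038/455001a | bibcode = 2008Natur.455....1. | doi-access = free }}</ref> connectomics, complex physics simulations, biology, and environmental research.<ref>{{cite journal | vauthors = Reichman OJ, Jones MB, Schildhauer MP | title = Challenges and opportunities of open data in ecology | journal = Science | volume = 331 | issue = 6018 | pages = 703–5 | date = February 2011 | pmid = 21311007 | doi = 10.1126/science.1197962 | bibcode = 2011Sci...331..703R | s2cid = 22686503 | url = https://escholarship.org/uc/item/7627s45z }}</ref>
+
Current usage of the term "big data" tends to refer to [[预测分析|predictive analytics]], [[用户行为分析|user behavior analytics]], or other advanced data analytics methods that extract value from big data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem."<ref>{{cite journal |last1=boyd |first1=dana |last2=Crawford |first2=Kate |title=Six Provocations for Big Data |journal=Social Science Research Network: A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society |date=21 September 2011 |doi= 10.2139/ssrn.1926431|url=http://osf.io/nrjhn/ }}</ref> Analysis of data sets can reveal new correlations to "spot business trends, prevent diseases, combat crime and so on". Scientists, business executives, medical practitioners, advertisers, and governments alike regularly meet difficulties with large data sets in areas including Internet search, fintech, healthcare analytics, geographic information systems, urban informatics, and business informatics. Scientists also encounter limitations in e-Science work, including meteorology, genomics,<ref>{{cite journal | title = Community cleverness required | journal = Nature | volume = 455 | issue = 7209 | pages = 1 | date = September 2008 | pmid = 18769385 | doi = 10.1038/455001a | bibcode = 2008Natur.455....1. | doi-access = free }}</ref> connectomics, complex physics simulations, biology, and environmental research.<ref>{{cite journal | vauthors = Reichman OJ, Jones MB, Schildhauer MP | title = Challenges and opportunities of open data in ecology | journal = Science | volume = 331 | issue = 6018 | pages = 703–5 | date = February 2011 | pmid = 21311007 | doi = 10.1126/science.1197962 | bibcode = 2011Sci...331..703R | url = https://escholarship.org/uc/item/7627s45z }}</ref>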
      −
The size and number of available data sets have grown rapidly as data is collected by devices such as mobile devices, cheap and numerous information-sensing Internet of things devices, aerial (remote sensing) equipment, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.<ref>{{cite web |author= Hellerstein, Joe |title= Parallel Programming in the Age of Big Data |date= 9 November 2008 |work= Gigaom Blog |url= http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/}}</ref><ref>{{cite book |first1= Toby |last1= Segaran |first2= Jeff |last2= Hammerbacher |title= Beautiful Data: The Stories Behind Elegant Data Solutions |url= https://books.google.com/books?id=zxNglqU1FKgC |year= 2009 |publisher= O'Reilly Media |isbn= 978-0-596-15711-1 |page= 257}}</ref> The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s,<ref name="martinhilbert.net">{{cite journal | vauthors = Hilbert M, López P | title = The world's technological capacity to store, communicate, and compute information | journal = Science | volume = 332 | issue = 6025 | pages = 60–5 | date = April 2011 | pmid = 21310967 | doi = 10.1126/science.1200970 | url = http://www.uvm.edu/pdodds/files/papers/others/2011/hilbert2011a.pdf | bibcode = 2011Sci...332...60H | s2cid = 206531385 }}</ref> and about 2.5 exabytes (2.5×2<sup>60</sup> bytes) of data are generated every day.<ref>{{cite web|url= http://www.ibm.com/big-data/us/en/ |title= IBM What is big data? – Bringing big data to the enterprise |publisher= ibm.com |access-date= 26 August 2013}}</ref> According to a forecast in an IDC report, global data volume was expected to grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. IDC also predicts that there will be 163 zettabytes of data by 2025.<ref>{{Cite web| url=https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf| title=Data Age 2025: The Evolution of Data to Life-Critical|last1=Reinsel|first1=David|last2=Gantz|first2=John|date=13 April 2017|website=seagate.com|publisher=[[International Data Corporation]]|location=Framingham, MA, US|access-date=2 November 2017|last3=Rydning|first3=John}}</ref> One question facing large enterprises is therefore who should own big data initiatives that affect the entire organization.<ref>Oracle and FSN, [http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity "Mastering Big Data: CFO Strategies to Transform Insight into Opportunity"], December 2012</ref>
+
The size and number of available data sets have grown rapidly as data is collected by devices such as mobile devices, cheap and numerous information-sensing Internet of things devices, aerial (remote sensing) equipment, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.<ref>{{cite web |author= Hellerstein, Joe |title= Parallel Programming in the Age of Big Data |date= 9 November 2008 |work= Gigaom Blog |url= http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming/}}</ref><ref>{{cite book |first1= Toby |last1= Segaran |first2= Jeff |last2= Hammerbacher |title= Beautiful Data: The Stories Behind Elegant Data Solutions |url= https://books.google.com/books?id=zxNglqU1FKgC |year= 2009 |publisher= O'Reilly Media |isbn= 978-0-596-15711-1 |page= 257}}</ref> The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s,<ref name="martinhilbert.net">{{cite journal | vauthors = Hilbert M, López P | title = The world's technological capacity to store, communicate, and compute information | journal = Science | volume = 332 | issue = 6025 | pages = 60–5 | date = April 2011 | pmid = 21310967 | doi = 10.1126/science.1200970 | url = http://www.uvm.edu/pdodds/files/papers/others/2011/hilbert2011a.pdf | bibcode = 2011Sci...332...60H }}</ref> and about 2.5 exabytes (2.5×2<sup>60</sup> bytes) of data are generated every day.<ref>{{cite web|url= http://www.ibm.com/big-data/us/en/ |title= IBM What is big data? – Bringing big data to the enterprise |publisher= ibm.com |access-date= 26 August 2013}}</ref> According to a forecast in an IDC report, global data volume was expected to grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. IDC also predicts that there will be 163 zettabytes of data by 2025.<ref>{{Cite web| url=https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf| title=Data Age 2025: The Evolution of Data to Life-Critical|last1=Reinsel|first1=David|last2=Gantz|first2=John|date=13 April 2017|website=seagate.com|publisher=[[International Data Corporation]]|location=Framingham, MA, US|access-date=2 November 2017|last3=Rydning|first3=John}}</ref> One question facing large enterprises is therefore who should own big data initiatives that affect the entire organization.<ref>Oracle and FSN, [http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity "Mastering Big Data: CFO Strategies to Transform Insight into Opportunity"], December 2012</ref>
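As a rough check on the growth figures quoted in this paragraph, the following Python sketch converts them into annual growth rates and doubling times. It assumes smooth compound growth, which the cited sources do not claim, and the function names are illustrative rather than taken from any report. The 40-month figure describes per-capita storage capacity while the IDC figures describe total data volume, so the two rates are not directly comparable.

```python
import math


def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by growing from start to end."""
    return (end / start) ** (1 / years) - 1


def doubling_time_months(annual_rate):
    """Months needed to double at a constant compound annual rate."""
    return 12 * math.log(2) / math.log(1 + annual_rate)


# IDC forecast quoted in the text: 4.4 ZB (2013) to 44 ZB (2020).
idc_rate = annual_growth_rate(4.4, 44.0, 2020 - 2013)
idc_doubling = doubling_time_months(idc_rate)

# Per-capita storage capacity doubling every 40 months, as quoted in the text.
storage_rate = 2 ** (12 / 40) - 1

print(f"IDC forecast: {idc_rate:.2%} per year, doubling every {idc_doubling:.1f} months")
print(f"40-month doubling: {storage_rate:.2%} per year")
```

Under these assumptions a tenfold increase over seven years corresponds to a doubling time of roughly two years, noticeably faster than the 40-month doubling reported for per-capita storage capacity.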
Line 25: Line 25:       −
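In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly cited characteristics of big data appears consistently across all of the analyzed cases.<ref>{{cite journal | last1 = Kitchin | first1 = Rob | last2 = McArdle | first2 = Gavin | year = 2016 | title = What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets | journal = Big Data & Society | volume = 3 | pages = 1–10 | doi = 10.1177/2053951716631130 | s2cid = 55539845 }}</ref> For this reason, other studies identified the redefinition of power dynamics in knowledge discovery as the defining trait of big data.<ref>{{cite journal | last1 = Balazka | first1 = Dominik | last2 = Rodighiero | first2 = Dario | year = 2020 | title = Big Data and the Little Big Bang: An Epistemological (R)evolution | journal = Frontiers in Big Data | volume = 3 | page = 31 | doi = 10.3389/fdata.2020.00031 | pmid = 33693404 | pmc = 7931920 | hdl = 1721.1/128865 | hdl-access = free | doi-access = free }}</ref> Instead of focusing on the intrinsic characteristics of big data, this alternative perspective pushes forward a relational understanding of the object, claiming that what matters is the way in which data is collected, stored, made available, and analyzed.
+
In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly cited characteristics of big data appears consistently across all of the analyzed cases.<ref>{{cite journal | last1 = Kitchin | first1 = Rob | last2 = McArdle | first2 = Gavin | year = 2016 | title = What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets | journal = Big Data & Society | volume = 3 | pages = 1–10 | doi = 10.1177/2053951716631130}}</ref> For this reason, other studies identified the redefinition of power dynamics in knowledge discovery as the defining trait of big data.<ref>{{cite journal | last1 = Balazka | first1 = Dominik | last2 = Rodighiero | first2 = Dario | year = 2020 | title = Big Data and the Little Big Bang: An Epistemological (R)evolution | journal = Frontiers in Big Data | volume = 3 | page = 31 | doi = 10.3389/fdata.2020.00031 | pmid = 33693404 | pmc = 7931920 | hdl = 1721.1/128865 | hdl-access = free | doi-access = free }}</ref> Instead of focusing on the intrinsic characteristics of big data, this alternative perspective pushes forward a relational understanding of the object, claiming that what matters is the way in which data is collected, stored, made available, and analyzed.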
Line 119: Line 119:       −
Multidimensional big data can also be represented as OLAP data cubes or, mathematically, as tensors. Array database systems already provide storage and high-level query support for this data type.<ref>{{cite web |title=Future Directions in Tensor-Based Computation and Modeling |date=May 2009|url=http://www.cs.cornell.edu/cv/tenwork/finalreport.pdf}}</ref> Additional technologies being applied to big data include efficient tensor-based computation, such as multilinear subspace learning,<ref name="MSLsurvey">{{cite journal | first1 = Haiping | last1 = Lu | first2 = K.N. | last2 = Plataniotis | first3 = A.N. | last3 = Venetsanopoulos | url = http://www.dsp.utoronto.ca/~haiping/Publication/SurveyMSL_PR2011.pdf | title = A Survey of Multilinear Subspace Learning for Tensor Data | journal = Pattern Recognition | volume = 44 | number = 7 | pages = 1540–1551 | year = 2011 | doi = 10.1016/j.patcog.2011.01.004 | bibcode = 2011PatRe..44.1540L }}</ref> massively parallel processing (MPP) databases, search-based applications, data mining,<ref>{{cite book|last1=Pllana|first1=Sabri|title=2011 14th International Conference on Network-Based Information Systems|pages=341–348|last2=Janciak|first2=Ivan|last3=Brezany|first3=Peter|last4=Wöhrer|first4=Alexander|chapter=A Survey of the State of the Art in Data Mining and Integration Query Languages |website=2011 International Conference on Network-Based Information Systems (NBIS 2011)|publisher=IEEE Computer Society|bibcode=2016arXiv160301113P|year=2016|arxiv=1603.01113|doi=10.1109/NBiS.2011.58|isbn=978-1-4577-0789-6|s2cid=9285984}}</ref> distributed file systems, distributed caches (such as burst buffer and Memcached), distributed databases, cloud and HPC-based infrastructure (applications, storage, and computing resources),<ref>{{cite book|chapter=Characterization and Optimization of Memory-Resident MapReduce on HPC Systems|publisher=IEEE|date=October 2014|doi=10.1109/IPDPS.2014.87|title=2014 IEEE 28th International Parallel and Distributed Processing Symposium|pages=799–808|last1=Wang|first1=Yandong|last2=Goldstone|first2=Robin|last3=Yu|first3=Weikuan|last4=Wang|first4=Teng|s2cid=11157612|isbn=978-1-4799-3800-1}}</ref> and the Internet. Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data.<ref>{{Cite journal|last1=L'Heureux|first1=A.|last2=Grolinger|first2=K.|last3=Elyamany|first3=H. F.|last4=Capretz|first4=M. A. M.|date=2017|title=Machine Learning With Big Data: Challenges and Approaches|journal=IEEE Access|volume=5|pages=7776–7797|doi=10.1109/ACCESS.2017.2696365|issn=2169-3536|doi-access=free}}</ref>
+
Multidimensional big data can also be represented as OLAP data cubes or, mathematically, as tensors. Array database systems already provide storage and high-level query support for this data type.<ref>{{cite web |title=Future Directions in Tensor-Based Computation and Modeling |date=May 2009|url=http://www.cs.cornell.edu/cv/tenwork/finalreport.pdf}}</ref> Additional technologies being applied to big data include efficient tensor-based computation, such as multilinear subspace learning,<ref name="MSLsurvey">{{cite journal | first1 = Haiping | last1 = Lu | first2 = K.N. | last2 = Plataniotis | first3 = A.N. | last3 = Venetsanopoulos | url = http://www.dsp.utoronto.ca/~haiping/Publication/SurveyMSL_PR2011.pdf | title = A Survey of Multilinear Subspace Learning for Tensor Data | journal = Pattern Recognition | volume = 44 | number = 7 | pages = 1540–1551 | year = 2011 | doi = 10.1016/j.patcog.2011.01.004 | bibcode = 2011PatRe..44.1540L }}</ref> massively parallel processing (MPP) databases, search-based applications, data mining,<ref>{{cite book|last1=Pllana|first1=Sabri|title=2011 14th International Conference on Network-Based Information Systems|pages=341–348|last2=Janciak|first2=Ivan|last3=Brezany|first3=Peter|last4=Wöhrer|first4=Alexander|chapter=A Survey of the State of the Art in Data Mining and Integration Query Languages |website=2011 International Conference on Network-Based Information Systems (NBIS 2011)|publisher=IEEE Computer Society|bibcode=2016arXiv160301113P|year=2016|arxiv=1603.01113|doi=10.1109/NBiS.2011.58|isbn=978-1-4577-0789-6}}</ref> distributed file systems, distributed caches (such as burst buffer and Memcached), distributed databases, cloud and HPC-based infrastructure (applications, storage, and computing resources),<ref>{{cite book|chapter=Characterization and Optimization of Memory-Resident MapReduce on HPC Systems|publisher=IEEE|date=October 2014|doi=10.1109/IPDPS.2014.87|title=2014 IEEE 28th International Parallel and Distributed Processing Symposium|pages=799–808|last1=Wang|first1=Yandong|last2=Goldstone|first2=Robin|last3=Yu|first3=Weikuan|last4=Wang|first4=Teng|isbn=978-1-4799-3800-1}}</ref> and the Internet. Although many approaches and technologies have been developed, it remains difficult to carry out machine learning with big data.<ref>{{Cite journal|last1=L'Heureux|first1=A.|last2=Grolinger|first2=K.|last3=Elyamany|first3=H. F.|last4=Capretz|first4=M. A. M.|date=2017|title=Machine Learning With Big Data: Challenges and Approaches|journal=IEEE Access|volume=5|pages=7776–7797|doi=10.1109/ACCESS.2017.2696365|issn=2169-3536|doi-access=free}}</ref>
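To make the idea of an OLAP cube as a tensor more concrete, the sketch below stores a small three-dimensional cube as nested Python lists and rolls it up by summing out one axis. The dimensions (region, product, quarter), the figures, and the helper name `roll_up` are all invented for illustration; an actual array database or OLAP engine would expose its own query interface rather than this code.

```python
# Hypothetical cube indexed as cube[region][product][quarter].
regions = ["north", "south"]
products = ["a", "b", "c"]
quarters = ["q1", "q2"]

cube = [
    [[10, 12], [5, 7], [0, 3]],  # north
    [[8, 9], [6, 6], [2, 4]],    # south
]


def roll_up(cube, axis):
    """Sum out one axis of a 3-D nested list, returning a 2-D nested list."""
    n0, n1, n2 = len(cube), len(cube[0]), len(cube[0][0])
    if axis == 0:  # aggregate over regions -> product x quarter
        return [[sum(cube[i][j][k] for i in range(n0)) for k in range(n2)]
                for j in range(n1)]
    if axis == 1:  # aggregate over products -> region x quarter
        return [[sum(cube[i][j][k] for j in range(n1)) for k in range(n2)]
                for i in range(n0)]
    if axis == 2:  # aggregate over quarters -> region x product
        return [[sum(cube[i][j][k] for k in range(n2)) for j in range(n1)]
                for i in range(n0)]
    raise ValueError("axis must be 0, 1, or 2")


sales_by_region_and_product = roll_up(cube, axis=2)
sales_by_region_and_quarter = roll_up(cube, axis=1)

for region, row in zip(regions, sales_by_region_and_product):
    print(region, dict(zip(products, row)))
```

Nested lists keep the example dependency-free; for cubes of realistic size, an array library or an array database would be the more likely choice.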
Line 141: Line 141:       −
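Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007.<ref name="martinhilbert.net"/> Predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data,<ref name="HilbertContent">{{cite journal|title= What is the Content of the World's Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?| doi= 10.1080/01972243.2013.873748 | volume=30| issue=2 |journal=The Information Society|pages=127–143|year = 2014|last1 = Hilbert|first1 = Martin| s2cid= 45759014 | url= https://escholarship.org/uc/item/87w5f6wb }}</ref> which is the format most useful for most big data applications. This also shows the potential of as yet unused data (in the form of video and audio content).
+
Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007.<ref name="martinhilbert.net"/> Predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data,<ref name="HilbertContent">{{cite journal|title= What is the Content of the World's Technologically Mediated Information and Communication Capacity: How Much Text, Image, Audio, and Video?| doi= 10.1080/01972243.2013.873748 | volume=30| issue=2 |journal=The Information Society|pages=127–143|year = 2014|last1 = Hilbert|first1 = Martin| url= https://escholarship.org/uc/item/87w5f6wb }}</ref> which is the format most useful for most big data applications. This also shows the potential of as yet unused data (in the form of video and audio content).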
    
While many vendors offer off-the-shelf products for big data, experts promote the development of in-house, custom-tailored systems if the company has sufficient technical capabilities.<ref>{{cite web |url=http://www.kdnuggets.com/2014/07/interview-amy-gershkoff-ebay-in-house-BI-tools.html |title=Interview: Amy Gershkoff, Director of Customer Analytics & Insights, eBay on How to Design Custom In-House BI Tools |last1=Rajpurohit |first1=Anmol |date=11 July 2014 |website= KDnuggets|access-date=14 July 2014|quote=Generally, I find that off-the-shelf business intelligence tools do not meet the needs of clients who want to derive custom insights from their data. Therefore, for medium-to-large organizations with access to strong technical talent, I usually recommend building custom, in-house solutions.}}</ref>
Line 180: Line 180:
=== Healthcare ===
Big data analytics has been used in healthcare to provide personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms, and patient registries.<ref name="ref135">{{cite journal | vauthors = Huser V, Cimino JJ | title = Impending Challenges for the Use of Big Data | journal = International Journal of Radiation Oncology, Biology, Physics | volume = 95 | issue = 3 | pages = 890–894 | date = July 2016 | pmid = 26797535 | pmc = 4860172 | doi = 10.1016/j.ijrobp.2015.10.060 }}</ref><ref>{{Cite book|title=Signal Processing and Machine Learning for Biomedical Big Data.|others=Sejdić, Ervin, Falk, Tiago H.|isbn=9781351061216|location=[Place of publication not identified]|oclc=1044733829|last1 = Sejdic|first1 = Ervin|last2 = Falk|first2 = Tiago H.|date = 4 July 2018}}</ref><ref>{{cite journal | vauthors = Raghupathi W, Raghupathi V | title = Big data analytics in healthcare: promise and potential | journal = Health Information Science and Systems | volume = 2 | issue = 1 | pages = 3 | date = December 2014 | pmid = 25825667 | pmc = 4341817 | doi = 10.1186/2047-2501-2-3 }}</ref><ref>{{cite journal | vauthors = Viceconti M, Hunter P, Hose R | title = Big data, big knowledge: big data for personalized healthcare | journal = IEEE Journal of Biomedical and Health Informatics | volume = 19 | issue = 4 | pages = 1209–15 | date = July 2015 | pmid = 26218867 | doi = 10.1109/JBHI.2015.2406883 | s2cid = 14710821 | url = http://eprints.whiterose.ac.uk/89104/1/pap%20JBHI%20BigData%20in%20VPH%20revision%20v2.pdf | doi-access = free }}</ref> Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the wider adoption of mHealth, eHealth, and wearable technologies, the volume of data will continue to increase. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of difficult-to-process data. Such environments now have an even greater need to pay attention to data and information quality.<ref>{{cite journal|title=Data Management Within mHealth Environments: Patient Sensors, Mobile Devices, and Databases |first1=John| last1=O'Donoghue |first2=John|last2=Herbert|s2cid=2318649|date=1 October 2012|volume=4|issue=1|pages=5:1–5:20| doi=10.1145/2378016.2378021 |journal=Journal of Data and Information Quality}}</ref> "Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth." Human inspection at the big data scale is impossible, and health services urgently need intelligent tools for accuracy and believability control and for handling missing information.<ref name="Mirkes2016">{{cite journal | vauthors = Mirkes EM, Coats TJ, Levesley J, Gorban AN | title = Handling missing data in large healthcare dataset: A case study of unknown trauma outcomes | journal = Computers in Biology and Medicine | volume = 75 | pages = 203–16 | date = August 2016 | pmid = 27318570 | doi = 10.1016/j.compbiomed.2016.06.004 | arxiv = 1604.00627 | bibcode = 2016arXiv160400627M | s2cid = 5874067 }}</ref> While extensive information in healthcare is now electronic, it fits under the big data umbrella because most of it is unstructured and difficult to use.<ref>{{cite journal | vauthors = Murdoch TB, Detsky AS | title = The inevitable application of big data to health care | journal = JAMA | volume = 309 | issue = 13 | pages = 1351–2 | date = April 2013 | pmid = 23549579 | doi = 10.1001/jama.2013.393 }}</ref> The use of big data in healthcare has raised significant ethical challenges, ranging from risks to individual rights, privacy, and autonomy to transparency and trust.<ref>{{cite journal | vauthors = Vayena E, Salathé M, Madoff LC, Brownstein JS | title = Ethical challenges of big data in public health | journal = PLOS Computational Biology | volume = 11 | issue = 2 | pages = e1003904 | date = February 2015 | pmid = 25664461 | pmc = 4321985 | doi = 10.1371/journal.pcbi.1003904 | bibcode = 2015PLSCB..11E3904V }}</ref>
+
Big data analytics has been used in healthcare to provide personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms, and patient registries.<ref name="ref135">{{cite journal | vauthors = Huser V, Cimino JJ | title = Impending Challenges for the Use of Big Data | journal = International Journal of Radiation Oncology, Biology, Physics | volume = 95 | issue = 3 | pages = 890–894 | date = July 2016 | pmid = 26797535 | pmc = 4860172 | doi = 10.1016/j.ijrobp.2015.10.060 }}</ref><ref>{{Cite book|title=Signal Processing and Machine Learning for Biomedical Big Data.|others=Sejdić, Ervin, Falk, Tiago H.|isbn=9781351061216|location=[Place of publication not identified]|oclc=1044733829|last1 = Sejdic|first1 = Ervin|last2 = Falk|first2 = Tiago H.|date = 4 July 2018}}</ref><ref>{{cite journal | vauthors = Raghupathi W, Raghupathi V | title = Big data analytics in healthcare: promise and potential | journal = Health Information Science and Systems | volume = 2 | issue = 1 | pages = 3 | date = December 2014 | pmid = 25825667 | pmc = 4341817 | doi = 10.1186/2047-2501-2-3 }}</ref><ref>{{cite journal | vauthors = Viceconti M, Hunter P, Hose R | title = Big data, big knowledge: big data for personalized healthcare | journal = IEEE Journal of Biomedical and Health Informatics | volume = 19 | issue = 4 | pages = 1209–15 | date = July 2015 | pmid = 26218867 | doi = 10.1109/JBHI.2015.2406883 | url = http://eprints.whiterose.ac.uk/89104/1/pap%20JBHI%20BigData%20in%20VPH%20revision%20v2.pdf | doi-access = free }}</ref> Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the wider adoption of mHealth, eHealth, and wearable technologies, the volume of data will continue to increase. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of difficult-to-process data. Such environments now have an even greater need to pay attention to data and information quality.<ref>{{cite journal|title=Data Management Within mHealth Environments: Patient Sensors, Mobile Devices, and Databases |first1=John| last1=O'Donoghue |first2=John|last2=Herbert|date=1 October 2012|volume=4|issue=1|pages=5:1–5:20| doi=10.1145/2378016.2378021 |journal=Journal of Data and Information Quality}}</ref> "Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth." Human inspection at the big data scale is impossible, and health services urgently need intelligent tools for accuracy and believability control and for handling missing information.<ref name="Mirkes2016">{{cite journal | vauthors = Mirkes EM, Coats TJ, Levesley J, Gorban AN | title = Handling missing data in large healthcare dataset: A case study of unknown trauma outcomes | journal = Computers in Biology and Medicine | volume = 75 | pages = 203–16 | date = August 2016 | pmid = 27318570 | doi = 10.1016/j.compbiomed.2016.06.004 | arxiv = 1604.00627 | bibcode = 2016arXiv160400627M }}</ref> While extensive information in healthcare is now electronic, it fits under the big data umbrella because most of it is unstructured and difficult to use.<ref>{{cite journal | vauthors = Murdoch TB, Detsky AS | title = The inevitable application of big data to health care | journal = JAMA | volume = 309 | issue = 13 | pages = 1351–2 | date = April 2013 | pmid = 23549579 | doi = 10.1001/jama.2013.393 }}</ref> The use of big data in healthcare has raised significant ethical challenges, ranging from risks to individual rights, privacy, and autonomy to transparency and trust.<ref>{{cite journal | vauthors = Vayena E, Salathé M, Madoff LC, Brownstein JS | title = Ethical challenges of big data in public health | journal = PLOS Computational Biology | volume = 11 | issue = 2 | pages = e1003904 | date = February 2015 | pmid = 25664461 | pmc = 4321985 | doi = 10.1371/journal.pcbi.1003904 | bibcode = 2015PLSCB..11E3904V }}</ref>
Line 187: Line 187:       −
A sub-area of healthcare that relies heavily on big data is computer-aided diagnosis in medicine.<ref name="CAD7challenges">{{cite journal | vauthors = Yanase J, Triantaphyllou E| title = A Systematic Survey of Computer-Aided Diagnosis in Medicine: Past and Present Developments. | journal = Expert Systems with Applications | volume = 138 | pages = 112821 | date = 2019 | doi = 10.1016/j.eswa.2019.112821 | s2cid = 199019309 }}</ref> For instance, epilepsy monitoring customarily creates 5 to 10 GB of data daily.<ref name=":1">{{cite journal | vauthors = Dong X, Bahroos N, Sadhu E, Jackson T, Chukhman M, Johnson R, Boyd A, Hynes D| title = Leverage Hadoop framework for large scale clinical informatics applications | journal = AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science | pages = 53 | date = 2013 | volume = 2013 | pmid = 24303235 }}</ref> Similarly, a single uncompressed breast tomosynthesis image averages 450 MB of data.<ref name=":2">{{cite journal | vauthors = Clunie D| title = Breast tomosynthesis challenges digital imaging infrastructure | url = http://www.auntminnie.com/index.aspx?sec=prtf&sub=def&pag=dis&itemId=102872&printpage=true&fsec=ser&fsub=def  | date = 2013 }}</ref> These are just a few of the many examples in which computer-aided diagnosis uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome.<ref>
+
A sub-area of healthcare that relies heavily on big data is computer-aided diagnosis in medicine.<ref name="CAD7challenges">{{cite journal | vauthors = Yanase J, Triantaphyllou E| title = A Systematic Survey of Computer-Aided Diagnosis in Medicine: Past and Present Developments. | journal = Expert Systems with Applications | volume = 138 | pages = 112821 | date = 2019 | doi = 10.1016/j.eswa.2019.112821 }}</ref> For instance, epilepsy monitoring customarily creates 5 to 10 GB of data daily.<ref name=":1">{{cite journal | vauthors = Dong X, Bahroos N, Sadhu E, Jackson T, Chukhman M, Johnson R, Boyd A, Hynes D| title = Leverage Hadoop framework for large scale clinical informatics applications | journal = AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science | pages = 53 | date = 2013 | volume = 2013 | pmid = 24303235 }}</ref> Similarly, a single uncompressed breast tomosynthesis image averages 450 MB of data.<ref name=":2">{{cite journal | vauthors = Clunie D| title = Breast tomosynthesis challenges digital imaging infrastructure | url = http://www.auntminnie.com/index.aspx?sec=prtf&sub=def&pag=dis&itemId=102872&printpage=true&fsec=ser&fsub=def  | date = 2013 }}</ref> These are just a few of the many examples in which computer-aided diagnosis uses big data. For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome.<ref>
{{cite journal | vauthors = Yanase J, Triantaphyllou E | title = The Seven Key Challenges for the Future of Computer-Aided Diagnosis in Medicine | journal =  International Journal of Medical Informatics| volume = 129 | pages = 413–422 | year = 2019 | doi = 10.1016/j.ijmedinf.2019.06.017 | pmid = 31445285 | s2cid = 198287435 }}
+
{{cite journal | vauthors = Yanase J, Triantaphyllou E | title = The Seven Key Challenges for the Future of Computer-Aided Diagnosis in Medicine | journal =  International Journal of Medical Informatics| volume = 129 | pages = 413–422 | year = 2019 | doi = 10.1016/j.ijmedinf.2019.06.017 | pmid = 31445285}}
 
</ref>
 
</ref>
Line 203: Line 203:
|access-date=21 February 2016
 
 
|url=https://venturebeat.com/2014/04/15/ny-gets-new-bootcamp-for-data-scientists-its-free-but-harder-to-get-into-than-harvard/
 
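}}</ref> In the specific field of marketing, one of the problems stressed by Wedel and Kannan<ref>{{cite journal|last=Wedel|first=Michel|author2=Kannan, PK|title= Marketing Analytics for Data-Rich Environments|journal=Journal of Marketing|year=2016|volume=80|issue=6|doi= 10.1509/jm.15.0413|pages=97–121|s2cid=168410284}}</ref> is that marketing has several sub-domains (for example, advertising, promotions, product development, and branding) that all use different types of data.
+
}}</ref> In the specific field of marketing, one of the problems stressed by Wedel and Kannan<ref>{{cite journal|last=Wedel|first=Michel|author2=Kannan, PK|title= Marketing Analytics for Data-Rich Environments|journal=Journal of Marketing|year=2016|volume=80|issue=6|doi= 10.1509/jm.15.0413|pages=97–121}}</ref> is that marketing has several sub-domains (for example, advertising, promotions, product development, and branding) that all use different types of data.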
Line 342: Line 342:       −
Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors of stock market moves, using trading strategies based on search volume data provided by Google Trends.<ref>{{cite journal | url =http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball | journal=Nature | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013| author-link=Philip Ball }}</ref> Their analysis of [[Google]] search volume for 98 terms of varying financial relevance, published in ''[[Scientific Reports]]'',<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE | title = Quantifying trading behavior in financial markets using Google Trends | journal = Scientific Reports | volume = 3 | pages = 1684 | year = 2013 | pmid = 23619126 | pmc = 3635219 | doi = 10.1038/srep01684 | bibcode = 2013NatSR...3E1684P }}</ref> suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.<ref>{{cite news | url=http://bits.blogs.nytimes.com/2013/04/26/google-search-terms-can-predict-stock-market-study-finds/ | title= Google Search Terms Can Predict Stock Market, Study Finds | author=Nick Bilton | work=[[The New York Times]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite magazine | url=http://business.time.com/2013/04/26/trouble-with-your-investment-portfolio-google-it/ | title=Trouble With Your Investment Portfolio? Google It! | author=Christopher Matthews | magazine=[[Time (magazine)|Time]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite journal | url= http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball |journal=[[Nature (journal)|Nature]] | date=26 April 2013 | doi=10.1038/nature.2013.12879 | s2cid=167357427 | access-date=9 August 2013}}</ref><ref>{{cite news | url=http://www.businessweek.com/articles/2013-04-25/big-data-researchers-turn-to-google-to-beat-the-markets | title='Big Data' Researchers Turn to Google to Beat the Markets | author=Bernhard Warner | work=[[Bloomberg Businessweek]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url=https://www.independent.co.uk/news/business/comment/hamish-mcrae/hamish-mcrae-need-a-valuable-handle-on-investor-sentiment-google-it-8590991.html | title=Hamish McRae: Need a valuable handle on investor sentiment? Google it | author=Hamish McRae | work=[[The Independent]] | date=28 April 2013 | access-date=9 August 2013 | location=London}}</ref><ref>{{cite web | url=http://www.ft.com/intl/cms/s/0/e5d959b8-acf2-11e2-b27f-00144feabdc0.html | title= Google search proves to be new word in stock market prediction | author=Richard Waters | work=[[Financial Times]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url =https://www.bbc.co.uk/news/science-environment-22293693 | title=Google searches predict market moves | author=Jason Palmer | work=[[BBC]] | date=25 April 2013 | access-date=9 August 2013}}</ref>
+
Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors of stock market moves, using trading strategies based on search volume data provided by Google Trends.<ref>{{cite journal | url =http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball | journal=Nature | date=26 April 2013 | doi=10.1038/nature.2013.12879 | access-date=9 August 2013| author-link=Philip Ball }}</ref> Their analysis of [[Google]] search volume for 98 terms of varying financial relevance, published in ''[[Scientific Reports]]'',<ref>{{cite journal | vauthors = Preis T, Moat HS, Stanley HE | title = Quantifying trading behavior in financial markets using Google Trends | journal = Scientific Reports | volume = 3 | pages = 1684 | year = 2013 | pmid = 23619126 | pmc = 3635219 | doi = 10.1038/srep01684 | bibcode = 2013NatSR...3E1684P }}</ref> suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.<ref>{{cite news | url=http://bits.blogs.nytimes.com/2013/04/26/google-search-terms-can-predict-stock-market-study-finds/ | title= Google Search Terms Can Predict Stock Market, Study Finds | author=Nick Bilton | work=[[The New York Times]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite magazine | url=http://business.time.com/2013/04/26/trouble-with-your-investment-portfolio-google-it/ | title=Trouble With Your Investment Portfolio? Google It! | author=Christopher Matthews | magazine=[[Time (magazine)|Time]] | date=26 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite journal | url= http://www.nature.com/news/counting-google-searches-predicts-market-movements-1.12879 | title=Counting Google searches predicts market movements | author=Philip Ball |journal=[[Nature (journal)|Nature]] | date=26 April 2013 | doi=10.1038/nature.2013.12879 | access-date=9 August 2013}}</ref><ref>{{cite news | url=http://www.businessweek.com/articles/2013-04-25/big-data-researchers-turn-to-google-to-beat-the-markets | title='Big Data' Researchers Turn to Google to Beat the Markets | author=Bernhard Warner | work=[[Bloomberg Businessweek]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url=https://www.independent.co.uk/news/business/comment/hamish-mcrae/hamish-mcrae-need-a-valuable-handle-on-investor-sentiment-google-it-8590991.html | title=Hamish McRae: Need a valuable handle on investor sentiment? Google it | author=Hamish McRae | work=[[The Independent]] | date=28 April 2013 | access-date=9 August 2013 | location=London}}</ref><ref>{{cite web | url=http://www.ft.com/intl/cms/s/0/e5d959b8-acf2-11e2-b27f-00144feabdc0.html | title= Google search proves to be new word in stock market prediction | author=Richard Waters | work=[[Financial Times]] | date=25 April 2013 | access-date=9 August 2013}}</ref><ref>{{cite news | url =https://www.bbc.co.uk/news/science-environment-22293693 | title=Google searches predict market moves | author=Jason Palmer | work=[[BBC]] | date=25 April 2013 | access-date=9 August 2013}}</ref>
      Line 366: Line 366:       −
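Along the same lines, it has been pointed out that decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is".<ref name="HilbertBigData2013">Hilbert, M. (2016). Big Data for Development: A Review of Promises and Challenges. Development Policy Review, 34(1), 135–174. https://doi.org/10.1111/dpr.12142 free access: https://www.martinhilbert.net/big-data-for-development/</ref> Fed with large amounts of data on past experience, algorithms can predict future developments if the future resembles the past. If the system's dynamics change in the future (that is, if the process is not stationary), the past can say little about the future.<ref name="HilbertTEDx">[https://www.youtube.com/watch?v=UXef6yfJZAI Big Data requires Big Visions for Big Change.], Hilbert, M. (2014). London: TEDx UCL, x=independently organized TED talks</ref> Making predictions in changing environments therefore requires a thorough understanding of the system's dynamics.<ref name="HilbertTEDx"/> In response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge".<ref>{{cite journal|last=Alemany Oliver|first=Mathieu |author2=Vayre, Jean-Sebastien |s2cid=111360835 |title= Big Data and the Future of Knowledge Production in Marketing Research: Ethics, Digital Traces, and Abductive Reasoning|journal=Journal of Marketing Analytics |year=2015|volume=3|issue=1|doi= 10.1057/jma.2015.1|pages=5–13}}</ref> It has also been suggested that big data approaches be combined with computer simulations, such as agent-based models<ref name="HilbertBigData2013" /> and complex systems. Agent-based models are becoming increasingly good at predicting the outcome of social complexities, even for unknown future scenarios, through computer simulations based on a collection of mutually interdependent algorithms.<ref>{{cite web|url= https://www.theatlantic.com/magazine/archive/2002/04/seeing-around-corners/302471/| title=Seeing Around Corners|author=Jonathan Rauch|date=1 April 2002|work=[[The Atlantic]]}}</ref><ref>Epstein, J. M., & Axtell, R. L. (1996). Growing Artificial Societies: Social Science from the Bottom Up. A Bradford Book.</ref> Finally, multivariate methods that probe the latent structure of the data, such as factor analysis and cluster analysis, have proven useful as analytic approaches that go well beyond the bivariate approaches typically employed with smaller data sets.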
+
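Along the same lines, it has been pointed out that decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is".<ref name="HilbertBigData2013">Hilbert, M. (2016). Big Data for Development: A Review of Promises and Challenges. Development Policy Review, 34(1), 135–174. https://doi.org/10.1111/dpr.12142 free access: https://www.martinhilbert.net/big-data-for-development/</ref> Fed with large amounts of data on past experience, algorithms can predict future developments if the future resembles the past. If the system's dynamics change in the future (that is, if the process is not stationary), the past can say little about the future.<ref name="HilbertTEDx">[https://www.youtube.com/watch?v=UXef6yfJZAI Big Data requires Big Visions for Big Change.], Hilbert, M. (2014). London: TEDx UCL, x=independently organized TED talks</ref> Making predictions in changing environments therefore requires a thorough understanding of the system's dynamics.<ref name="HilbertTEDx"/> In response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge".<ref>{{cite journal|last=Alemany Oliver|first=Mathieu |author2=Vayre, Jean-Sebastien|title= Big Data and the Future of Knowledge Production in Marketing Research: Ethics, Digital Traces, and Abductive Reasoning|journal=Journal of Marketing Analytics |year=2015|volume=3|issue=1|doi= 10.1057/jma.2015.1|pages=5–13}}</ref> It has also been suggested that big data approaches be combined with computer simulations, such as agent-based models<ref name="HilbertBigData2013" /> and complex systems. Agent-based models are becoming increasingly good at predicting the outcome of social complexities, even for unknown future scenarios, through computer simulations based on a collection of mutually interdependent algorithms.<ref>{{cite web|url= https://www.theatlantic.com/magazine/archive/2002/04/seeing-around-corners/302471/| title=Seeing Around Corners|author=Jonathan Rauch|date=1 April 2002|work=[[The Atlantic]]}}</ref><ref>Epstein, J. M., & Axtell, R. L. (1996). Growing Artificial Societies: Social Science from the Bottom Up. A Bradford Book.</ref> Finally, multivariate methods that probe the latent structure of the data, such as factor analysis and cluster analysis, have proven useful as analytic approaches that go well beyond the bivariate approaches typically employed with smaller data sets.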
      Line 375: Line 375:       −
Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in the context of big data and of giant corporations that hold vast amounts of information, and that the use of big data should be better regulated at the national and international levels.<ref>{{Cite news|url=http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|title=The Social Contract 2.0: Big Data and the Need to Guarantee Privacy and Civil Liberties – Harvard International Review|last=Al-Rodhan|first=Nayef|date=16 September 2014|work=Harvard International Review|access-date=3 April 2017|archive-url=https://web.archive.org/web/20170413090835/http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|archive-date=13 April 2017|url-status=dead}}</ref> Barocas and Nissenbaum argue that one way of protecting individual users is to inform them about the types of information being collected, with whom it is shared, under what constraints, and for what purposes.<ref>{{Cite book|title=Big Data's End Run around Anonymity and Consent| last1 =Barocas |first1=Solon |last2=Nissenbaum |first2=Helen|last3=Lane|first3=Julia|last4=Stodden|first4=Victoria|last5=Bender|first5=Stefan|last6=Nissenbaum|first6=Helen| s2cid =152939392|date=June 2014| publisher =Cambridge University Press|isbn=9781107067356|pages=44–75|doi =10.1017/cbo9781107590205.004}}</ref>
+
Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in the context of big data and of giant corporations that hold vast amounts of information, and that the use of big data should be better regulated at the national and international levels.<ref>{{Cite news|url=http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|title=The Social Contract 2.0: Big Data and the Need to Guarantee Privacy and Civil Liberties – Harvard International Review|last=Al-Rodhan|first=Nayef|date=16 September 2014|work=Harvard International Review|access-date=3 April 2017|archive-url=https://web.archive.org/web/20170413090835/http://hir.harvard.edu/the-social-contract-2-0-big-data-and-the-need-to-guarantee-privacy-and-civil-liberties/|archive-date=13 April 2017|url-status=dead}}</ref> Barocas and Nissenbaum argue that one way of protecting individual users is to inform them about the types of information being collected, with whom it is shared, under what constraints, and for what purposes.<ref>{{Cite book|title=Big Data's End Run around Anonymity and Consent| last1 =Barocas |first1=Solon |last2=Nissenbaum |first2=Helen|last3=Lane|first3=Julia|last4=Stodden|first4=Victoria|last5=Bender|first5=Stefan|last6=Nissenbaum|first6=Helen|date=June 2014| publisher =Cambridge University Press|isbn=9781107067356|pages=44–75|doi =10.1017/cbo9781107590205.004}}</ref>
      Line 392: Line 392:     
=== Critiques of big data execution ===
 
=== Critiques of big data execution ===
Ulf Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research.<ref name="pigdata" /> Researcher Danah Boyd has raised concerns that the use of big data in science often neglects principles such as choosing a representative sample, because researchers are too concerned with handling huge amounts of data.<ref name="danah">{{cite web | url=http://www.danah.org/papers/talks/2010/WWW2010.html | title=Privacy and Publicity in the Context of Big Data | author=danah boyd | work=[[World Wide Web Conference|WWW 2010 conference]] | date=29 April 2010 | access-date = 18 April 2011| author-link=danah boyd }}</ref> This approach may lead to results that are biased in one way or another.<ref>{{Cite journal|last=Katyal|first=Sonia K.|date=2019|title=Artificial Intelligence, Advertising, and Disinformation|url=https://muse.jhu.edu/article/745987|journal=Advertising & Society Quarterly|language=en|volume=20|issue=4|doi=10.1353/asr.2019.0026|s2cid=213397212|issn=2475-1790}}</ref> Integration across heterogeneous data resources (some that might be considered big data and others not) presents formidable logistical and analytical challenges, but many researchers argue that such integration is likely to represent the most promising new frontier in science.<ref>{{cite journal |last1=Jones |first1=MB |last2=Schildhauer |first2=MP |last3=Reichman |first3=OJ |last4=Bowers | first4=S |title=The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere | journal=Annual Review of Ecology, Evolution, and Systematics |volume=37 |issue=1 |pages=519–544 |year=2006 |doi=10.1146/annurev.ecolsys.37.091305.110031 |url= http://www.pnamp.org/sites/default/files/Jones2006_AREES.pdf }}</ref> In the provocative article "Critical Questions for Big Data",<ref name="danah2">{{cite journal | doi = 10.1080/1369118X.2012.678878| title = Critical Questions for Big Data| journal = Information, Communication & Society| volume = 15| issue = 5| pages = 662–679| year = 2012| last1 = Boyd | first1 = D. | last2 = Crawford | first2 = K. | s2cid = 51843165| hdl = 10983/1320| hdl-access = free}}</ref> the authors describe big data as part of a mythology: "large data sets offer a higher form of intelligence and knowledge ...". Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth".<ref name="danah2" /> Recent developments in the business intelligence (BI) domain, such as pro-active reporting, specifically aim to improve the usability of big data through automated filtering of non-useful data and correlations.<ref name="Big Decisions White Paper">[http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares--pro-active-reporting_EN.pdf Failure to Launch: From Big Data to Big Decisions] Forte Wares.</ref> Big data is full of spurious correlations,<ref>{{Cite web | url=https://www.tylervigen.com/spurious-correlations | title=15 Insane Things That Correlate with Each Other}}</ref> whether because of non-causal coincidences (law of truly large numbers), the very nature of large-scale randomness<ref>[https://onlinelibrary.wiley.com/loi/10982418 Random structures & algorithms]</ref> (Ramsey theory), or other factors not yet identified. As a result, both the hope of early experimenters that large databases of numbers would "speak for themselves" and the claimed revolution in scientific method have been called into question.<ref>Cristian S. Calude, Giuseppe Longo, (2016), The Deluge of Spurious Correlations in Big Data, [[Foundations of Science]]</ref>
+
Ulf Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research.<ref name="pigdata" /> Researcher Danah Boyd has raised concerns that the use of big data in science often neglects principles such as choosing a representative sample, because researchers are too concerned with handling huge amounts of data.<ref name="danah">{{cite web | url=http://www.danah.org/papers/talks/2010/WWW2010.html | title=Privacy and Publicity in the Context of Big Data | author=danah boyd | work=[[World Wide Web Conference|WWW 2010 conference]] | date=29 April 2010 | access-date = 18 April 2011| author-link=danah boyd }}</ref> This approach may lead to results that are biased in one way or another.<ref>{{Cite journal|last=Katyal|first=Sonia K.|date=2019|title=Artificial Intelligence, Advertising, and Disinformation|url=https://muse.jhu.edu/article/745987|journal=Advertising & Society Quarterly|language=en|volume=20|issue=4|doi=10.1353/asr.2019.0026|issn=2475-1790}}</ref> Integration across heterogeneous data resources (some that might be considered big data and others not) presents formidable logistical and analytical challenges, but many researchers argue that such integration is likely to represent the most promising new frontier in science.<ref>{{cite journal |last1=Jones |first1=MB |last2=Schildhauer |first2=MP |last3=Reichman |first3=OJ |last4=Bowers | first4=S |title=The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere | journal=Annual Review of Ecology, Evolution, and Systematics |volume=37 |issue=1 |pages=519–544 |year=2006 |doi=10.1146/annurev.ecolsys.37.091305.110031 |url= http://www.pnamp.org/sites/default/files/Jones2006_AREES.pdf }}</ref> In the provocative article "Critical Questions for Big Data",<ref name="danah2">{{cite journal | doi = 10.1080/1369118X.2012.678878| title = Critical Questions for Big Data| journal = Information, Communication & Society| volume = 15| issue = 5| pages = 662–679| year = 2012| last1 = Boyd | first1 = D. | last2 = Crawford | first2 = K. | hdl = 10983/1320| hdl-access = free}}</ref> the authors describe big data as part of a mythology: "large data sets offer a higher form of intelligence and knowledge ...". Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth".<ref name="danah2" /> Recent developments in the business intelligence (BI) domain, such as pro-active reporting, specifically aim to improve the usability of big data through automated filtering of non-useful data and correlations.<ref name="Big Decisions White Paper">[http://www.fortewares.com/Administrator/userfiles/Banner/forte-wares--pro-active-reporting_EN.pdf Failure to Launch: From Big Data to Big Decisions] Forte Wares.</ref> Big data is full of spurious correlations,<ref>{{Cite web | url=https://www.tylervigen.com/spurious-correlations | title=15 Insane Things That Correlate with Each Other}}</ref> whether because of non-causal coincidences (law of truly large numbers), the very nature of large-scale randomness<ref>[https://onlinelibrary.wiley.com/loi/10982418 Random structures & algorithms]</ref> (Ramsey theory), or other factors not yet identified. As a result, both the hope of early experimenters that large databases of numbers would "speak for themselves" and the claimed revolution in scientific method have been called into question.<ref>Cristian S. Calude, Giuseppe Longo, (2016), The Deluge of Spurious Correlations in Big Data, [[Foundations of Science]]</ref>
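The point about spurious correlations can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from any of the cited sources: it draws many mutually independent random series, so every true correlation is zero, and reports the strongest correlation found between any pair. The function names, series counts and seed are arbitrary choices.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongest_spurious_correlation(n_series, n_points, seed=0):
    """Largest |r| over all pairs of independent Gaussian noise series."""
    rng = random.Random(seed)
    series = [
        [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        for _ in range(n_series)
    ]
    return max(abs(pearson(a, b)) for a, b in combinations(series, 2))


if __name__ == "__main__":
    # More variables means more pairs to search, so the best "finding"
    # tends to look stronger even though nothing is actually related.
    for n_series in (5, 50, 200):
        r = strongest_spurious_correlation(n_series, n_points=10)
        print(f"{n_series:>3} series: strongest |r| = {r:.2f}")
```

With 200 series there are 19,900 pairs to compare, so a few of them will correlate strongly by chance alone. This is why a correlation surfaced by an exhaustive search needs independent confirmation.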
      Line 404: Line 404:     
=== Critiques of big data policing and surveillance ===
 
=== Critiques of big data policing and surveillance ===
Big data has been used in policing and surveillance by institutions such as law enforcement agencies and corporations.<ref>{{Cite news|url=https://www.economist.com/open-future/2018/06/04/how-data-driven-policing-threatens-human-freedom|title=How data-driven policing threatens human freedom|date=4 June 2018|newspaper=The Economist|access-date=27 October 2019|issn=0013-0613}}</ref> Because data-based surveillance is less visible than traditional methods of policing, objections to big data policing are less likely to arise. According to Sarah Brayne's "Big Data Surveillance: The Case of Policing",<ref>{{Cite journal|last=Brayne|first=Sarah|s2cid=3609838|date=29 August 2017|title=Big Data Surveillance: The Case of Policing|journal=American Sociological Review |volume=82|issue=5|pages=977–1008|language=en|doi=10.1177/0003122417725865}}</ref> big data policing can reproduce existing social inequalities in three ways:
+
Big data has been used in policing and surveillance by institutions such as law enforcement agencies and corporations.<ref>{{Cite news|url=https://www.economist.com/open-future/2018/06/04/how-data-driven-policing-threatens-human-freedom|title=How data-driven policing threatens human freedom|date=4 June 2018|newspaper=The Economist|access-date=27 October 2019|issn=0013-0613}}</ref> Because data-based surveillance is less visible than traditional methods of policing, objections to big data policing are less likely to arise. According to Sarah Brayne's "Big Data Surveillance: The Case of Policing",<ref>{{Cite journal|last=Brayne|first=Sarah|date=29 August 2017|title=Big Data Surveillance: The Case of Policing|journal=American Sociological Review |volume=82|issue=5|pages=977–1008|language=en|doi=10.1177/0003122417725865}}</ref> big data policing can reproduce existing social inequalities in three ways:
    
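*Placing suspected criminals under increased surveillance, justified by appeal to a mathematical and therefore ostensibly unbiased algorithm.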
 
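*Placing suspected criminals under increased surveillance, justified by appeal to a mathematical and therefore ostensibly unbiased algorithm.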