Unstructured data


Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. The resulting irregularities and ambiguities make it harder for traditional programs to interpret than data stored in fielded form in databases or annotated (semantically tagged) in documents.



In 1998, Merrill Lynch said "unstructured data comprises the vast majority of data found in an organization, some estimates run as high as 80%."[1] It's unclear what the source of this number is, but nonetheless it is accepted by some.[2] Other sources have reported similar or higher percentages of unstructured data.[3][4][5]



As of 2012, IDC and Dell EMC projected that data would grow to 40 zettabytes by 2020, a 50-fold increase from the beginning of 2010.[6] More recently, IDC and Seagate predicted that the global datasphere will grow to 163 zettabytes by 2025,[7] and that the majority of it will be unstructured. Computerworld magazine states that unstructured information might account for more than 70%–80% of all data in organizations.[note 1]


Background

The earliest research into business intelligence focused on unstructured textual data rather than numerical data.[8] As early as 1958, computer science researchers such as H.P. Luhn were particularly concerned with the extraction and classification of unstructured text.[8] However, only since the turn of the 21st century has the technology caught up with the research interest. In 2004, the SAS Institute developed the SAS Text Miner, which uses Singular Value Decomposition (SVD) to reduce a high-dimensional textual space to fewer dimensions for significantly more efficient machine analysis.[9] The mathematical and technological advances sparked by machine textual analysis prompted a number of businesses to research applications, leading to the development of fields like sentiment analysis, voice of the customer mining, and call center optimization.[10] The emergence of Big Data in the late 2000s led to a heightened interest in the applications of unstructured data analytics in contemporary fields such as predictive analytics and root cause analysis.[11]
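
The dimensionality reduction mentioned above can be illustrated with a small sketch. This is not the SAS Text Miner implementation; it only shows the general idea of applying SVD to a term-document matrix. The toy corpus, the variable names, and the choice of k = 2 are assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np

# Toy corpus; the documents are invented for illustration only.
docs = [
    "refund late delivery refund",
    "late delivery complaint",
    "great product great price",
    "great price fast delivery",
]

# Term-document count matrix (rows: terms, columns: documents).
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(term) for doc in docs] for term in vocab],
    dtype=float,
)

# Thin SVD: counts == U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep the k largest singular values to obtain a k-dimensional
# representation of each document (assumed k = 2 for this toy example).
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # shape: (n_docs, k)

print("term-document matrix:", counts.shape)
print("reduced document vectors:", doc_vectors.shape)
```

Each document is now described by k numbers instead of one count per vocabulary term, which is the sense in which the textual space is reduced.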



Issues with terminology

The term is imprecise for several reasons:


  1. Structure, while not formally defined, can still be implied.
  2. Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
  3. Unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated or unannounced.
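
An e-mail message is one familiar case of such mixed structure: the headers follow a defined format, while the body is free text. The sketch below uses Python's standard email module to separate the two. The message content and addresses are invented for illustration.

```python
from email import message_from_string

# Invented sample message: structured headers followed by a free-text body.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Invoice question\n"
    "Date: Fri, 24 Jun 2016 09:30:00 +0000\n"
    "\n"
    "Hi Bob, I was charged twice last week. Can you check?\n"
)

message = message_from_string(raw_message)

# The headers can be read as named fields...
headers = {name: message[name] for name in ("From", "To", "Subject", "Date")}

# ...while the body is plain text with no predefined fields.
body = message.get_payload()

print(headers)
print(body)
```

Whether this message counts as "unstructured" depends on the task: routing by sender only needs the headers, while finding out what the customer wants requires interpreting the body.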


Dealing with unstructured data

Techniques such as data mining, natural language processing (NLP), and text analytics provide different methods to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. The Unstructured Information Management Architecture (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.[12]
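
As a minimal sketch of creating structured data from free text, the following example pulls dates and monetary amounts out of a sentence with regular expressions. It does not use UIMA or any NLP library; the patterns, the ExtractedRecord class, and the sample note are assumptions chosen for illustration, and real text would need far more robust handling.

```python
import re
from dataclasses import dataclass, field

@dataclass
class ExtractedRecord:
    """Hypothetical container for fields pulled out of free text."""
    dates: list = field(default_factory=list)
    amounts: list = field(default_factory=list)

# Assumed formats: ISO dates (YYYY-MM-DD) and dollar amounts like $5 or $19.99.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")

def extract_fields(text):
    """Return the dates and amounts found in a piece of free text."""
    return ExtractedRecord(
        dates=DATE_PATTERN.findall(text),
        amounts=[float(value) for value in AMOUNT_PATTERN.findall(text)],
    )

note = "Customer called on 2016-06-24 about a $19.99 charge; refunded $5 on 2016-06-27."
record = extract_fields(note)
print(record.dates)
print(record.amounts)
```

The resulting record can be stored in fielded form, although the pattern-based approach captures none of the relationships between the values, such as which amount was refunded.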



Software that creates machine-processable structure can utilize the linguistic, auditory, and visual structure that exists in all forms of human communication.[13] Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery. Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, Web page, or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, …) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as "unstructured data".[14] For example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
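
The HTML point can be made concrete with a short sketch based on Python's standard html.parser module. The TextAndTagCollector class and the sample page are hypothetical; the example only shows that the tags a parser sees describe layout, not what the text means.

```python
from html.parser import HTMLParser

class TextAndTagCollector(HTMLParser):
    """Hypothetical helper that records start tags and visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)

# Invented page: the mark-up says "heading", "paragraph", "bold",
# but nothing says "this is a due date" or "this is an amount".
page = "<html><body><h1>Invoice</h1><p>Due <b>2016-06-24</b>: $19.99</p></body></html>"

collector = TextAndTagCollector()
collector.feed(page)
collector.close()

print(collector.tags)
print(collector.text_chunks)
```

The page is machine-parseable, yet a program still has to infer from the text itself that the bold string is a date and the number is a price.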



Since unstructured data commonly occurs in electronic documents, the use of a content or document management system which can categorize entire documents is often preferred over data transfer and manipulation from within the documents. Document management thus provides the means to convey structure onto document collections.



Search engines have become popular tools for indexing and searching through such data, especially text.
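
A minimal sketch of the indexing idea is an inverted index, which maps each term to the documents that contain it. The document identifiers, texts, and the helper functions tokenize, build_index, and search below are assumptions for illustration; production search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented sample documents.
documents = {
    "email-1": "Meeting moved to Friday, see attached agenda.",
    "memo-7": "Agenda for the quarterly budget meeting.",
    "note-3": "Budget approved on Friday.",
}

index = build_index(documents)
print(sorted(search(index, "agenda meeting")))
print(sorted(search(index, "budget friday")))
```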



Approaches in natural language processing

Specific computational workflows have been developed to impose structure upon the unstructured data contained within text documents. These workflows are generally designed to handle sets of thousands or even millions of documents, far more than manual approaches to annotation can handle. Several of these approaches are based upon the concept of online analytical processing, or OLAP, and may be supported by data models such as text cubes.[15] Once document metadata is available through a data model, summaries of subsets of documents (i.e., cells within a text cube) can be generated with phrase-based approaches.[16]
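
As a rough illustration of the text cube idea, the following sketch groups documents into cells by their metadata dimensions and summarizes each cell by its most frequent terms. The sample documents, the dimension names (year, topic), and the helper functions build_text_cube and summarize_cell are hypothetical, and plain term counts stand in for the phrase-based summarization described in the cited work.

```python
from collections import Counter, defaultdict

# Invented documents: metadata dimensions plus free text.
documents = [
    {"year": 2015, "topic": "cardiology", "text": "heart failure risk heart rate"},
    {"year": 2015, "topic": "cardiology", "text": "heart valve repair"},
    {"year": 2016, "topic": "oncology", "text": "tumor growth tumor marker"},
    {"year": 2016, "topic": "cardiology", "text": "stroke risk"},
]

def build_text_cube(docs, dimensions):
    """Group document texts into cells keyed by the chosen dimensions."""
    cube = defaultdict(list)
    for doc in docs:
        cell = tuple(doc[dimension] for dimension in dimensions)
        cube[cell].append(doc["text"])
    return cube

def summarize_cell(texts, top_n=2):
    """Summarize a cell by its most frequent terms (a simplification)."""
    counts = Counter(word for text in texts for word in text.split())
    return [term for term, _ in counts.most_common(top_n)]

# Two-dimensional cube: one cell per (year, topic) combination.
cube = build_text_cube(documents, ("year", "topic"))
for cell, texts in sorted(cube.items()):
    print(cell, summarize_cell(texts))

# OLAP-style roll-up: drop the year dimension and aggregate by topic only.
rolled_up = build_text_cube(documents, ("topic",))
print({cell: len(texts) for cell, texts in rolled_up.items()})
```

Choosing which dimensions to keep determines the granularity of each summary, which is the OLAP-style operation the paragraph refers to.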



Approaches in medicine and biomedical research

Biomedical research generates one major source of unstructured data as researchers often publish their findings in scholarly journals. Though the language in these documents is challenging to derive structural elements from (e.g., due to the complicated technical vocabulary contained within and the domain knowledge required to fully contextualize observations), the results of these activities may yield links between technical and medical studies[17] and clues regarding new disease therapies.[18] Recent efforts to enforce structure upon biomedical documents include self-organizing map approaches for identifying topics among documents,[19] general-purpose unsupervised algorithms,[20] and an application of the CaseOLAP workflow[16] to determine associations between protein names and cardiovascular disease topics in the literature.[21] CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient manner. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.[21]




Notes

  1. Today’s Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn’t An Option, Noel Yuhanna, Principal Analyst, Forrester Research, Nov 2010


References

  1. Shilakes, Christopher C.; Tylman, Julie (16 Nov 1998). "Enterprise Information Portals" (PDF). Merrill Lynch. Archived from the original (PDF) on 24 July 2011.
  2. Grimes, Seth (1 August 2008). "Unstructured Data and the 80 Percent Rule". Breakthrough Analysis - Bridgepoints. Clarabridge.
  3. Gandomi, Amir; Haider, Murtaza (April 2015). "Beyond the hype: Big data concepts, methods, and analytics". International Journal of Information Management. 35 (2): 137–144. doi:10.1016/j.ijinfomgt.2014.10.007. ISSN 0268-4012.
  4. "The biggest data challenges that you might not even know you have - Watson". Watson. 2016-05-25. Retrieved 2018-10-02.
  5. "Structured vs. Unstructured Data". www.datamation.com. Retrieved 2018-10-02.
  6. "EMC News Press Release: New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World's Data is Analyzed; Less Than 20% is Protected". www.emc.com. EMC Corporation. December 2012.
  7. "Trends | Seagate US". Seagate.com. Retrieved 2018-10-01.
  8. Grimes, Seth. "A Brief History of Text Analytics". B Eye Network. Retrieved June 24, 2016.
  9. Albright, Russ. "Taming Text with the SVD" (PDF). SAS. Retrieved June 24, 2016.
  10. Desai, Manish (2009-08-09). "Applications of Text Analytics". My Business Analytics @ Blogspot. Retrieved June 24, 2016.
  11. Chakraborty, Goutam. "Analysis of Unstructured Data: Applications of Text Analytics and Sentiment Mining" (PDF). SAS. Retrieved June 24, 2016.
  12. Holzinger, Andreas; Stocker, Christof; Ofner, Bernhard; Prohaska, Gottfried; Brabenetz, Alberto; Hofmann-Wellenhof, Rainer (2013). "Combining HCI, Natural Language Processing, and Knowledge Discovery – Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field". In Holzinger, Andreas; Pasi, Gabriella. Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science. Springer. pp. 13–24. doi:10.1007/978-3-642-39146-0_2. ISBN 978-3-642-39146-0. https://semanticscholar.org/paper/6a81bb782a68c72ec26e79463cd2aec1d0cd917c.
  13. "Structure, Models and Meaning: Is "unstructured" data merely unmodeled?". InformationWeek. March 1, 2005.
  14. Malone, Robert (April 5, 2007). "Structuring Unstructured Data". Forbes.
  15. Lin, Cindy Xide; Ding, Bolin; Han, Jiawei; Zhu, Feida; Zhao, Bo (December 2008). "Text Cube: Computing IR Measures for Multidimensional Text Database Analysis". IEEE. doi:10.1109/icdm.2008.135. ISBN 9780769535029.
  16. Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han, Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes" (PDF).
  17. Collier, Nigel; Nazarenko, Adeline; Baud, Robert; Ruch, Patrick (June 2006). "Recent advances in natural language processing for biomedical applications". International Journal of Medical Informatics. 75 (6): 413–417. doi:10.1016/j.ijmedinf.2005.06.008. ISSN 1386-5056. PMID 16139564.
  18. Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.; Greene, Casey S. (January 2016). "Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery". Briefings in Bioinformatics. 17 (1): 33–42. doi:10.1093/bib/bbv087. ISSN 1477-4054. PMC 4719073. PMID 26420781.
  19. Skupin, André; Biberstine, Joseph R.; Börner, Katy (2013). "Visualizing the topical structure of the medical sciences: a self-organizing map approach". PLOS ONE. 8 (3): e58779. doi:10.1371/journal.pone.0058779. ISSN 1932-6203. PMC 3595294. PMID 23554924.
  20. Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna (2015-04-01). "Unsupervised discovery of information structure in biomedical documents". Bioinformatics. 31 (7): 1084–1092. doi:10.1093/bioinformatics/btu758. ISSN 1367-4811. PMID 25411329.
  21. Liem, David A.; Murali, Sanjana; Sigdel, Dibakar; Shi, Yu; Wang, Xuan; Shen, Jiaming; Choi, Howard; Caufield, John H.; Wang, Wei; Ping, Peipei; Han, Jiawei (Oct 1, 2018). "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease". American Journal of Physiology. Heart and Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018. ISSN 1522-1539. PMC 6230912. PMID 29775406.



This page was moved from wikipedia:en:Unstructured data. Its edit history can be viewed at 非结构化数据/edithistory