Difference between revisions of "Information extraction"

Revision as of 17:42, 20 July 2021 (Tuesday)

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can also be seen as information extraction.

Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction of corporate mergers from newswire reports, as denoted by the formal relation:

[math]\mathrm{MergerBetween}(company_1, company_2, date)[/math],

from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
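
The mapping from the example sentence to the MergerBetween relation can be illustrated with a minimal Python sketch. It is not a method described in this article: the regular expression, the list of company suffixes, the `extract_merger` helper, and the rule that "Yesterday" means one day before an assumed publication date are all hypothetical choices made for illustration.

```python
import datetime as dt
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """One instance of the relation MergerBetween(company_1, company_2, date)."""
    company_1: str
    company_2: str
    date: dt.date


# Hypothetical pattern: one or more capitalized words followed by a legal suffix.
COMPANY = r"[A-Z][\w&]*(?: [A-Z][\w&]*)* (?:Inc|Corp|Ltd|LLC)\."
PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) acquisition of (?P<target>{COMPANY})"
)


def extract_merger(sentence: str, published: dt.date) -> Optional[MergerBetween]:
    """Return the merger relation found in the sentence, or None if the pattern fails."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    # Assumption: a leading "Yesterday" refers to the day before publication.
    if sentence.startswith("Yesterday"):
        event_date = published - dt.timedelta(days=1)
    else:
        event_date = published
    return MergerBetween(match["acquirer"], match["target"], event_date)


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
print(extract_merger(sentence, published=dt.date(2021, 7, 20)))
```

A hand-written pattern like this only covers one phrasing, which is one reason IE systems tend to be restricted to narrow domains.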

A broad goal of IE is to allow computation to be done on previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.

Information extraction is part of a larger puzzle: devising automatic methods for text management beyond transmission, storage and display. The discipline of information retrieval (IR)^[1] has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. A complementary approach is natural language processing (NLP), which has addressed the problem of modelling human language processing with considerable success given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks that lie between IR and NLP.

In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differs in the details. For example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or more terrorist acts. For any given IE task we also define a template, which is a case frame (or a set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.
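
The template described above can be pictured as a record with one field per slot. The following is a minimal Python sketch under that reading; the class name `AttackTemplate`, the `unfilled_slots` helper, and the sample values are hypothetical and do not come from any MUC system.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AttackTemplate:
    """Case frame for one terrorist act; every slot starts out empty (None)."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None

    def unfilled_slots(self) -> List[str]:
        """Names of slots for which no data was found in the document."""
        return [name for name, value in asdict(self).items() if value is None]


# An IE system would fill slots from the article text; these values are made up.
template = AttackTemplate(perpetrator="unknown group", weapon="bomb")
print(template.unfilled_slots())
```

A document that reports several acts would be represented by a list of such templates, one per event.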

History

Information extraction dates back to the late 1970s, in the early days of NLP.^[2] An early commercial system from the mid-1980s was JASPER, built for Reuters by the Carnegie Group Inc. with the aim of providing real-time financial news to financial traders.^[3]

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences (MUC). MUC was a competition-based conference^[4] that focused on the following domains:

- MUC-1 (1987), MUC-2 (1989): Naval operations messages.
- MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
- MUC-5 (1993): Joint ventures and microelectronics domain.
- MUC-6 (1995): News articles on management changes.
- MUC-7 (1998): Satellite launch reports.

Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.^[citation needed]

Present significance

The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents^[5] and advocates that more of the content be made available as a web of data.^[6] Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by transforming it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.^[7]

Tasks and subtasks

Applying information extraction to text is linked to the problem of text simplification, in that both aim to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable so that its sentences can be processed. Typical IE tasks and subtasks include:

- Template filling: extracting a fixed set of fields from a document, e.g. extracting perpetrators, victims, time, etc. from a newspaper article about a terrorist attack.
  - Event extraction: given an input document, output zero or more event templates. For instance, a newspaper article might describe multiple terrorist attacks.
- Knowledge base population: filling a database of facts given a set of documents. Typically the database is in the form of triplets, (entity 1, relation, entity 2), e.g. (Barack Obama, Spouse, Michelle Obama).
  - Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, by employing existing knowledge of the domain or information extracted from other sentences.^[8] Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims at detecting entities without any existing knowledge about the entity instances. For example, in processing the sentence "M. Smith likes fishing", named entity detection would mean detecting that the phrase "M. Smith" refers to a person, without necessarily having (or using) any knowledge about a particular M. Smith who is (or might be) the specific person that the sentence is talking about.
  - Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously extracted named entities. For example, "International Business Machines" and "IBM" refer to the same real-world entity. Given the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" refers to the previously detected person "M. Smith".
  - Relationship extraction: identification of relations between entities,^[8] such as:
    - PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
    - PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
- Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
  - Table extraction: finding and extracting tables from documents.^[9]^[10]
  - Table information extraction: extracting information from tables in a structured manner. This is a more complex task than table extraction, which is only the first step: understanding the roles of the cells, rows, and columns, linking the information inside the table, and understanding the information presented in the table are additional tasks necessary for table information extraction.^[11]^[12]^[13]
  - Comments extraction: extracting comments from the actual content of an article in order to restore the link between each sentence and its author.
- Language and vocabulary analysis
  - Terminology extraction: finding the relevant terms for a given corpus.
- Audio extraction
  - Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance,^[14] time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.
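
The relationship extraction and knowledge base population items above can be sketched together: surface patterns turn sentences into (entity 1, relation, entity 2) triples. This is a minimal Python illustration, assuming two hypothetical hand-written patterns and invented relation names (`works_for`, `located_in`). It performs no entity typing, so it does not check that "Bill" is a PERSON or that "IBM" is an ORGANIZATION.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical surface patterns; each maps a phrasing to a relation name.
RELATION_PATTERNS = [
    ("works_for", re.compile(r"(?P<subj>[A-Z]\w+) works for (?P<obj>[A-Z]\w+)")),
    ("located_in", re.compile(r"(?P<subj>[A-Z]\w+) is in (?P<obj>[A-Z]\w+)")),
]


def extract_triples(text: str) -> List[Triple]:
    """Return (entity 1, relation, entity 2) triples matched by the patterns."""
    triples: List[Triple] = []
    for relation, pattern in RELATION_PATTERNS:
        for match in pattern.finditer(text):
            triples.append((match["subj"], relation, match["obj"]))
    return triples


print(extract_triples("Bill works for IBM. Bill is in France."))
```

In a fuller pipeline, named entity recognition and coreference resolution would run first, so that the subject and object of each triple are resolved entities rather than raw strings.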

Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly agreed upon, and that many approaches combine multiple subtasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.

IE on non-text documents is becoming an increasingly interesting topic in research, and information extracted from multimedia documents can now be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

World Wide Web applications

IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people to cope with the enormous amount of data that is available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout formats that are available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.
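
A wrapper in the sense above can be as small as a handful of rules tied to a page's markup. The sketch below uses Python's standard `html.parser` module and assumes a hypothetical catalog layout in which each product is an `<li class="product">` element containing `<span class="name">` and `<span class="price">`. The class names, the `CatalogWrapper` parser, and the sample page are invented for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class CatalogWrapper(HTMLParser):
    """Hand-written wrapper: collects name/price pairs from the assumed layout."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Dict[str, str]] = []
        self._field: Optional[str] = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.records.append({})  # rule 1: each product item starts a record
        elif tag == "span" and css_class in ("name", "price") and self.records:
            self._field = css_class  # rule 2: these spans hold the field values

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def wrap(html: str) -> List[Dict[str, str]]:
    """Apply the wrapper rules to one page and return the extracted records."""
    parser = CatalogWrapper()
    parser.feed(html)
    parser.close()
    return parser.records


SAMPLE_PAGE = (
    '<ul><li class="product"><span class="name">Widget</span> '
    '<span class="price">9.99</span></li>'
    '<li class="product"><span class="name">Gadget</span> '
    '<span class="price">19.50</span></li></ul>'
)
print(wrap(SAMPLE_PAGE))
```

The rules are accurate only for pages that share this layout, which illustrates why wrappers are costly to write by hand and why inducing them automatically is attractive.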

Wrappers typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent efforts on adaptive information extraction motivate the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts.

A recent development is Visual Information Extraction,^[15]^[16] which relies on rendering a webpage in a browser and creating rules based on the proximity of regions in the rendered web page. This helps in extracting entities from complex web pages that may exhibit a visual pattern but lack a discernible pattern in the HTML source code.
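
A proximity rule of the kind described above might look like the following Python sketch. It assumes the rendering step has already produced bounding boxes for text regions (the `Region` class and its pixel coordinates are hypothetical) and implements one invented rule: the value for a label is the nearest region to its right on the same visual line. It is not the rule algebra of the cited systems.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """A rendered text region with its bounding box in page pixels."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def value_right_of(label: str, regions: List[Region], max_gap: float = 50.0) -> Optional[str]:
    """Return the text of the closest region to the right of the label region."""
    anchor = next((r for r in regions if r.text == label), None)
    if anchor is None:
        return None
    candidates = []
    for region in regions:
        if region is anchor:
            continue
        gap = region.x - (anchor.x + anchor.width)  # horizontal distance
        same_line = (region.y < anchor.y + anchor.height
                     and anchor.y < region.y + region.height)
        if 0 <= gap <= max_gap and same_line:
            candidates.append((gap, region))
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1].text


regions = [
    Region("Widget", 10, 50, 60, 12),
    Region("Price:", 10, 100, 40, 12),
    Region("$9.99", 60, 101, 35, 12),
]
print(value_right_of("Price:", regions))
```

The rule never inspects the HTML source, so it keeps working when the same visual layout is produced by different markup.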

Approaches

The following standard approaches are now widely accepted:

- Hand-written regular expressions (or nested groups of regular expressions)
- Using classifiers
  - Generative: naïve Bayes classifier
  - Discriminative: maximum entropy models such as multinomial logistic regression
- Sequence models
  - Recurrent neural network
  - Hidden Markov model
  - Conditional Markov model (CMM) / Maximum-entropy Markov model (MEMM)
  - Conditional random fields (CRF), which are commonly used in conjunction with IE for tasks as varied as extracting information from research papers^[17] and extracting navigation instructions.^[18]
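
As one illustration of the sequence models listed above, the following Python sketch decodes a hidden Markov model with the Viterbi algorithm to tag the tokens of "M. Smith likes fishing" as person (PER) or other (O). All probabilities are invented for the example, unseen words receive a small constant probability instead of proper smoothing, and the function assumes a non-empty token list.

```python
import math
from typing import Dict, List, Optional, Tuple


def viterbi(observations: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            unseen: float = 1e-6) -> List[str]:
    """Most likely state sequence for the observations (log-space Viterbi)."""
    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s].get(observations[0], unseen)), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s].get(obs, unseen)), prev)
        best.append(column)
    # Backtrack from the best final state through the stored predecessors.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model with made-up parameters (not estimated from any corpus).
STATES = ["PER", "O"]
START_P = {"PER": 0.3, "O": 0.7}
TRANS_P = {"PER": {"PER": 0.5, "O": 0.5}, "O": {"PER": 0.1, "O": 0.9}}
EMIT_P = {
    "PER": {"M.": 0.4, "Smith": 0.5, "Bill": 0.1},
    "O": {"likes": 0.4, "fishing": 0.3, "biking": 0.3},
}

tokens = ["M.", "Smith", "likes", "fishing"]
print(list(zip(tokens, viterbi(tokens, STATES, START_P, TRANS_P, EMIT_P))))
```

In practice the parameters would be estimated from annotated text, and discriminative models such as MEMMs or CRFs replace the emission table with feature-based scores.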

Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed.

Free or open source software and services

- General Architecture for Text Engineering (GATE) is bundled with a free information extraction system.
- Apache OpenNLP is a Java machine learning toolkit for natural language processing.
- OpenCalais is an automated information extraction web service from Thomson Reuters (free limited version).
- Machine Learning for Language Toolkit (Mallet) is a Java-based package for a variety of natural language processing tasks, including information extraction.
- DBpedia Spotlight is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution.
- Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
- See also CRF implementations.

References

[1] Freitag, Dayne. "Machine Learning for Information Extraction in Informal Domains" (PDF). Kluwer Academic Publishers, 2000.
[2] Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene B.; Weinstein, Steven P. (1992). "Automatic Extraction of Facts from Press Releases to Generate News Stories". Proceedings of the third conference on Applied natural language processing. pp. 170–177. doi:10.3115/974499.974531. https://www.aclweb.org/anthology/A92-1024.
[3] Cowie, Jim; Wilks, Yorick (1996). Information Extraction. p. 3. http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e844725b872d3f33a35.pdf.
[4] Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008.
[5] "Linked Data - The Story So Far" (PDF).
[6] "Tim Berners-Lee on the next Web".
[7] R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69.
[8] Dat Quoc Nguyen and Karin Verspoor (2019). "End-to-end neural relation extraction using deep biaffine attention". Proceedings of the 41st European Conference on Information Retrieval (ECIR). arXiv:1812.11275. doi:10.1007/978-3-030-15712-8_47.
[9] Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for information extraction from tables in biomedical literature". International Journal on Document Analysis and Recognition (IJDAR). 22 (1): 55–78. arXiv:1902.10031. Bibcode:2019arXiv190210031M. doi:10.1007/s10032-019-00317-0. S2CID 62880746.
[10] Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PDF) (PhD). University of Manchester.
[11] Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for information extraction from tables in biomedical literature". International Journal on Document Analysis and Recognition (IJDAR). 22 (1): 55–78. arXiv:1902.10031. Bibcode:2019arXiv190210031M. doi:10.1007/s10032-019-00317-0. S2CID 62880746.
[12] Milosevic N, Gregson C, Hernandez R, Nenadic G (June 2016). "Disentangling the structure of tables in scientific literature". 21st International Conference on Applications of Natural Language to Information Systems. Lecture Notes in Computer Science. 21: 162–174. doi:10.1007/978-3-319-41754-7_14. ISBN 978-3-319-41753-0.
[13] Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PDF) (PhD). University of Manchester.
[14] A. Zils, F. Pachet, O. Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from Polyphonic Music Signals, Proceedings of WedelMusic, Darmstadt, Germany, 2002.
[15] Chenthamarakshan, Vijil; Desphande, Prasad M; Krishnapuram, Raghu; Varadarajan, Ramakrishnan; Stolze, Knut (2015). "WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Information Extraction". arXiv:1506.08454 [cs.CL].
[16] (Citation unavailable: the source shows only an unrendered citation template.)
[17] Peng, F.; McCallum, A. (2006). "Information extraction from research papers using conditional random fields". Information Processing & Management. 42 (4): 963. doi:10.1016/j.ipm.2005.09.002.
[18] Shimizu, Nobuyuki; Hass, Andrew (2006). "Extracting Frame-based Knowledge Representation from Route Instructions" (PDF). Archived from the original (PDF) on 2006-09-01. Retrieved 2010-03-27.

External links

- Alias-I "competition" page: a listing of academic and industrial toolkits for natural language information extraction.
- Gabor Melli's page on IE: a detailed description of the information extraction task.

This page was moved from wikipedia:en:Information extraction. Its edit history can be viewed at 信息抽取/edithistory

[1] FREITAG, DAYNE. "Machine Learning for Information Extraction in Informal Domains" (PDF). 2000 Kluwer Academic Publishers. Printed in the Netherlands.

[2] Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene B.; Weinstein, Steven P. (1992). "Automatic Extraction of Facts from Press Releases to Generate News Stories". Proceedings of the third conference on Applied natural language processing -. pp. 170–177. doi:10.3115/974499.974531. https://www.aclweb.org/anthology/A92-1024.

[3] Cowie, Jim; Wilks, Yorick (1996). Information Extraction. p. 3. http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e844725b872d3f33a35.pdf.

[4] Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008.
[5] "Linked Data - The Story So Far" (PDF).

[6] "Tim Berners-Lee on the next Web".

[7] R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69.

[8] Dat Quoc Nguyen and Karin Verspoor (2019). "End-to-end neural relation extraction using deep biaffine attention". Proceedings of the 41st European Conference on Information Retrieval (ECIR). arXiv:1812.11275. doi:10.1007/978-3-030-15712-8_47.

[9] Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for information extraction from tables in biomedical literature". International Journal on Document Analysis and Recognition (IJDAR). 22 (1): 55–78. arXiv:1902.10031. Bibcode:2019arXiv190210031M. doi:10.1007/s10032-019-00317-0. S2CID 62880746.

[10] Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PDF) (PhD). University of Manchester.

[11] Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). "A framework for information extraction from tables in biomedical literature". International Journal on Document Analysis and Recognition (IJDAR). 22 (1): 55–78. arXiv:1902.10031. Bibcode:2019arXiv190210031M. doi:10.1007/s10032-019-00317-0. S2CID 62880746.

[12] Milosevic N, Gregson C, Hernandez R, Nenadic G (June 2016). "Disentangling the structure of tables in scientific literature". 21st International Conference on Applications of Natural Language to Information Systems. Lecture Notes in Computer Science. 21: 162–174. doi:10.1007/978-3-319-41754-7_14. ISBN 978-3-319-41753-0.

[13] Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PDF) (PhD). University of Manchester.

[14] A.Zils, F.Pachet, O.Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from Polyphonic Music Signals, Proceedings of WedelMusic, Darmstadt, Germany, 2002.

[15] Chenthamarakshan, Vijil; Desphande, Prasad M; Krishnapuram, Raghu; Varadarajan, Ramakrishnan; Stolze, Knut (2015). "WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Information Extraction". arXiv:1506.08454 [cs.CL].

[16] 模板:Cite document

[17] Peng, F.; McCallum, A. (2006). "Information extraction from research papers using conditional random fields☆". Information Processing & Management. 42 (4): 963. doi:10.1016/j.ipm.2005.09.002.

[18] Shimizu, Nobuyuki; Hass, Andrew (2006). "Extracting Frame-based Knowledge Representation from Route Instructions" (PDF). Archived from the original (PDF) on 2006-09-01. Retrieved 2010-03-27.

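@@ Line 72: / Line 72: @@
 The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents and advocates that more of the content be made available as a web of data. Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by transforming it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted. (R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69.)
-The current importance of IE relates to the growing amount of unstructured information. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents and advocates that more content be made available as a web of data. Until then, the web consists largely of unstructured documents lacking semantic metadata. Knowledge contained in these documents can be made more accessible to machine processing by transforming it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to turn unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information. (R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69.)
+The present significance of IE lies in the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents and advocates that more content be made available as a web of data. Until then, the web consists largely of unstructured documents lacking semantic metadata. Knowledge contained in these documents can be made more accessible to machine processing by transforming it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to turn unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information. (R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69.)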
 ==Tasks and subtasks==
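@@ Line 79: / Line 79: @@
 Applying information extraction to text is linked to the problem of text simplification in order to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable so that its sentences can be processed. Typical IE tasks and subtasks include:
-Tasks and subtasks = = Applying information extraction to text links to the text simplification problem, to create a structured view of free-text information. The overall goal is to create text that is more easily machine-readable so that its sentences can be processed. Typical IE tasks and subtasks include:
+Applying information extraction to text is linked to the problem of text simplification, in order to create a structured view of the information in free text. The overall goal is to create text that is more easily machine-readable so that its sentences can be processed. Typical IE tasks and subtasks include: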
 * Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators, victims, time, etc. from a newspaper article about a terrorist attack.
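Template filling can be sketched with a single hand-written pattern. The example below is a minimal illustration, not a method taken from any cited system: it fills a merger template from the sample sentence in the article's introduction, and the `MergerTemplate` fields, the `COMPANY` pattern and the `fill_merger_template` helper are all hypothetical names chosen for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergerTemplate:
    """Hypothetical template with a fixed set of fields."""
    acquirer: str
    target: str
    date: str


# Assumption: a company name is one or more capitalized words followed by
# a legal suffix such as "Inc." or "Corp.".
COMPANY = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* (?:Inc|Corp|Ltd)\."

# Assumption: the sentence opens with a relative date word and uses the
# phrase "announced their/its acquisition of".
PATTERN = re.compile(
    r"^(?P<date>Yesterday|Today),.*?(?P<acquirer>" + COMPANY + r")"
    r" announced (?:their|its) acquisition of (?P<target>" + COMPANY + r")"
)


def fill_merger_template(sentence: str) -> Optional[MergerTemplate]:
    """Return a filled template, or None when the pattern does not apply."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return MergerTemplate(
        acquirer=match.group("acquirer"),
        target=match.group("target"),
        date=match.group("date"),
    )


sentence = ("Yesterday, New York based Foo Inc. announced "
            "their acquisition of Bar Corp.")
print(fill_merger_template(sentence))
```

A pattern this narrow only covers one phrasing, which is one reason practical systems combine many rules or learn them from annotated text.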
@@ Line 161: / Line 161: @@
 IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people to cope with the enormous amount of data that is available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout formats that are available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.
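-= = World Wide Web applications = = IE has been the focus of the MUC conferences. However, with the spread of the Internet, there is a greater need to develop IE systems that help people cope with the large amount of data available online. Systems that perform IE on online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet these criteria. Moreover, linguistic analysis performed on unstructured text does not exploit the HTML/XML tags and layout formats available in online text. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.
+IE has become the focus of the MUC conferences. However, with the spread of the Internet, there is a greater need to develop IE systems that help people cope with the large amount of data available online. Systems that perform IE on online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet these criteria. Moreover, linguistic analysis performed on unstructured text does not exploit the HTML/XML tags and layout formats available in online text. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.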
 ''Wrappers'' typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on ''adaptive information extraction'' motivates the development of IE systems that can handle different types of text, from well-structured to almost free text -where common wrappers fail- including mixed types. Such systems can exploit shallow natural language knowledge and thus can be also applied to less structured texts.
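To make the idea of a wrapper concrete, here is a minimal hand-written sketch using Python's standard `html.parser` module. It assumes a hypothetical product catalog whose rows look like `<tr><td class="name">…</td><td class="price">…</td></tr>`; the class names, the `CatalogWrapper` class and the `extract_catalog` helper are illustrative assumptions, not part of any system described in the article.

```python
from html.parser import HTMLParser


class CatalogWrapper(HTMLParser):
    """Hypothetical wrapper: one rule set for one assumed page layout."""

    FIELDS = ("name", "price")

    def __init__(self):
        super().__init__()
        self.records = []   # extracted (name, price) tuples
        self._field = None  # field currently being read, if any
        self._row = {}      # fields collected for the current <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td":
            cls = dict(attrs).get("class")
            if cls in self.FIELDS:
                self._field = cls

    def handle_data(self, data):
        # Only text inside a recognized <td> is kept.
        if self._field is not None:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and all(f in self._row for f in self.FIELDS):
            self.records.append((self._row["name"], self._row["price"]))


def extract_catalog(html):
    """Run the wrapper over one page and return its records."""
    wrapper = CatalogWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


page = (
    '<table>'
    '<tr><td class="name">Widget</td><td class="price">9.99</td></tr>'
    '<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>'
    '</table>'
)
print(extract_catalog(page))
print(extract_catalog("<p>Widget now costs 9.99 in most shops.</p>"))
```

The second call returns an empty list: the rules depend entirely on the page layout, which illustrates why wrappers of this kind break down on less structured text.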
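@@ Line 198: / Line 198: @@
 ** Conditional random fields (CRF) are commonly used in conjunction with IE for tasks as varied as extracting information from research papers to extracting navigation instructions.
-= = = = = The following standard approaches are now widely accepted:
+The following standard approaches are now widely accepted:
 * Hand-written regular expressions (or nested groups of regular expressions)
 * Using classifiers
 *
-* Generative Bayes
+* Generative: naive Bayes classifier
-*
-* Discriminative: maximum entropy models, such as the multinomial logit model Sequence models
-*
-* Recurrent neural network Hidden Markov model
 *
+* Discriminative: maximum entropy models, such as the multinomial logit model
+* Sequence models
 *
+* Recurrent neural network
 *
+* Hidden Markov model
 *
 * Conditional Markov model (CMM) / maximum-entropy Markov model (MEMM)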
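As an illustration of the sequence-model family listed above, the following sketch decodes a toy hidden Markov model with the Viterbi algorithm to tag tokens as part of an organization name (`ORG`) or not (`O`). Everything numeric here is an assumption: the two states, the three coarse observation classes and all probabilities are hand-picked for the example rather than estimated from data.

```python
import math

STATES = ("O", "ORG")

# Hand-picked (assumed) model parameters; a real tagger would estimate them.
START = {"O": 0.8, "ORG": 0.2}
TRANS = {
    "O": {"O": 0.8, "ORG": 0.2},
    "ORG": {"O": 0.4, "ORG": 0.6},
}
EMIT = {
    "O": {"cap": 0.2, "low": 0.79, "suffix": 0.01},
    "ORG": {"cap": 0.5, "low": 0.01, "suffix": 0.49},
}


def observe(token):
    """Map a token to a coarse observation class (an assumed feature set)."""
    if token in ("Inc.", "Corp."):
        return "suffix"
    return "cap" if token[:1].isupper() else "low"


def viterbi(tokens):
    """Return the most probable state sequence for the tokens."""
    if not tokens:
        return []
    obs = [observe(t) for t in tokens]
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one back-pointer table per transition
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi(["Foo", "Inc.", "announced", "acquisition"]))
```

With these parameters the tokens "Foo" and "Inc." are tagged `ORG` and the remaining tokens `O`. Discriminative sequence models such as MEMMs and CRFs replace the generative emission and transition tables with feature-based scores but are typically decoded in a similar dynamic-programming fashion.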
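@@ Line 239: / Line 239: @@
 * See also CRF implementations
-= = = Free or open source software and services = =
 * General Architecture for Text Engineering (GATE) is bundled with a free information extraction system
 * Apache OpenNLP is a Java machine learning toolkit for natural language processing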
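@@ Line 280: / Line 280: @@
 * Data extraction
-Ontology extraction Applications of artificial intelligence Concept mining U.S. Defense Advanced Research Projects Agency TIPSTER Program Enterprise search Faceted search Knowledge extraction Named entity recognition Madman Semantic translation Text mining Web scraping Open information extraction Data extraction
+* Ontology extraction
+* Applications of artificial intelligence
+* Concept mining
+* DARPA TIPSTER Program
+* Enterprise search
+* Faceted search
+* Knowledge extraction
+* Named entity recognition
+* Nutch
+* Semantic translation
+* Text mining
+* Web scraping
+* Open information extraction
+* Data extraction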
 ; Lists
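@@ Line 305: / Line 319: @@
 * Gabor Melli's page on IE: detailed description of the information extraction task.
-= = = External links = =
 * Alias-I "competition" page: a listing of academic and industrial toolkits for natural language information extraction. Detailed description of the information extraction task.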