Line 7: |
Line 7: |
| Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction | | Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction |
| | | |
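− | Information extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases, this activity involves processing human-language text by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images/audio/video/documents, can be regarded as information extraction.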
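| + | Information extraction refers to automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other digitized text. In most cases, this activity involves processing human-language text by means of natural language processing (NLP). In addition, some recent research addresses multimedia documents, for example automatic annotation and content extraction from images/audio/video/documents, which can also be regarded as information extraction. |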
| | | |
| Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation: | | Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation: |
Line 19: |
Line 19: |
| :"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp." | | :"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp." |
| | | |
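− | Because of the difficulty of this problem, current IE methods concentrate on narrowly restricted domains. One example is extraction from newswire reports of corporate mergers, for instance from an online news sentence: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."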
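| + | Current information extraction methods concentrate on well-defined knowledge domains. For example, from news reports of corporate mergers one can extract a relation tuple describing the merger: |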
| + | |
| + | \mathrm{MergerBetween}(company_1, company_2, date), |
| + | |
| + | From the unstructured text: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp." |
| + | |
| + | The extracted structured relation tuple is: \mathrm{MergerBetween}(Foo Inc, Bar Corp, yesterday). |
| | | |
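The MergerBetween relation above can be illustrated with a minimal sketch. It assumes a single hand-written pattern that fits only the shape of the example sentence; the names `MergerBetween`, `MERGER_PATTERN`, and `extract_merger` are hypothetical and introduced here for illustration, and real IE systems rely on far more robust linguistic analysis than one regular expression.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Structured relation tuple: MergerBetween(company_1, company_2, date)."""
    company_1: str
    company_2: str
    date: str


# Hypothetical pattern: it only covers sentences shaped like
# "<Date>, [<Place> based] <Company> announced their acquisition of <Company>."
MERGER_PATTERN = re.compile(
    r"^(?P<date>\w+), "
    r"(?:.+? based )?"
    r"(?P<company_1>.+?) announced (?:their|its) acquisition of "
    r"(?P<company_2>.+?)\.?$"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return the relation tuple if the sentence fits the pattern, else None."""
    match = MERGER_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return MergerBetween(
        company_1=match.group("company_1").rstrip("."),  # "Foo Inc." -> "Foo Inc"
        company_2=match.group("company_2").rstrip("."),
        date=match.group("date").lower(),                # "Yesterday" -> "yesterday"
    )


if __name__ == "__main__":
    text = ("Yesterday, New York based Foo Inc. announced "
            "their acquisition of Bar Corp.")
    print(extract_merger(text))
```

The pattern is deliberately narrow, which mirrors the point made in the text: an extractor of this kind works only inside a tightly restricted domain and sentence shape.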
| A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow [[logical reasoning]] to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and [[context (language use)|context]]. | | A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow [[logical reasoning]] to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and [[context (language use)|context]]. |
Line 25: |
Line 31: |
| A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context. | | A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context. |
| | | |
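− | A broad goal of IE is to allow computation on previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.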
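| + | Broadly speaking, one goal of information extraction is to lay the groundwork for downstream tasks that compute over unstructured data. More specifically, the goal is to allow logical inference (such as relation prediction) based on the logical content of the input data. Structured data is semantically well-defined data from a chosen knowledge domain that can be interpreted with respect to category and context. |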
| | | |
| Information Extraction is the part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of [[information retrieval]] (IR)<ref>{{Cite journal|url = http://www.cs.bilkent.edu.tr/~guvenir/courses/CS550/Seminar/freitag2000-ml.pdf|title = Machine Learning for Information Extraction in Informal Domains|last = FREITAG|first = DAYNE|journal = 2000 Kluwer Academic Publishers. Printed in the Netherlands}}</ref> has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of [[natural language processing]] (NLP) which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between both IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details. An example, consider a group of newswire articles on Latin American terrorism with each article presumed to be based upon one or more terroristic acts. We also define for any given IE task a template, which is a(or a set of) case frame(s) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template. 
| | Information Extraction is the part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of [[information retrieval]] (IR)<ref>{{Cite journal|url = http://www.cs.bilkent.edu.tr/~guvenir/courses/CS550/Seminar/freitag2000-ml.pdf|title = Machine Learning for Information Extraction in Informal Domains|last = FREITAG|first = DAYNE|journal = 2000 Kluwer Academic Publishers. Printed in the Netherlands}}</ref> has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of [[natural language processing]] (NLP) which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between both IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details. An example, consider a group of newswire articles on Latin American terrorism with each article presumed to be based upon one or more terroristic acts. We also define for any given IE task a template, which is a(or a set of) case frame(s) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template. |
Line 31: |
Line 37: |
| Information Extraction is the part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. Another complementary approach is that of natural language processing (NLP) which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between both IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details. An example, consider a group of newswire articles on Latin American terrorism with each article presumed to be based upon one or more terroristic acts. We also define for any given IE task a template, which is a(or a set of) case frame(s) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template. | | Information Extraction is the part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of information retrieval (IR) has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. 
Another complementary approach is that of natural language processing (NLP) which has solved the problem of modelling human language processing with considerable success when taking into account the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between both IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differing in the details. An example, consider a group of newswire articles on Latin American terrorism with each article presumed to be based upon one or more terroristic acts. We also define for any given IE task a template, which is a(or a set of) case frame(s) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template. |
| | | |
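− | Information extraction is part of a larger puzzle: devising automatic methods for text management that go beyond its transmission, storage, and display. The discipline of information retrieval has developed automatic methods, typically statistical, for indexing large document collections and classifying documents. Another complementary approach is natural language processing (NLP), which has addressed the problem of modelling human language processing with considerable success given the magnitude of the task. In terms of both difficulty and emphasis, IE handles tasks that lie between IR and NLP. In terms of input, IE assumes a set of documents in which each document follows a template, that is, it describes one or more entities or events in a manner similar to other documents but differing in the details. For example, consider a set of newswire articles on Latin American terrorism, each presumed to be based on one or more terrorist acts. We also define a template for any given IE task, which is a case frame (or a set of case frames) holding the information contained in a single document. For the terrorism example, a template would have slots for the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem needs to "understand" an attack article only enough to find the data corresponding to the slots in this template.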
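| + | Information extraction is a relatively upstream task. It is concerned with devising automatic methods for text management that are no longer limited to the transmission, storage, and display of text. The discipline of information retrieval has developed automatic methods, typically statistical, for indexing large document collections and classifying documents. Another complementary approach is natural language processing (NLP), which has addressed the problem of modelling human language processing and achieved considerable success on large-scale tasks. In terms of both difficulty and emphasis, information extraction (IE) handles tasks that lie between information retrieval (IR) and NLP. The input to an IE task is assumed to be a set of documents in which each document follows a template, that is, it describes one or more entities or events in a manner similar to other documents but differing in the details. For example, consider a set of newswire articles on Latin American terrorism, each presumed to be based on one or more terrorist acts. We also define a template for any given IE task, which is a case frame (or a set of case frames) holding the information contained in a single document. For the terrorism example, a template would have slots for the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem needs to "understand" an attack article only enough to find the data corresponding to the slots in this template. |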
| | | |
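The template described above (slots for perpetrator, victim, weapon, and date) can be sketched as a simple data structure. This is only an illustration under stated assumptions: the class `AttackTemplate`, the function `fill_template`, and the idea that an upstream step already yields `(slot, text)` candidate pairs are hypothetical, and the sample values in the usage lines are made up.

```python
from dataclasses import dataclass, fields
from typing import Iterable, Optional, Tuple


@dataclass
class AttackTemplate:
    """Case frame for one document; every slot starts out empty."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None


def fill_template(candidates: Iterable[Tuple[str, str]]) -> AttackTemplate:
    """Fill slots from (slot_name, text) pairs produced by an assumed upstream step.

    Pairs naming a slot the template does not have are ignored, and the first
    candidate seen for a slot wins. The system only needs enough
    "understanding" to populate these four fields.
    """
    template = AttackTemplate()
    slot_names = {f.name for f in fields(AttackTemplate)}
    for slot, text in candidates:
        if slot in slot_names and getattr(template, slot) is None:
            setattr(template, slot, text)
    return template


if __name__ == "__main__":
    # Made-up candidate spans, for illustration only.
    spans = [("weapon", "car bomb"), ("date", "12 March"), ("location", "Lima")]
    print(fill_template(spans))
```

Keeping unfilled slots as `None` makes it explicit which parts of the template a given article did not supply.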
| ==History== | | ==History== |