Line 72: |
Line 72: |
| The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents and advocates that more of the content be made available as a web of data. Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form, or by marking up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69. | | The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents and advocates that more of the content be made available as a web of data. Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form, or by marking up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69. |
| | | |
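− | The current significance of IE relates to the ever-growing amount of unstructured information. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as a web of documents and advocates that more content be made available as a web of data. Until then, the web largely consists of unstructured documents lacking semantic metadata. The knowledge contained in these documents can be made more accessible for machine processing by converting it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed needs IE to turn unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in natural language and populate a database with the extracted information. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69. | + | The current significance of IE lies in the ever-growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as a web of documents and advocates that more content be made available as a web of data. Until then, the web largely consists of unstructured documents lacking semantic metadata. The knowledge contained in these documents can be made more accessible for machine processing by converting it into relational form or by marking it up with XML tags. An intelligent agent monitoring a news data feed needs IE to turn unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in natural language and populate a database with the extracted information. R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine", Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33-69. |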
| | | |
| ==Tasks and subtasks== | | ==Tasks and subtasks== |
Line 79: |
Line 79: |
| Applying information extraction to text is linked to the problem of text simplification: both aim to create a structured view of the information present in free text. The overall goal is to produce text that is more easily machine-readable so that its sentences can be processed. Typical IE tasks and subtasks include: | | Applying information extraction to text is linked to the problem of text simplification: both aim to create a structured view of the information present in free text. The overall goal is to produce text that is more easily machine-readable so that its sentences can be processed. Typical IE tasks and subtasks include: |
| | | |
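− | Tasks and subtasks = = Applying information extraction to text links to the text simplification problem, in order to create a structured view of free-text information. The overall goal is to create text that is more easily machine-readable so that the sentences can be processed. Typical IE tasks and subtasks include:
| + | Applying information extraction to text is tied to the problem of text simplification, in order to create a structured view of free-text information. The overall goal is to create text that is more easily machine-readable so that the sentences can be processed. Typical IE tasks and subtasks include: |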
| | | |
| * Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators, victims, time, etc. from a newspaper article about a terrorist attack. | | * Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators, victims, time, etc. from a newspaper article about a terrorist attack. |
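As a rough illustration of template filling, the sketch below fills a fixed set of fields from a short news-style text using hand-written regular expressions. The field names, the patterns, and the sample text are all assumptions made for this example; they are not taken from any particular IE system, and real extractors need far more robust patterns.

```python
import re

# Hypothetical patterns for a toy "attack report" template.
# Each pattern captures one field in the named group "value".
FIELD_PATTERNS = {
    "perpetrator": re.compile(r"(?P<value>[A-Z][\w ]+?) claimed responsibility"),
    "victims": re.compile(r"(?P<value>\d+) people were (?:killed|injured)"),
    "time": re.compile(r"on (?P<value>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)"),
}


def fill_template(text):
    """Return one entry per field; the value is None when no pattern matches."""
    template = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        template[field] = match.group("value") if match else None
    return template


sample = ("12 people were injured in the capital on Monday. "
          "The Example Front claimed responsibility.")
filled = fill_template(sample)
print(filled)
```

Fields with no matching pattern stay `None`, which keeps the template shape fixed regardless of what the document contains.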
Line 161: |
Line 161: |
| IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people to cope with the enormous amount of data that is available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout formats that are available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically. | | IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people to cope with the enormous amount of data that is available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout formats that are available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically. |
| | | |
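− | = = World Wide Web applications = = IE has long been the focus of the MUC conferences. However, with the spread of the Web, there is a greater need to develop IE systems that help people cope with the large amount of data available online. Systems that perform IE on online text should meet the requirements of low cost, flexible development, and easy adaptation to new domains. MUC systems do not meet these criteria. Moreover, linguistic analysis performed on unstructured text does not exploit the HTML/XML tags and layout formats available in online text. As a result, less linguistically intensive approaches have been developed for IE using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task that requires a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce these rules automatically.
| + | IE has become the focus of the MUC conferences. However, with the spread of the Web, there is a greater need to develop IE systems that help people cope with the large amount of data available online. Systems that perform IE on online text should meet the requirements of low cost, flexible development, and easy adaptation to new domains. MUC systems do not meet these criteria. Moreover, linguistic analysis performed on unstructured text does not exploit the HTML/XML tags and layout formats available in online text. As a result, less linguistically intensive approaches have been developed for IE using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task that requires a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce these rules automatically. |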
| | | |
| ''Wrappers'' typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent work on ''adaptive information extraction'' motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts. | | ''Wrappers'' typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent work on ''adaptive information extraction'' motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts. |
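To make the idea of a wrapper concrete, here is a minimal sketch of a hand-written one for a product-catalog page, built on Python's standard `html.parser` module. The page layout, the `name` and `price` class names, and the `CatalogWrapper` class are assumptions invented for this example; a real wrapper is tied to the markup of one specific site.

```python
from html.parser import HTMLParser


class CatalogWrapper(HTMLParser):
    """Toy wrapper: reads <td class="name"> and <td class="price"> cells,
    producing one record per <tr>. The class names are assumed, not standard."""

    def __init__(self):
        super().__init__()
        self.records = []   # completed rows
        self._row = None    # row being built, or None outside <tr>
        self._field = None  # field currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            css_class = dict(attrs).get("class")
            if css_class in ("name", "price"):
                self._field = css_class

    def handle_data(self, data):
        if self._field is not None and self._row is not None:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            if self._row:
                self.records.append(self._row)
            self._row = None
            self._field = None


page = """
<table>
  <tr><td class="name">Kettle</td><td class="price">19.99</td></tr>
  <tr><td class="name">Toaster</td><td class="price">24.50</td></tr>
</table>
"""
wrapper = CatalogWrapper()
wrapper.feed(page)
print(wrapper.records)

# The same rules extract nothing from less structured text.
free_text = CatalogWrapper()
free_text.feed("<p>The kettle now costs 19.99.</p>")
print(free_text.records)
```

The second call shows the limitation described above: the rules depend entirely on the page structure, so the same facts expressed as free text yield no records.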
Line 198: |
Line 198: |
| ** Conditional random fields (CRF) are commonly used in conjunction with IE for tasks ranging from extracting information from research papers to extracting navigation instructions. | | ** Conditional random fields (CRF) are commonly used in conjunction with IE for tasks ranging from extracting information from research papers to extracting navigation instructions. |
| | | |
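− | = = = = = The following standard approaches are now widely accepted:
| + | The following standard approaches are now widely accepted: |
| * Hand-written regular expressions (or nested groups of regular expressions) | | * Hand-written regular expressions (or nested groups of regular expressions) |
| * Using classifiers | | * Using classifiers |
| * | | * |
− | * Generative Bayes | + | * Generative: naive Bayes classifier |
− | *
| |
− | * Discriminative: maximum entropy models such as the multinomial logit model Sequence models
| |
− | *
| |
− | * Recurrent neural network Hidden Markov model
| |
| * | | * |
| + | * Discriminative: maximum entropy models such as the multinomial logit model |
| + | * Sequence models |
| * | | * |
| + | * Recurrent neural network |
| * | | * |
| + | * Hidden Markov model |
| * | | * |
| * Conditional Markov model (CMM) / Maximum-entropy Markov model (MEMM) | | * Conditional Markov model (CMM) / Maximum-entropy Markov model (MEMM) |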
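Of the sequence models listed above, the hidden Markov model is the easiest to sketch in a few lines. The example below runs Viterbi decoding to label tokens as part of a name (`NAME`) or not (`O`). All probabilities, the two-state label set, and the capitalization feature are made-up assumptions for illustration; a real tagger would estimate its parameters from annotated text.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []

    def log(p):
        # Floor the probability so that unseen events do not give log(0).
        return math.log(max(p, 1e-12))

    # best[t][s] = (log-probability of the best path ending in s at step t,
    #               previous state on that path)
    first = observations[0]
    best = [{s: (log(start_p[s]) + log(emit_p[s].get(first, 0.0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + log(emit_p[s].get(obs, 0.0)), prev)
        best.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Hypothetical toy model: capitalized tokens are likely to be names.
states = ("O", "NAME")
start_p = {"O": 0.8, "NAME": 0.2}
trans_p = {"O": {"O": 0.8, "NAME": 0.2}, "NAME": {"O": 0.4, "NAME": 0.6}}
emit_p = {"O": {"lower": 0.9, "cap": 0.1}, "NAME": {"lower": 0.1, "cap": 0.9}}

tokens = ["yesterday", "Acme", "Corp", "merged"]
features = ["cap" if token[0].isupper() else "lower" for token in tokens]
labels = viterbi(features, states, start_p, trans_p, emit_p)
print(list(zip(tokens, labels)))
```

Working in log space avoids numeric underflow on long sequences. CMM/MEMM and CRF taggers are typically decoded with the same dynamic program, with locally or globally normalized feature scores in place of the transition and emission probabilities.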
Line 239: |
Line 239: |
| * See also CRF implementations | | * See also CRF implementations |
| | | |
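− | = = = Free or open source software and services = =
| + | |
| * General Architecture for Text Engineering (GATE) is bundled with a free information extraction system | | * General Architecture for Text Engineering (GATE) is bundled with a free information extraction system |
| * Apache OpenNLP is a Java machine learning toolkit for natural language processing | | * Apache OpenNLP is a Java machine learning toolkit for natural language processing |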
第280行: |
Line 280: |
Line 280: |
| | | |
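− | Ontology extraction Applications of artificial intelligence Concept mining US Defense Advanced Research Projects Agency (DARPA) TIPSTER Program Enterprise search Faceted search Knowledge extraction Named entity recognition Nutch Semantic translation Text mining Web scraping Open information extraction Data extraction
| + | |
| + | * Ontology extraction |
| + | * Applications of artificial intelligence |
| + | * Concept mining |
| + | * DARPA TIPSTER Program |
| + | * Enterprise search |
| + | * Faceted search |
| + | * Knowledge extraction |
| + | * Named entity recognition |
| + | * Nutch |
| + | * Semantic translation |
| + | * Text mining |
| + | * Web scraping |
| + | * Open information extraction |
| + | * Data extraction |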
| | | |
| ; Lists | | ; Lists |
Line 305: |
Line 319: |
| * Gabor Melli's page on IE: detailed description of the information extraction task. | | * Gabor Melli's page on IE: detailed description of the information extraction task. |
| | | |
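− | = = = External links = =
| + | |
| * Alias-I "competition" page: a listing of academic and industrial toolkits for natural language information extraction. Detailed description of the information extraction task. | | * Alias-I "competition" page: a listing of academic and industrial toolkits for natural language information extraction. Detailed description of the information extraction task. |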
| | | |