第34行: |
第34行: |
| 信息抽取可以追溯到20世纪70年代末 NLP 的早期<ref name=":1" /> 。一个早期的商业系统是20世纪80年代中期由卡内基集团公司为路透社建立的 JASPER 系统,其目的是为金融交易员提供实时财经新闻。<ref name=":2" /> | | 信息抽取可以追溯到20世纪70年代末 NLP 的早期<ref name=":1" /> 。一个早期的商业系统是20世纪80年代中期由卡内基集团公司为路透社建立的 JASPER 系统,其目的是为金融交易员提供实时财经新闻。<ref name=":2" /> |
| | | |
− | Beginning in 1987, IE was spurred by a series of [[Message Understanding Conference]]s. MUC is a competition-based conference<ref>Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008. {{ISBN|978-1-84564-146-7}}</ref> that focused on the following domains: | + | Beginning in 1987, IE was spurred by a series of [[Message Understanding Conference]]s. MUC is a competition-based conference<ref name=":5">Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008. {{ISBN|978-1-84564-146-7}}</ref> that focused on the following domains: |
| *MUC-1 (1987), MUC-2 (1989): Naval operations messages. | | *MUC-1 (1987), MUC-2 (1989): Naval operations messages. |
| *MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries. | | *MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries. |
第41行: |
第41行: |
| *MUC-7 (1998): Satellite launch reports. | | *MUC-7 (1998): Satellite launch reports. |
| | | |
− | 从1987年开始,一系列信息理解会议加速着信息抽取任务的发展。MUC是一个基于竞赛的会议,其主要关注以下领域:
| + | 从1987年开始,一系列消息理解会议(MUC)推动了信息抽取任务的发展。MUC 是一个基于竞赛的会议<ref name=":5" /> ,主要关注以下领域: |
| * MUC-1(1987)、MUC-2(1989): 海军行动消息。 | | * MUC-1(1987)、MUC-2(1989): 海军行动消息。 |
| * MUC-3(1991)、MUC-4(1992): 拉丁美洲国家的恐怖主义。 | | * MUC-3(1991)、MUC-4(1992): 拉丁美洲国家的恐怖主义。 |
第53行: |
第53行: |
| | | |
| ==重要性== | | ==重要性== |
− | The present significance of IE pertains to the growing amount of information available in unstructured form. [[Tim Berners-Lee]], inventor of the [[world wide web]], refers to the existing [[Internet]] as the web of ''documents'' <ref>{{cite web|url=http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf|title=Linked Data - The Story So Far}}</ref> and advocates that more of the content be made available as a [[semantic web|web of ''data'']].<ref>{{cite web|url=http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html|title=Tim Berners-Lee on the next Web}}</ref> Until this transpires, the web largely consists of unstructured documents lacking semantic [[metadata]]. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into [[relational database|relational form]], or by marking-up with [[XML]] tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a [[natural language]] and populate a database with the information extracted.<ref>[[Rohini Kesavan Srihari|R. K. Srihari]], W. Li, C. Niu and T. Cornell,"InfoXtract: A Customizable Intermediate Level Information Extraction Engine",[https://web.archive.org/web/20080507153920/http://journals.cambridge.org/action/displayIssue?iid=359643 Journal of Natural Language Engineering],{{dead link|date=September 2020}} Cambridge U. Press, 14(1), 2008, pp.33-69.</ref> | + | The present significance of IE pertains to the growing amount of information available in unstructured form. [[Tim Berners-Lee]], inventor of the [[world wide web]], refers to the existing [[Internet]] as the web of ''documents'' <ref name=":6">{{cite web|url=http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf|title=Linked Data - The Story So Far}}</ref> and advocates that more of the content be made available as a [[semantic web|web of ''data'']].<ref name=":7">{{cite web|url=http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html|title=Tim Berners-Lee on the next Web}}</ref> Until this transpires, the web largely consists of unstructured documents lacking semantic [[metadata]]. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into [[relational database|relational form]], or by marking-up with [[XML]] tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a [[natural language]] and populate a database with the information extracted.<ref name=":8">[[Rohini Kesavan Srihari|R. K. Srihari]], W. Li, C. Niu and T. Cornell,"InfoXtract: A Customizable Intermediate Level Information Extraction Engine",[https://web.archive.org/web/20080507153920/http://journals.cambridge.org/action/displayIssue?iid=359643 Journal of Natural Language Engineering],{{dead link|date=September 2020}} Cambridge U. Press, 14(1), 2008, pp.33-69.</ref> |
| | | |
− | 在非结构化信息日益增多的时代,信息抽取的意义也愈发重大。万维网的发明者 Tim Berners-Lee 将现有的互联网称为文档网络,并主张让更多的内容以数据网络的形式提供。在此之前,网络大部分由缺乏语义元数据的非结构化文档组成。这些文档中包含的知识,可以通过转换为关系形式或使用 XML 标记,变得更便于机器处理和访问。一个监控新闻数据源的智能体需要借助信息抽取,把非结构化数据转换为可供推理的结构化信息。信息抽取的一个典型应用是扫描一组用自然语言编写的文档,并用提取出的信息填充数据库。 | + | 在非结构化信息日益增多的时代,信息抽取的意义也愈发重大。万维网的发明者 Tim Berners-Lee 将现有的互联网称为文档网络 <ref name=":6" /> ,并主张让更多的内容以数据网络的形式提供<ref name=":7" />。在此之前,网络大部分由缺乏语义元数据的非结构化文档组成。这些文档中包含的知识,可以通过转换为关系形式或使用 XML 标记,变得更便于机器处理和访问。一个监控新闻数据源的智能体需要借助信息抽取,把非结构化数据转换为可供推理的结构化信息。信息抽取的一个典型应用是扫描一组用自然语言编写的文档,并用提取出的信息填充数据库<ref name=":8" />。 |
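下面给出一个极简的示意代码(并非上文所述任何系统的真实实现),用来说明“扫描一组自然语言文档并用抽取到的信息填充数据库”这一典型应用:其中 docs/ 目录、“acquired”正则模式和 facts 表都是本示例假设的名称;真实系统通常会用完整的 NLP 流水线取代这里的单一正则。

<syntaxhighlight lang="python">
# -*- coding: utf-8 -*-
"""极简示意:扫描一批自然语言文档,把抽取到的事实填入关系数据库。
注意:docs/ 目录、"acquired" 正则模式和 facts 表均为本示例虚构,仅用于说明流程。"""
import re
import sqlite3
from pathlib import Path

# 一个非常简化的"抽取器":匹配 "X acquired Y" 这类句式
PATTERN = re.compile(r"([A-Z][\w&. ]+?) acquired ([A-Z][\w&. ]+?)[.,]")


def extract_facts(text):
    """从一段文本中产出 (实体1, 关系, 实体2) 三元组。"""
    for m in PATTERN.finditer(text):
        yield (m.group(1).strip(), "acquired", m.group(2).strip())


def populate_database(doc_dir="docs", db_path="facts.db"):
    """扫描 doc_dir 下的 .txt 文档,把抽取结果写入 SQLite 数据库。"""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (subject TEXT, relation TEXT, object TEXT)"
    )
    for path in Path(doc_dir).glob("*.txt"):      # 逐个扫描文档
        for fact in extract_facts(path.read_text(encoding="utf-8")):
            conn.execute("INSERT INTO facts VALUES (?, ?, ?)", fact)
    conn.commit()
    conn.close()


if __name__ == "__main__":
    populate_database()
</syntaxhighlight>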
| | | |
| ==任务与子任务== | | ==任务与子任务== |
第71行: |
第71行: |
| *** PERSON located in LOCATION (extracted from the sentence "Bill is in France.") | | *** PERSON located in LOCATION (extracted from the sentence "Bill is in France.") |
| * Semi-structured information extraction which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as: | | * Semi-structured information extraction which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as: |
− | ** Table extraction: finding and extracting tables from documents.<ref>{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = A framework for information extraction from tables in biomedical literature | journal = International Journal on Document Analysis and Recognition (IJDAR) | volume = 22 | issue = 1 | pages = 55–78 | date = February 2019 | doi = 10.1007/s10032-019-00317-0 | arxiv = 1902.10031 | bibcode = 2019arXiv190210031M | s2cid = 62880746 }}</ref><ref>{{cite thesis |type=PhD |last=Milosevic |first=Nikola |date=2018 |title=A multi-layered approach to information extraction from tables in biomedical documents |publisher=University of Manchester | url=https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF}}</ref> | + | ** Table extraction: finding and extracting tables from documents.<ref name=":9">{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = A framework for information extraction from tables in biomedical literature | journal = International Journal on Document Analysis and Recognition (IJDAR) | volume = 22 | issue = 1 | pages = 55–78 | date = February 2019 | doi = 10.1007/s10032-019-00317-0 | arxiv = 1902.10031 | bibcode = 2019arXiv190210031M | s2cid = 62880746 }}</ref><ref name=":10">{{cite thesis |type=PhD |last=Milosevic |first=Nikola |date=2018 |title=A multi-layered approach to information extraction from tables in biomedical documents |publisher=University of Manchester | url=https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF}}</ref> |
| ** Table information extraction : extracting information in structured manner from the tables. This is more complex task than table extraction, as table extraction is only the first step, while understanding the roles of the cells, rows, columns, linking the information inside the table and understanding the information presented in the table are additional tasks necessary for table information extraction. <ref>{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = A framework for information extraction from tables in biomedical literature | journal = International Journal on Document Analysis and Recognition (IJDAR) | volume = 22 | issue = 1 | pages = 55–78 | date = February 2019 | doi = 10.1007/s10032-019-00317-0 | arxiv = 1902.10031 | bibcode = 2019arXiv190210031M | s2cid = 62880746 }}</ref><ref>{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = Disentangling the structure of tables in scientific literature | journal = 21st International Conference on Applications of Natural Language to Information Systems | series = Lecture Notes in Computer Science | volume = 21 | date = June 2016 | pages = 162–174 | doi = 10.1007/978-3-319-41754-7_14 | isbn = 978-3-319-41753-0 | url = https://www.research.manchester.ac.uk/portal/en/publications/disentangling-the-structure-of-tables-in-scientific-literature(473111c2-52e9-493a-be8c-1a78c5b7ce36).html }}</ref><ref>{{cite thesis |type=PhD |last=Milosevic |first=Nikola |date=2018 |title=A multi-layered approach to information extraction from tables in biomedical documents |publisher=University of Manchester | url=https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF}}</ref> | | ** Table information extraction : extracting information in structured manner from the tables. This is more complex task than table extraction, as table extraction is only the first step, while understanding the roles of the cells, rows, columns, linking the information inside the table and understanding the information presented in the table are additional tasks necessary for table information extraction. <ref>{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = A framework for information extraction from tables in biomedical literature | journal = International Journal on Document Analysis and Recognition (IJDAR) | volume = 22 | issue = 1 | pages = 55–78 | date = February 2019 | doi = 10.1007/s10032-019-00317-0 | arxiv = 1902.10031 | bibcode = 2019arXiv190210031M | s2cid = 62880746 }}</ref><ref>{{cite journal | vauthors = Milosevic N, Gregson C, Hernandez R, Nenadic G | title = Disentangling the structure of tables in scientific literature | journal = 21st International Conference on Applications of Natural Language to Information Systems | series = Lecture Notes in Computer Science | volume = 21 | date = June 2016 | pages = 162–174 | doi = 10.1007/978-3-319-41754-7_14 | isbn = 978-3-319-41753-0 | url = https://www.research.manchester.ac.uk/portal/en/publications/disentangling-the-structure-of-tables-in-scientific-literature(473111c2-52e9-493a-be8c-1a78c5b7ce36).html }}</ref><ref>{{cite thesis |type=PhD |last=Milosevic |first=Nikola |date=2018 |title=A multi-layered approach to information extraction from tables in biomedical documents |publisher=University of Manchester | url=https://www.research.manchester.ac.uk/portal/files/70405100/FULL_TEXT.PDF}}</ref> |
| ** Comments extraction : extracting comments from actual content of article in order to restore the link between author of each sentence | | ** Comments extraction : extracting comments from actual content of article in order to restore the link between author of each sentence |
第77行: |
第77行: |
| **[[Terminology extraction]]: finding the relevant terms for a given [[text corpus|corpus]] | | **[[Terminology extraction]]: finding the relevant terms for a given [[text corpus|corpus]] |
| * Audio extraction | | * Audio extraction |
− | ** Template-based music extraction: finding relevant characteristic in an audio signal taken from a given repertoire; for instance <ref>A.Zils, F.Pachet, O.Delerue and F. Gouyon, [http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf Automatic Extraction of Drum Tracks from Polyphonic Music Signals], Proceedings of WedelMusic, Darmstadt, Germany, 2002.</ref> time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece. | + | ** Template-based music extraction: finding relevant characteristic in an audio signal taken from a given repertoire; for instance <ref name=":11">A.Zils, F.Pachet, O.Delerue and F. Gouyon, [http://www.csl.sony.fr/downloads/papers/2002/ZilsMusic.pdf Automatic Extraction of Drum Tracks from Polyphonic Music Signals], Proceedings of WedelMusic, Darmstadt, Germany, 2002.</ref> time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece. |
| | | |
| * 模板填充: 从文档中提取一组固定的字段,例如从报纸上一篇关于恐怖袭击的文章中提取肇事者、受害者、时间等。 | | * 模板填充: 从文档中提取一组固定的字段,例如从报纸上一篇关于恐怖袭击的文章中提取肇事者、受害者、时间等。 |
第83行: |
第83行: |
| * 事件提取: 给定一个输入文档,输出零个或多个事件模板。例如,一篇报纸文章可能描述了多起恐怖袭击。 | | * 事件提取: 给定一个输入文档,输出零个或多个事件模板。例如,一篇报纸文章可能描述了多起恐怖袭击。 |
| * 知识库填充: 根据给定的一组文档填充事实数据库。数据库通常采用三元组形式,例如(实体1, 关系, 实体2)。 | | * 知识库填充: 根据给定的一组文档填充事实数据库。数据库通常采用三元组形式,例如(实体1, 关系, 实体2)。 |
− | * 命名实体识别: 利用已有的领域知识或从其他句子中提取的信息,识别已知的实体名称(人名和组织名)、地名、时间表达式和某些类型的数字表达式。通常,识别任务还需要为提取出的实体分配一个唯一标识符。一个更简单的任务是命名实体检测,其目的是在不具备任何关于实体实例的已有知识的情况下检测实体。例如,在处理“史密斯先生喜欢钓鱼”一句时,命名实体检测会识别出“史密斯先生”指的是一个人,但不一定知道(或利用)关于某个具体的史密斯先生的任何知识,也就是该句所谈到的(或“可能是”的)那个人。 | + | * 命名实体识别: 利用已有的领域知识或从其他句子中提取的信息,识别已知的实体名称(人名和组织名)、地名、时间表达式和某些类型的数字表达式<ref name="ecir2019" /> 。通常,识别任务还需要为提取出的实体分配一个唯一标识符。一个更简单的任务是命名实体检测,其目的是在不具备任何关于实体实例的已有知识的情况下检测实体。例如,在处理“史密斯先生喜欢钓鱼”一句时,命名实体检测会识别出“史密斯先生”指的是一个人,但不一定知道(或利用)关于某个具体的史密斯先生的任何知识,也就是该句所谈到的(或“可能是”的)那个人(参见本列表之后的示意代码)。 |
| * | | * |
− | * 共指消解: 检测文本实体之间的共指和回指链接。在 IE 任务中,这通常局限于查找此前已提取出的命名实体之间的链接。例如,“International Business Machines”和“IBM”指的是同一个现实实体。又如,对“史密斯先生喜欢钓鱼。但是他不喜欢骑自行车”这两句话,如果能够发现“他”指的是先前检测到的“史密斯先生”,那就更好了。 | + | * 共指消解: 检测文本实体之间的共指和回指链接。在 IE 任务中,这通常局限于查找此前已提取出的命名实体之间的链接。例如,“International Business Machines”和“IBM”指的是同一个现实实体。又如,对“史密斯先生喜欢钓鱼。但是他不喜欢骑自行车”这两句话,共指消解能够发现“他”指的是先前检测到的“史密斯先生”。 |
| * | | * |
− | * 关系抽取: 识别实体之间的关系,例如: | + | * 关系抽取: 识别实体之间的关系<ref name="ecir2019" /> ,例如: |
| * | | * |
| * | | * |
第95行: |
第95行: |
| * 半结构化信息抽取,它是试图恢复某种信息结构的信息抽取方法的统称,这种信息结构在发布过程中已经丢失,例如: | | * 半结构化信息抽取,它是试图恢复某种信息结构的信息抽取方法的统称,这种信息结构在发布过程中已经丢失,例如: |
| * | | * |
− | * 表格提取: 从文档中查找并提取表格。 | + | * 表格提取: 从文档中查找并提取表格<ref name=":9" /><ref name=":10" />。 |
| * | | * |
| * 表格信息抽取: 以结构化方式从表格中提取信息。这比表格提取更复杂:表格提取只是第一步,而理解单元格、行、列的角色,关联表格内部的信息,以及理解表格所呈现的信息,都是表格信息抽取所必需的额外任务。 | | * 表格信息抽取: 以结构化方式从表格中提取信息。这比表格提取更复杂:表格提取只是第一步,而理解单元格、行、列的角色,关联表格内部的信息,以及理解表格所呈现的信息,都是表格信息抽取所必需的额外任务。 |
第105行: |
第105行: |
| * 音频提取 | | * 音频提取 |
| * | | * |
− | * 基于模板的音乐提取: 从给定曲目的音频信号中寻找相关特征;例如,可以提取敲击音出现的时间索引,以表示音乐作品的基本节奏成分。 | + | * 基于模板的音乐提取: 从给定曲目的音频信号中寻找相关特征;例如<ref name=":11" /> ,可以提取敲击音出现的时间索引,以表示音乐作品的基本节奏成分。 |
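下面是一个简化的示意代码,把上面列表中的几个子任务(命名实体检测、关系抽取、三元组输出)串在一起:假设已安装 NLTK 及其所需数据包;其中“is in”规则只是为说明关系抽取思路而虚构的启发式,并非某个标准算法。

<syntaxhighlight lang="python">
# -*- coding: utf-8 -*-
"""示意:命名实体检测 + 一条简单的关系规则,产出 (实体1, 关系, 实体2) 三元组。
假设:已安装 NLTK,并已下载 punkt、averaged_perceptron_tagger、maxent_ne_chunker、words 数据包;
"is in" 规则仅为示意而虚构,并非通用做法。"""
import nltk


def named_entities(sentence):
    """返回 [(实体文本, 实体类型), ...],例如 [('Bill', 'PERSON'), ('France', 'GPE')]。"""
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    entities = []
    for node in tree:
        if hasattr(node, "label"):                       # 命名实体对应的子树
            entities.append((" ".join(tok for tok, _ in node.leaves()), node.label()))
    return entities


def located_in_triples(sentence):
    """简单规则:句中若同时出现 PERSON 与 GPE(地名)且含 "is in",则产出 located_in 三元组。"""
    ents = {}
    for text, label in named_entities(sentence):
        ents.setdefault(label, text)
    if "is in" in sentence and "PERSON" in ents and "GPE" in ents:
        return [(ents["PERSON"], "located_in", ents["GPE"])]
    return []


print(named_entities("Bill is in France."))
print(located_in_triples("Bill is in France."))
# 预期(取决于 NLTK 模型):[('Bill', 'PERSON'), ('France', 'GPE')] 与 [('Bill', 'located_in', 'France')]
</syntaxhighlight>

在该示意中,命名实体检测先给出实体及其类型,简单的关系规则再把它们组装成可写入知识库的三元组;实际系统通常会用统计或神经模型取代这条手写规则。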
| | | |
| Note that this list is not exhaustive and that the exact meaning of IE activities is not commonly accepted and that many approaches combine multiple sub-tasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE. | | Note that this list is not exhaustive and that the exact meaning of IE activities is not commonly accepted and that many approaches combine multiple sub-tasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE. |
第168行: |
第168行: |
| * | | * |
| * 条件随机场(CRF)通常与 IE 结合使用,应用于从研究论文中提取信息、提取导航指令等各种任务(下方给出一个简化示意)。 | | * 条件随机场(CRF)通常与 IE 结合使用,应用于从研究论文中提取信息、提取导航指令等各种任务(下方给出一个简化示意)。 |
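下面给出上一条所述“CRF 用于信息抽取中的序列标注”的一个简化示意:假设使用第三方库 sklearn-crfsuite(并非正文提到的任何特定实现),在两句虚构的玩具数据上训练线性链 CRF 并做 BIO 标注;特征模板和数据都只是演示用的假设。

<syntaxhighlight lang="python">
# -*- coding: utf-8 -*-
"""示意:用线性链 CRF 做 BIO 序列标注(信息抽取中常见的建模方式)。
假设:已安装第三方库 sklearn-crfsuite(pip install sklearn-crfsuite);
玩具语料与特征模板均为虚构,仅用于展示"特征 -> 标签序列"的基本流程。"""
import sklearn_crfsuite


def token_features(tokens, i):
    """为第 i 个词构造特征字典(真实系统会使用更丰富的特征)。"""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<s>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


def sent_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]


# 两个玩具句子及其 BIO 标签(B-PER=人名开头,B-LOC=地名开头,O=非实体)
train_sents = [
    (["Bill", "is", "in", "France", "."], ["B-PER", "O", "O", "B-LOC", "O"]),
    (["Mary", "works", "in", "Berlin", "."], ["B-PER", "O", "O", "B-LOC", "O"]),
]
X_train = [sent_features(toks) for toks, _ in train_sents]
y_train = [tags for _, tags in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)

test_tokens = ["John", "is", "in", "Spain", "."]
print(list(zip(test_tokens, crf.predict([sent_features(test_tokens)])[0])))
# 输出取决于训练结果,预期类似:[('John', 'B-PER'), ('is', 'O'), ('in', 'O'), ('Spain', 'B-LOC'), ('.', 'O')]
</syntaxhighlight>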
− |
| |
− | Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed.
| |
| | | |
| Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed. | | Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed. |
第182行: |
第180行: |
| * [[DBpedia Spotlight]] is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and [[Name resolution (semantics and text extraction)|name resolution]]. | | * [[DBpedia Spotlight]] is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and [[Name resolution (semantics and text extraction)|name resolution]]. |
| * [[Natural Language Toolkit]] is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language | | * [[Natural Language Toolkit]] is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language |
− | * See also [[Conditional random field#Software|CRF implementation]]s | + | * See also [[Conditional random field#Software|CRF implementation]]s<br /> |
− | | |
− | * General Architecture for Text Engineering (GATE) is bundled with a free Information Extraction system
| |
− | * Apache OpenNLP is a Java machine learning toolkit for natural language processing
| |
− | * OpenCalais is an automated information extraction web service from Thomson Reuters (Free limited version)
| |
− | * Machine Learning for Language Toolkit (Mallet) is a Java-based package for a variety of natural language processing tasks, including information extraction.
| |
− | * DBpedia Spotlight is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution.
| |
− | * Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language
| |
− | * See also CRF implementations
| |
− | | |
| | | |
| * 文本工程通用体系结构(GATE)捆绑了一个免费信息抽取系统 | | * 文本工程通用体系结构(GATE)捆绑了一个免费信息抽取系统 |