Natural language processing

{{distinguish|Nonlinear programming}}[[File:Automated online assistant.png|thumb| 200px |An [[automated online assistant]] providing [[customer service]] on a web page, an example of an application where natural language processing is a major component.<ref name=Kongthon>{{cite conference |doi = 10.1145/1643823.1643908|title = Implementing an online help desk system based on conversational agent |first1= Alisa |last1=Kongthon|first2= Chatchawal|last2= Sangkeettrakarn|first3= Sarawoot|last3= Kongyoung |first4= Choochart |last4 =  Haruechaiyasak|publisher =  ACM |date = October 27–30, 2009 |conference =  MEDES '09: The International Conference on Management of Emergent Digital EcoSystems|location = France }}</ref>]]
 
 
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
 
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
 
==History==
 
The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.
 
In 1950, [[Alan Turing]] published an article titled "[[Computing Machinery and Intelligence]]" which proposed what is now called the [[Turing test]] as a criterion of intelligence{{clarify|reason=What is the relationship between the Turing test and NLP?|date=October 2019}}.
 
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.  However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.  Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed.
 
Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.  Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".
 
During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data.  Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981).  During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.
 
Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.  Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of [[machine learning]] algorithms for language processing.  This was due to both the steady increase in computational power (see [[Moore's law]]) and the gradual lessening of the dominance of [[Noam Chomsky|Chomskyan]] theories of linguistics (e.g. [[transformational grammar]]), whose theoretical underpinnings discouraged the sort of [[corpus linguistics]] that underlies the machine-learning approach to language processing.<ref>Chomskyan linguistics encourages the investigation of "[[corner case]]s" that stress the limits of its theoretical models (comparable to [[pathological (mathematics)|pathological]] phenomena in mathematics), typically created using [[thought experiment]]s, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in [[corpus linguistics]].  The creation and use of such [[text corpus|corpora]] of real-world data is a fundamental part of machine-learning algorithms for natural language processing.  In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "[[poverty of the stimulus]]" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing.  As a result, the Chomskyan paradigm discouraged the application of such models to language processing.</ref> Some of the earliest-used machine learning algorithms, such as [[decision tree]]s, produced systems of hard if-then rules similar to existing hand-written rules.  However, [[Part of speech tagging|part-of-speech tagging]] introduced the use of [[hidden Markov models]] to natural language processing, and increasingly, research has focused on [[statistical models]], which make soft, [[probabilistic]] decisions based on attaching [[real-valued]] weights to the features making up the input data. The [[cache language model]]s upon which many [[speech recognition]] systems now rely are examples of such statistical models.  Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
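
The shift from hand-written rules to learned statistical models is easy to see in miniature: a hidden Markov model part-of-speech tagger estimates its transition and emission probabilities from a tagged corpus rather than having them written down. A minimal sketch, assuming the NLTK package and its Penn Treebank sample are available:

<syntaxhighlight lang="python">
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download("treebank", quiet=True)  # small Penn Treebank sample shipped with NLTK

# Supervised training: the model's probabilities are estimated from
# tagged sentences instead of being written down as rules.
train_sents = treebank.tagged_sents()[:3000]
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)

print(tagger.tag("Today is a good day .".split()))
</syntaxhighlight>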
 
Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.  These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.  However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
 
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms.  Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data.  Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.  However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
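
A sketch of the semi-supervised idea with scikit-learn (an assumption for illustration; the article names no library): a self-training wrapper fits a classifier on the few labelled documents, then iteratively adds its own most confident predictions on the unlabelled ones, which are marked with -1.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

docs = ["great movie", "awful film", "great acting",
        "awful plot", "great fun film", "awful boring movie"]
labels = [1, 0, -1, -1, -1, -1]  # -1 marks an unannotated document

X = CountVectorizer().fit_transform(docs)
clf = SelfTrainingClassifier(MultinomialNB()).fit(X, labels)

print(clf.predict(X[2:]))  # labels inferred for the unannotated documents
</syntaxhighlight>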
 
In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks, for example in language modeling, parsing, and many others. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).
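
Word embeddings themselves are straightforward to demonstrate. A sketch with the gensim package (assumed here; gensim 4.x API), training vectors on a toy corpus:

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# A toy corpus; real embeddings are trained on billions of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=200)

print(model.wv["king"][:5])                  # first 5 dimensions of a learned vector
print(model.wv.similarity("king", "queen"))  # cosine similarity of two embeddings
</syntaxhighlight>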
 
==Rule-based vs. statistical NLP{{anchor|Statistical natural language processing (SNLP)}}==
  
 
In the early days, many language-processing systems were designed by hand-coding a set of rules:<ref name=winograd:shrdlu71>{{cite thesis |last=Winograd |first=Terry |year=1971 |title=Procedures as a Representation for Data in a Computer Program for Understanding Natural Language |url=http://hci.stanford.edu/winograd/shrdlu/ }}</ref><ref name=schank77>{{cite book |first=Roger C. |last=Schank |first2=Robert P. |last2=Abelson |year=1977 |title=Scripts, Plans, Goals, and Understanding: An Inquiry Into Human Knowledge Structures |location=Hillsdale |publisher=Erlbaum |isbn=0-470-99033-3 }}</ref> such as by writing grammars or devising heuristic rules for [[stemming]].  
 
Since the so-called "statistical revolution" in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora (''corpora'' is the plural of ''corpus'': a set of documents, possibly with human or computer annotations) of typical real-world examples.
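
A minimal sketch of this paradigm with scikit-learn (an illustrative assumption, with made-up data): the "rules" are induced statistically from an annotated corpus rather than written by hand.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny hand-annotated "corpus"; real systems learn from far larger ones.
docs = ["great movie", "wonderful acting", "terrible plot",
        "awful film", "great fun", "terrible acting"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)  # statistical inference over word counts

print(model.predict(["great wonderful movie"]))  # -> ['pos']
</syntaxhighlight>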
 
Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. These algorithms take as input a large set of "features" that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of handwritten rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
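
The hard-versus-soft distinction can be sketched directly (scikit-learn assumed for illustration): a decision tree commits to a single if-then answer, while logistic regression attaches real-valued weights to each feature and reports a probability for every class.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

docs = ["book a flight", "book a table", "read the book", "a good book"]
labels = ["verb", "verb", "noun", "noun"]  # sense of "book" in each phrase

vec = CountVectorizer().fit(docs)
X = vec.transform(docs)

hard = DecisionTreeClassifier().fit(X, labels)  # hard if-then rules
soft = LogisticRegression().fit(X, labels)      # real-valued feature weights

test = vec.transform(["book the flight"])
print(hard.predict(test))        # one committed answer
print(soft.predict_proba(test))  # relative certainty over both classes
</syntaxhighlight>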
 
Systems based on machine-learning algorithms have many advantages over hand-produced rules:
 
*The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.
 
*Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with handwritten rules, or, more generally, creating systems of handwritten rules that make soft decisions, is extremely difficult, error-prone and time-consuming.
 
*Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. However, systems based on handwritten rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on handcrafted rules, beyond which the systems become more and more unmanageable. However, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
 
==Major evaluations and tasks==
 
The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks.

Though natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.
 
===Syntax===
  
 
; [[Grammar induction]]<ref>{{cite journal |last=Klein |first=Dan |first2=Christopher D. |last2=Manning |url=http://papers.nips.cc/paper/1945-natural-language-grammar-induction-using-a-constituent-context-model.pdf |title=Natural language grammar induction using a constituent-context model |journal=Advances in Neural Information Processing Systems |year=2002 }}</ref>: Generate a [[formal grammar]] that describes a language's syntax.
 
; [[Lemmatisation|Lemmatization]]: The task of removing inflectional endings only and to return the base dictionary form of a word which is also known as a lemma.
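
A sketch of this task with NLTK's WordNet lemmatizer (the package and its wordnet data are assumed):

<syntaxhighlight lang="python">
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))   # -> 'good' (adjective)
print(lemmatizer.lemmatize("meeting", pos="v"))  # -> 'meet' (as a verb)
print(lemmatizer.lemmatize("meeting", pos="n"))  # -> 'meeting' (as a noun)
</syntaxhighlight>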
 
; [[Morphology (linguistics)|Morphological segmentation]]: Separate words into individual [[morpheme]]s and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the [[Morphology (linguistics)|morphology]] (''i.e.'', the structure of words) of the language being considered. [[English language|English]] has fairly simple morphology, especially [[inflectional morphology]], and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (''e.g.'', "open, opens, opened, opening") as separate words. In languages such as [[Turkish language|Turkish]] or [[Meitei language|Meitei]],<ref>{{cite journal |last=Kishorjit |first=N. |last2=Vidya |first2=Raj RK. |last3=Nirmal |first3=Y. |last4=Sivaji |first4=B. |year=2012 |url=http://aclweb.org/anthology//W/W12/W12-5008.pdf |title=Manipuri Morpheme Identification |journal=Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP) |pages=95–108 |location=COLING 2012, Mumbai, December 2012 }}</ref> a highly [[Agglutination|agglutinated]] Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
 
; [[Part-of-speech tagging]]: Given a sentence, determine the [[part of speech]] (POS) for each word. Many words, especially common ones, can serve as multiple [[parts of speech]]. For example, "book" can be a [[noun]] ("the book on the table") or [[verb]] ("to book a flight"); "set" can be a [[noun]], [[verb]] or [[adjective]]; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others.{{dubious|date=June 2018}} Languages with little [[inflectional morphology]], such as [[English language|English]], are particularly prone to such ambiguity. [[Chinese language|Chinese]] is prone to such ambiguity because it is a [[tonal language]] during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning.
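
The "book" ambiguity above can be checked with NLTK's default tagger (package, tokenizer, and tagger data assumed):

<syntaxhighlight lang="python">
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["I want to book a flight", "The book is on the table"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))  # "book" is tagged VB in one and NN in the other
</syntaxhighlight>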
 
; [[Parsing]]: Determine the [[parse tree]] (grammatical analysis) of a given sentence. The [[grammar]] for [[natural language|natural languages]] is [[ambiguous]] and typical sentences have multiple possible analyses: perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: ''dependency parsing'' and ''constituency parsing''. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a [[probabilistic context-free grammar]] (PCFG) (see also ''[[stochastic grammar]]'').
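
A dependency-parsing sketch with spaCy (the library and its small English model en_core_web_sm are assumed):

<syntaxhighlight lang="python">
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
doc = nlp("The cat chased the mouse")

# Each token is attached to its syntactic head with a labelled relation.
for token in doc:
    print(f"{token.text:<7} --{token.dep_}--> {token.head.text}")
</syntaxhighlight>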
 
; [[Sentence breaking]] (also known as "[[sentence boundary disambiguation]]"): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by [[Full stop|periods]] or other [[punctuation mark|punctuation marks]], but these same characters can serve other purposes (''e.g.'', marking [[abbreviation|abbreviations]]).
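
A sketch with NLTK's Punkt sentence tokenizer (package and punkt model assumed), which is trained to treat abbreviation periods differently from sentence-final ones:

<syntaxhighlight lang="python">
import nltk

nltk.download("punkt", quiet=True)

text = "Dr. Smith arrived at 3 p.m. He left an hour later."
print(nltk.sent_tokenize(text))
# The periods in "Dr." and "p.m." are not all treated as sentence boundaries.
</syntaxhighlight>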
 
; [[Stemming]]: The process of reducing inflected (or sometimes derived) words to their root form. (''e.g.'', "close" will be the root for "closed", "closing", "close", "closer" etc.).
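
The "close" example, run through NLTK's Porter stemmer (package assumed):

<syntaxhighlight lang="python">
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["close", "closed", "closing", "closes", "closer"]:
    print(word, "->", stemmer.stem(word))
# Stems need not be dictionary words; stemming only strips affixes heuristically.
</syntaxhighlight>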
 
; [[Word segmentation]]: Separate a chunk of continuous text into separate words. For a language like [[English language|English]], this is fairly trivial, since words are usually separated by spaces. However, some written languages like [[Chinese language|Chinese]], [[Japanese language|Japanese]] and [[Thai language|Thai]] do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the [[vocabulary]] and [[Morphology (linguistics)|morphology]] of words in the language. Sometimes this process is also used in cases like bag of words (BOW) creation in data mining.
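
A Chinese word-segmentation sketch with the third-party jieba package (an assumption for illustration; the article names no tool):

<syntaxhighlight lang="python">
import jieba

# The characters carry no spaces; the segmenter must recover word boundaries.
print(list(jieba.cut("南京市长江大桥")))  # e.g. ['南京市', '长江大桥']
</syntaxhighlight>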
 
; [[Terminology extraction]]: The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
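
A crude sketch of the idea with TF-IDF weighting (scikit-learn assumed): terms that are frequent in a document but rare across the corpus rank highest.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "natural language processing studies parsing and tagging",
    "statistical parsing uses annotated treebanks",
    "the weather was nice and sunny today",
]

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vec.fit_transform(corpus)

# Highest-weighted n-grams of the first document are its candidate terms.
weights = tfidf[0].toarray().ravel()
terms = vec.get_feature_names_out()
for weight, term in sorted(zip(weights, terms), reverse=True)[:5]:
    print(f"{term}: {weight:.2f}")
</syntaxhighlight>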
 
===Semantics===
  
 
; [[Lexical semantics]]: What is the computational meaning of individual words in context?
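
One concrete lexical-semantics task is word-sense disambiguation; a sketch with NLTK's simplified Lesk algorithm (package and wordnet data assumed):

<syntaxhighlight lang="python">
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank", pos="n")  # picks a WordNet sense from context overlap
if sense is not None:
    print(sense.name(), "-", sense.definition())
</syntaxhighlight>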
 
; [[Distributional semantics]]: How can we learn semantic representations from data?
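
The distributional idea in miniature (an illustrative numpy sketch on a toy corpus): represent each word by its co-occurrence counts and compare words with cosine similarity.

<syntaxhighlight lang="python">
import numpy as np

corpus = ["the cat drinks milk", "the dog drinks water",
          "the cat chases the dog", "the dog chases the cat"]

vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

# Count co-occurrences within a window of one word on either side.
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                counts[index[w], index[words[j]]] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(counts[index["cat"]], counts[index["dog"]]))   # high: similar contexts
print(cosine(counts[index["cat"]], counts[index["milk"]]))  # lower: fewer shared contexts
</syntaxhighlight>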
 
; [[Machine translation]]: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "[[AI-complete]]", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) to solve properly.
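
A sketch of the neural approach via the Hugging Face transformers package (an assumption for illustration; the first run downloads a pretrained model):

<syntaxhighlight lang="python">
from transformers import pipeline

# T5 was pretrained with translation among its tasks; the pipeline wraps it.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The house is wonderful."))
</syntaxhighlight>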
 
; [[Named entity recognition]] (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although [[capitalization]] can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient.  For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.  Furthermore, many other languages in non-Western scripts (e.g. [[Chinese language|Chinese]] or [[Arabic language|Arabic]]) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, [[German language|German]] capitalizes all [[noun]]s, regardless of whether they are names, and [[French language|French]] and [[Spanish language|Spanish]] do not capitalize names that serve as [[adjective]]s.
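
An NER sketch with spaCy (library and en_core_web_sm model assumed):

<syntaxhighlight lang="python">
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alan Turing worked at the University of Manchester in England.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. PERSON, ORG, GPE
</syntaxhighlight>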
 
; [[Named entity recognition]] (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although [[capitalization]] can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient.  For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.  Furthermore, many other languages in non-Western scripts (e.g. [[Chinese language|Chinese]] or [[Arabic language|Arabic]]) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, [[German language|German]] capitalizes all [[noun]]s, regardless of whether they are names, and [[French language|French]] and [[Spanish language|Spanish]] do not capitalize names that serve as [[adjective]]s.
第239行: 第236行:
 
  Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient.  For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.  Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives.
 
  Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient.  For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.  Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives.
  
命名实体识别(NER) : 给定一个文本流,确定文本映射中的哪些项目指向适当的名称,如人或地点,以及每个名称的类型(例如:。人、地点、组织)。虽然大写有助于识别英语等语言中的命名实体,但这种信息无助于确定命名实体的类型,而且无论如何,这种信息往往是不准确或不充分的。例如,一个句子的第一个字母也是大写的,命名实体通常跨越几个单词,只有一些是大写的。此外,许多其他非西方文字的语言(例如:。汉语或阿拉伯语)根本没有大写,即使是有大写的语言也不一定能用它来区分名字。例如,德语将所有名词大写,而不管它们是否是名称,法语和西班牙语不将作为形容词的名称大写。
+
<font color=#ff8000>命名实体识别</font>(Named entity recognition,NER): 给定一个文本流,确定文本中的哪些词能映射到适当的名称,如人或地点,以及这些名称的类型(例如:人名、地点名、组织名)。虽然大写有助于识别英语等语言中的命名实体,但这种信息无助于确定命名实体的类型,而且大部分时候,这种信息往往是不准确或不充分的。比如说,一个句子的第一个字母也是大写的,命名实体通常跨越几个单词,只有一些是大写的。此外,许多其他非西方文字的语言(比如汉语或阿拉伯语)根本没有大写,即使是有大写的语言也不一定能用它来区分名字。比如德语不管一个名词是不是名词都将其大写,法语和西班牙语中作为形容词的名称不大写。
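
下面用一个示意性的例子演示命名实体识别。假设已安装 spaCy 并下载了英语小模型 en_core_web_sm(需先执行 python -m spacy download en_core_web_sm):

<syntaxhighlight lang="python">
# 命名实体识别示意:假设已安装 spaCy 及英语模型 en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Angela Merkel visited Paris in 2019 to meet officials from Google.")

for ent in doc.ents:
    # 预期输出类似: Angela Merkel PERSON / Paris GPE / 2019 DATE / Google ORG
    print(ent.text, ent.label_)
</syntaxhighlight>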
  
 
; [[Natural language generation]]: Convert information from computer databases or semantic intents into readable human language.

自然语言生成: 将计算机数据库或语义意图中的信息转换为人类可读的语言。
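
最简单的自然语言生成可以用模板实现。下面是一个纯 Python 的示意,其中的数据记录和字段名均为假设;真实系统还要处理语法一致、指代、篇章规划等问题:

<syntaxhighlight lang="python">
# 基于模板的自然语言生成示意:把结构化数据转成可读句子
record = {"city": "Beijing", "temp_high": 31, "temp_low": 22, "condition": "sunny"}

sentence = (f"Today in {record['city']} it will be {record['condition']}, "
            f"with a high of {record['temp_high']}°C and a low of {record['temp_low']}°C.")
print(sentence)
</syntaxhighlight>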
  
 
; [[Natural language understanding]]: Convert chunks of text into more formal representations such as [[first-order logic]] structures that are easier for [[computer]] programs to manipulate. Natural language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural language expression which usually takes the form of organized notations of natural language concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural language semantics without confusions with implicit assumptions such as [[closed-world assumption]] (CWA) vs. [[open-world assumption]], or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization.<ref>{{cite journal |first=Yucong |last=Duan |first2=Christophe |last2=Cruz |year=2011 |url=http://www.ijimt.org/abstract/100-E00187.htm |title=Formalizing Semantic of Natural Language through Conceptualization from Existence |archiveurl=https://web.archive.org/web/20111009135952/http://www.ijimt.org/abstract/100-E00187.htm |archivedate=2011-10-09 |journal=International Journal of Innovation, Management and Technology |volume=2 |issue=1 |pages=37–42 }}</ref>

自然语言理解: 将文本块转换成更规范的表示形式(如<font color=#ff8000>一阶逻辑</font>结构),以便计算机程序处理。自然语言理解需要从一个自然语言表达可能派生出的多种语义中识别出其预期语义;自然语言表达通常采用自然语言概念的有组织符号的形式。引入和创建语言元模型与本体是有效但经验性的解决方案。要为语义形式化构建基础,需要对自然语言语义做出明确的形式化,而不与封闭世界假设与开放世界假设、主观的是/否与客观的真/假等隐含假设相混淆。
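
下面用一个玩具例子示意“把句子映射成一阶逻辑式的表示”这一思路。它只用模式匹配覆盖 "X is a Y" 一种句型,纯属演示,真实的语义解析要复杂得多:

<syntaxhighlight lang="python">
# 自然语言理解的玩具示意:用模式匹配生成一阶逻辑风格的表示
import re

def to_logic(sentence):
    m = re.match(r"(\w+) is a (\w+)", sentence)
    if m:
        entity, category = m.groups()
        return f"{category.capitalize()}({entity})"
    return None

print(to_logic("Socrates is a philosopher"))  # 输出: Philosopher(Socrates)
</syntaxhighlight>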
  
 
; [[Optical character recognition]] (OCR): Given an image representing printed text, determine the corresponding text.

光学字符识别(Optical character recognition,OCR): 给定一幅印有文字的图像,确定相应的文本。
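 
下面是一个 OCR 的示意。假设已安装 pytesseract、Pillow 以及系统级的 Tesseract 识别引擎;其中的图像文件名是假设的示例:

<syntaxhighlight lang="python">
# OCR 示意:用 Tesseract 从印刷体图像中识别文本
from PIL import Image
import pytesseract

image = Image.open("printed_text.png")  # 假设的示例图像文件
print(pytesseract.image_to_string(image))
</syntaxhighlight>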
  
 
; [[Question answering]]: Given a human-language question, determine its answer.  Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). Recent works have looked at even more complex questions.<ref>{{cite journal |title=Versatile question answering systems: seeing in synthesis |last=Mittal |journal= International Journal of Intelligent Information and Database Systems|volume=5 |issue=2 |pages=119–142 |year=2011 |doi=10.1504/IJIIDS.2011.038968 |url=https://hal.archives-ouvertes.fr/hal-01104648/file/Mittal_VersatileQA_IJIIDS.pdf }}</ref>

问答: 给定一个用人类语言表述的问题,确定其答案。典型的问题有明确的正确答案(例如“加拿大的首都是哪里?”),但有时也要考虑开放式的问题(例如“生命的意义是什么?”)。最近的工作研究了更复杂的问题。
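
下面是一个抽取式问答的示意。假设已安装 transformers 库,首次运行会自动下载该任务的默认模型:

<syntaxhighlight lang="python">
# 抽取式问答示意:从给定的上下文中抽取答案片段(pip install transformers)
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(question="What is the capital of Canada?",
            context="Ottawa is the capital city of Canada.")
print(result["answer"])  # 预期输出类似: Ottawa
</syntaxhighlight>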
  
 
; [[Textual entailment|Recognizing Textual entailment]]: Given two text fragments, determine if one being true entails the other, entails the other's negation, or allows the other to be either true or false.<ref name=rte:11>PASCAL Recognizing Textual Entailment Challenge (RTE-7) https://tac.nist.gov//2011/RTE/</ref>

<font color=#ff8000>文本蕴涵识别</font>: 给定两个文本片段,判断其中一个为真时是否蕴含另一个、是否蕴含另一个的否定,或者另一个真假皆可(中立)。
  
 
; [[Relationship extraction]]: Given a chunk of text, identify the relationships among named entities (e.g. who is married to whom).

关系提取: 给定一个文本块,识别命名实体之间的关系(例如:谁嫁给了谁)。
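
下面用一个纯 Python 的玩具例子示意关系提取的任务形式:用模式匹配从文本中抽取(主体,关系,客体)三元组。真实系统通常基于依存句法或神经网络:

<syntaxhighlight lang="python">
# 关系提取的极简示意:基于模式匹配抽取三元组
import re

text = "Alice is married to Bob. Carol works for Acme."
patterns = [
    (r"(\w+) is married to (\w+)", "married_to"),
    (r"(\w+) works for (\w+)", "works_for"),
]

for pattern, relation in patterns:
    for subj, obj in re.findall(pattern, text):
        print((subj, relation, obj))
# 输出: ('Alice', 'married_to', 'Bob') 和 ('Carol', 'works_for', 'Acme')
</syntaxhighlight>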
  
 
; [[Sentiment analysis]] (see also [[multimodal sentiment analysis]]): Extract subjective information usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in social media, for marketing.

<font color=#ff8000>情感分析</font>(参见多模态情感分析): 通常从一组文档中提取主观信息,常利用在线评论判断人们对特定对象的“极性”(正面或负面)。它在识别社交媒体舆论趋势和市场营销方面尤其有用。
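
下面用 NLTK 自带的 VADER 情感词典做一个英文情感分析的示意(假设已安装 nltk):

<syntaxhighlight lang="python">
# 情感分析示意:基于 VADER 情感词典给文本打极性分
import nltk
nltk.download("vader_lexicon")  # 首次运行需要下载词典
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This movie was absolutely wonderful!"))
# 输出各极性的得分,其中 compound 为综合得分,正值表示正面情绪
</syntaxhighlight>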
  
 
; [[Topic segmentation]] and recognition: Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.

主题分割和识别: 给定一个文本块,将其分成几个部分,每个部分都有一个主题,并确定各个部分的主题。
  
 
; [[Word sense disambiguation]]: Many words have more than one [[Meaning (linguistics)|meaning]]; we have to select the meaning which makes the most sense in context.  For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or an online resource such as [[WordNet]].

<font color=#ff8000>词义消歧</font>: 许多词有多个义项,我们必须选择在上下文中最说得通的那个。解决这个问题时,通常会给定一个词表及其关联的义项,例如来自词典或 WordNet 这样的在线资源。
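
下面用 NLTK 中基于 Lesk 算法和 WordNet 的实现做一个词义消歧的示意(假设已安装 nltk):

<syntaxhighlight lang="python">
# 词义消歧示意:Lesk 算法根据上下文从 WordNet 中选择最可能的义项
import nltk
nltk.download("wordnet")  # 首次运行需要下载 WordNet 数据
from nltk.wsd import lesk

sentence = "I went to the bank to deposit money".split()
sense = lesk(sentence, "bank")  # 根据上下文为 "bank" 选择义项
print(sense, "-", sense.definition())
</syntaxhighlight>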
  
  
  
===话语(Discourse)===
  
 
; [[Automatic summarization]]:Produce a readable summary of a chunk of text.  Often used to provide summaries of the text of a known type, such as research papers, articles in the financial section of a newspaper.

自动摘要: 生成一段文本的可读摘要。常用于为已知类型的文本提供摘要,如研究论文、报纸财经版的文章。
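
抽取式摘要的一个最简单做法是按词频给句子打分。下面是一个纯 Python 的玩具示意(示例文本为虚构),真实系统会使用更复杂的特征或神经网络生成式模型:

<syntaxhighlight lang="python">
# 抽取式自动摘要的极简示意:选出包含高频词最多的句子
from collections import Counter

text = ("NLP studies language. NLP systems translate text. "
        "Dogs are friendly animals. NLP models summarize documents.")
sentences = [s.strip() for s in text.split(".") if s.strip()]
freq = Counter(text.lower().replace(".", "").split())

best = max(sentences, key=lambda s: sum(freq[w] for w in s.lower().split()))
print(best)  # 得分最高的句子被选为单句摘要
</syntaxhighlight>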
  
 
; [[Coreference|Coreference resolution]]: Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). [[Anaphora resolution]] is a specific example of this task, and is specifically concerned with matching up [[pronoun]]s with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving [[referring expression]]s. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).

共指消解: 给定一个句子或更大的文本块,确定哪些词(“指称”)指向相同的对象(“实体”)。指代消解是这项任务的一个具体实例,专门研究代词与其所指名词或名称的匹配。更一般的共指消解任务还包括识别涉及指称表达的“桥接关系”。例如,在“他从前门进入了约翰的房子”这句话中,“前门”是一个指称表达,需要确定的桥接关系是: 所指的门是约翰房子的前门(而不是其他也可能被指称的建筑的门)。
  
 
; [[Discourse analysis]]: This rubric includes several related tasks.  One task is identifying the [[discourse]] structure of a connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast).  Another possible task is recognizing and classifying the [[speech act]]s in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).

话语分析: 这一条目包括几项相关任务。一项是识别连贯文本的语篇结构,即句子之间话语关系的性质(如详述、解释、对比)。另一项可能的任务是识别并分类文本块中的言语行为(如是-否问题、内容问题、陈述、断言等)。
 
 
  
  
===语音(Speech)===
  
 
; [[Speech recognition]]: Given a sound clip of a person or people speaking, determine the textual representation of the speech.  This is the opposite of [[text to speech]] and is one of the extremely difficult problems colloquially termed "[[AI-complete]]" (see above).  In [[natural speech]] there are hardly any pauses between successive words, and thus [[speech segmentation]] is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed [[coarticulation]], so the conversion of the [[analog signal]] to discrete characters can be a very difficult process. Also, given that words in the same language are spoken by people with different accents, the speech recognition software must be able to recognize the wide variety of input as being identical to each other in terms of its textual equivalent.

语音识别: 给定一段一人或多人说话的声音片段,确定语音对应的文本。这是文本转语音的逆过程,也是被通俗地称为“人工智能完备”的极其困难的问题之一(见上文)。自然语音中连续的单词之间几乎没有停顿,因此语音切分是语音识别的一个必要子任务(见下文)。在大多数口语中,表示相继字母的声音在称为“协同发音”的过程中相互融合,因此把模拟信号转换成离散字符可能非常困难。此外,同一语言的单词会由带不同口音的人说出,语音识别软件必须能把这些五花八门的输入识别为文本上彼此等同的内容。
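
下面是一个语音识别的示意。假设已安装第三方库 SpeechRecognition(import 名为 speech_recognition),其中的音频文件名是假设的示例:

<syntaxhighlight lang="python">
# 语音识别示意:把音频文件转写成文本(pip install SpeechRecognition)
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("speech_sample.wav") as source:  # 假设的示例音频文件
    audio = recognizer.record(source)

# 调用 Google Web Speech API 进行识别(需要联网)
print(recognizer.recognize_google(audio))
</syntaxhighlight>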
  
 
; [[Speech segmentation]]: Given a sound clip of a person or people speaking, separate it into words.  A subtask of [[speech recognition]] and typically grouped with it.

语音切分: 给定一段一人或多人说话的声音片段,将其切分成单词。这是语音识别的一个子任务,通常与语音识别归为一类。
  
 
; [[Text-to-speech]]:Given a text, transform those units and produce a spoken representation. Text-to-speech can be used to aid the visually impaired.<ref>{{Citation|last=Yi|first=Chucai|title=Assistive Text Reading from Complex Background for Blind Persons|date=2012|work=Camera-Based Document Analysis and Recognition|pages=15–28|publisher=Springer Berlin Heidelberg|language=en|doi=10.1007/978-3-642-29364-1_2|isbn=9783642293634|last2=Tian|first2=Yingli|citeseerx=10.1.1.668.869}}</ref>

文本转语音(Text-to-speech): 给定一个文本,把其中的文字单元转换成语音表示。文本转语音可以用来帮助视力受损的人。
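
下面是一个文本转语音的示意,假设已安装离线 TTS 库 pyttsx3:

<syntaxhighlight lang="python">
# 文本转语音示意:调用系统自带的语音引擎朗读文本(pip install pyttsx3)
import pyttsx3

engine = pyttsx3.init()
engine.say("Text to speech can be used to aid the visually impaired.")
engine.runAndWait()  # 阻塞直到朗读完成
</syntaxhighlight>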
  
  
  
===对话(Dialogue)===
  
 
The first published work by an artificial intelligence was published in 2018, ''[[1 the Road]]'', marketed as a novel, contains sixty million words.

第一部由人工智能创作并出版的作品《1 the Road》于2018年问世,它以小说的形式推出,包含六千万个单词。
  
  
  
==参见==

{{div col|colwidth=22em}}
  
  
==参考文献==

{{Reflist|30em}}
  
  
==扩展阅读==

<!-- In alphabetical order by last name -->

2020年8月12日 (三) 12:26的版本

已由Thingamabob初步翻译。

模板:Distinguish

文件:Automated online assistant.png
An automated online assistant providing customer service on a web page, an example of an application where natural language processing is a major component.[1]

An automated online assistant providing customer service on a web page, an example of an application where natural language processing is a major component.

网页自动化在线客服,一个自然语言处理起重要作用的例子。


Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

自然语言处理(Natural language processing,NLP)是语言学、计算机科学、信息工程和人工智能等学科的一个分支。它涉及到计算机与人类语言(自然语言)之间的交互,特别是如何编写计算机程序来处理和分析大量的自然语言数据。


Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

自然语言处理主要面临着语音识别、自然语言理解和自然语言生成三大挑战。


历史

The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.

The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.

尽管相关工作可以追溯到更早,但自然语言处理(NLP)还是通常被认为始于20世纪50年代。

In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence模板:Clarify.

In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

1950年,艾伦 · 图灵发表了一篇题为《计算机器与智能》的文章,文中提出了现在被称为图灵测试的判断机器智能程度的标准。


The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed.

1954年乔治敦大学做了一个把超过六十个俄语句子全自动翻译成英语的实验。作者声称在三到五年内机器翻译的问题将会被解决。然而真正的进展要慢得多,在1966年ALPAC报告发现长达10年的研究未能达到预期之后,投入到机器翻译领域的资金大幅减少。直到20世纪80年代后期第一个统计机器翻译系统被开发出来,机器翻译领域的进一步研究才得以继续。


Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

SHRDLU和ELIZA是在20世纪60年代开发的两个非常成功的自然语言处理系统。SHRDLU是一个工作在只有有限词汇的 “沙盒游戏”的自然语言系统;而ELIZA是由约瑟夫·维森鲍姆在1964年和1966年之间编写的一个罗杰式模拟心理治疗师。Eliza 几乎没有用到任何有关人类思想或情感的信息,但有时却能做出一些令人吃惊的类似人类的互动。当“病人”的问题超出了它的小知识范围时,ELIZA 可能会给出一般性的回答,例如,用“你为什么说你头疼? ”来回答病人“我头疼”的问题。


During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.

During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.

20世纪70年代,许多程序员开始编写“概念本体”,这是一种能将真实世界的信息结构化为计算机可理解的数据 。例如 MARGIE (Schank,1975)、 SAM (Cullingford,1978)、 PAM (Wilensky,1978)、 TaleSpin (Meehan,1976)、 QUALM (Lehnert,1977)、 Politics (Carbonell,1979)和 Plot Units (Lehnert,1981)。与此同时也出现了许多聊天机器人,比如 PARRY,Racter 和 Jabberwacky。

Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore's law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.


到20世纪80年代,大多数自然语言处理系统仍都依赖于复杂的人制定的规则。然而从20世纪80年代末开始,随着语言处理机器学习算法的引入,自然语言处理领域掀起了一场革命。这是由于计算能力的稳步增长(参见摩尔定律)和乔姆斯基语言学理论的主导地位逐渐削弱(如转换语法)。乔姆斯基语言学理论并不认同语料库语言学,而语料库语言学却是语言处理机器学习方法的基础。一些最早被使用的机器学习算法,比如决策树,产生了使用“如果...那么..."(if-then)硬判决的系统,这种规则类似于之前人类制定的规则。然而,对词性标注的需求(p.s意译)使得隐马尔可夫模型被引入到自然语言处理中,并且人们越来越多地将研究重点放在了统计模型上。统计模型将输入数据的各个特征都赋上实值权重,从而做出软判决概率决策。许多语音识别系统现在所依赖的缓存语言模型就是这种统计模型的例子。这种模型在给定不熟悉的输入,特别是包含错误的输入(在实际数据中这是非常常见的)时,通常更加可靠,并且将多个子任务整合到较大系统中时,能产生更可靠的结果。


Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.

许多早期瞩目的成功出现在机器翻译领域,特别是IBM研究所的工作,他们先后开发了更复杂的统计模型。为了实现将所有行政诉讼翻译成相应政府系统的官方语言的法律要求,加拿大议会和欧盟编制了多语言文本语料库,IBM开发的一些系统能够利用这些语料库。然而大多数其他系统都依赖于专门为这些系统所执行任务开发的语料库,这是并且通常一直是这些系统的一个主要限制(p.s.省译)。因此,大量的研究开始探寻如何利用有限的数据更有效地学习的方法。


Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.

最近的研究越来越多地集中在无监督和半监督学习算法上。这些算法可以利用没有人工标注但有预期答案的数据或使用了标注和未标注兼有的数据学习。一般来说,这个任务要比监督学习计算困难得多,而且对于给定数量的输入数据,产生的结果通常不那么精确。然而如果所使用的算法具有足够低的时间复杂度,有大量无标注的数据可用(包括其他事物,比如万维网的所有内容)往往可以有效弥补不那么精确的结果。


In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).

In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks, for example in language modeling, parsing, and many others. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).

二十一世纪一零年代,表示学习和深度神经网络式的机器学习方法在自然语言处理中得到了广泛的应用,部分原因是一系列的结果表明这些技术可以在许多自然语言任务中获得最先进的结果,比如语言建模、语法分析等。流行的技术包括使用词嵌入来获取单词的语义属性,以及增加高级任务的端到端学习(如问答) ,而不是依赖于分立的中间任务流程(如词性标记和依赖性分析)。在某些领域,这种转变使得NLP系统的设计发生了重大变化,因此,基于深层神经网络的方法可以被视为一种有别于统计自然语言处理的新范式。例如,神经机器翻译(neural machine translation,NMT)一词强调了这样一个事实:基于深度学习的机器翻译方法直接学习序列到序列变换,从而避免了统计机器翻译(statistical machine translation,SMT)中使用的词对齐和语言建模等中间步骤。


基于规则的NLP vs. 统计NLP (Rule-based vs. statistical NLP模板:Anchor)

In the early days, many language-processing systems were designed by hand-coding a set of rules:[9][10] such as by writing grammars or devising heuristic rules for stemming.

In the early days, many language-processing systems were designed by hand-coding a set of rules: such as by writing grammars or devising heuristic rules for stemming.

在早期,许多语言处理系统是通过人工编码一组规则来设计的: 例如通过编写语法或设计启发式规则来提取词干。


Since the so-called "statistical revolution"[11][12] in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora (the plural form of corpus, is a set of documents, possibly with human or computer annotations) of typical real-world examples.

Since the so-called "statistical revolution" in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora (the plural form of corpus, is a set of documents, possibly with human or computer annotations) of typical real-world examples.

自从20世纪80年代末和90年代中期的“统计革命”以来,许多自然语言处理研究都深度依赖机器学习。机器学习的范式要求通过分析大型语料库(corpora,语料库corpus的复数形式,是一组可能带有人或计算机标注的文档)使用统计学推论自动学习这些规则。


Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. These algorithms take as input a large set of "features" that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of handwritten rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.

Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. These algorithms take as input a large set of "features" that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of handwritten rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.

许多不同类型的机器学习算法已被应用在自然语言处理任务中。这些算法将输入数据的大量“特性”作为输入。一些最早被使用的算法,比如决策树,生成了使用“如果...那么..."(if-then)硬判决的系统,这种规则类似于很常见的人类制定的规则。然而后来人们越来越多地将研究重点放在了统计模型上。统计模型将输入数据的各个特征都赋上实值权重,从而做出软判决概率决策。这种模型的优点是,它们可以表示出许多不同的可能答案的相对确定性,而不仅仅是一个答案。当这种模型作为一个更大系统的模块时,可以产生更可靠的结果。


Systems based on machine-learning algorithms have many advantages over hand-produced rules:

Systems based on machine-learning algorithms have many advantages over hand-produced rules:

基于机器学习算法的系统比起人工制定的规则有许多优点:

  • The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.
  • 机器学习的学习过程自动聚焦于最常见的例子,然而人工制定的规则常常不知道从何下手
  • Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with handwritten rules, or, more generally, creating systems of handwritten rules that make soft decisions, is extremely difficult, error-prone and time-consuming.
  • 自动学习过程中可以利用统计推断算法生成对不常见输入(包含未见过的字词或结构)、错误输入(如拼错或无意遗漏词语)有较好鲁棒性的模型。通常用人工制定的规则或建立一个人工制定规则的软决策系统处理这样的输入是极其困难、易于出错且耗费时间的。
  • Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. However, systems based on handwritten rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on handcrafted rules, beyond which the systems become more and more unmanageable. However, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.

基于自动学习规则的系统可以仅用更多的输出就能得到更精确的结果。然而基于人工制定规则的系统只能通过把规则变得复杂来实现提高结果精确度,而制定更复杂的规则这件事本身就很困难。而且基于人工制定规则的系统有一定的限制,超过限制后系统就会变得不可控。然而制造更多数据供给机器学习系统只需要增加相应的人工标注的时间,而且这个过程的复杂度不会有显著改变。

主要评估及任务(Major evaluations and tasks)

The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solvi ng larger tasks.

The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks.

以下列表列出了自然语言处理中一些最常被研究的任务。其中一些任务具有直接的实际应用,而其他任务则通常作为子任务,用于帮助解决更大的任务。


Though natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.

Though natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.

尽管自然语言处理的各种任务紧密交错,但为了方便,它们常被细分为不同的类别。下面给出一个粗略的分类。


句法Syntax

Grammar induction[13]
Generate a formal grammar that describes a language's syntax.
Grammar induction: Generate a formal grammar that describes a language's syntax.

语法归纳: 生成描述语言句法结构的规范语法。

Lemmatization
The task of removing inflectional endings only and to return the base dictionary form of a word which is also known as a lemma.
Lemmatization: The task of removing inflectional endings only and to return the base dictionary form of a word which is also known as a lemma.

词目化: 只去掉词形变化的词尾,并返回词的基本形式,也称词目

Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g., "open, opens, opened, opening") as separate words. In languages such as Turkish or Meitei,[14] a highly agglutinated Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g., "open, opens, opened, opening") as separate words. In languages such as Turkish or Meitei, a highly agglutinated Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.

语素切分: 将单词分成独立的语素,并确定语素的类别。这项任务的难度很大程度上取决于所考虑的语言的形态(即句子的结构)的复杂性。英语有相当简单的语素,特别是屈折语素,因此通常可以完全忽略这个任务,而简单地将一个单词的所有可能形式(例如,"open,opens,opened,opening")作为单独的单词。然而,在诸如土耳其语或曼尼普尔语这样的语言中,这种方法是不可取的,因为每个词都有成千上万种可能的词形。

Part-of-speech tagging
Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others.模板:Dubious Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning.
Part-of-speech tagging: Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning.

词性标注: 给定一个句子,确定每个词的词性(part of speech, POS)。许多单词,尤其是常见的单词,可以拥有多种词性。例如,“book”可以是名词(书本)(“ the book on the table”)或动词(预订)(“to book a flight”) ; “set”可以是名词、动词或形容词; “out”至少有五种不同的词性(p.s.意译)。有些语言比其他语言有更多的这种模糊性。像英语这样几乎没有屈折形态的语言尤其容易出现这种歧义。汉语是一种在动词化过程中会变音调的语言,所以容易出现歧义现象。这样的词形变化不容易通过正字法中使用的实体来传达预期的意思。

Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses: perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: dependency parsing and constituency parsing. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar (PCFG) (see also stochastic grammar).

语法分析: 确定给定句子的语法树(语法分析)。自然语言的语法是有歧义的,典型的句子往往有多种可能的分析: 也许令人吃惊的是,一个典型的句子可能有成千上万个潜在的分析结果(其中大多数对人类来说毫无意义)。语法分析主要有两种类型: 依存分析和成分分析。依存分析侧重于句子中单词之间的关系(标记主要宾语和谓语等),而成分分析侧重于使用概率上下文无关文法(probabilistic context-free grammar,PCFG)构建语法树(参见随机语法)。
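
下面给出用 spaCy 做依存分析的简单示意(假设已安装 spaCy 及其英语小模型 en_core_web_sm;成分分析需要另外的工具,此处不展开):

<syntaxhighlight lang="python">
import spacy

# 加载 spaCy 预训练的英语小模型(需事先安装)
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

# 打印每个词、它的依存关系标签,以及它所依附的中心词
for token in doc:
    print(token.text, token.dep_, token.head.text)
</syntaxhighlight>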

Sentence breaking (also known as "sentence boundary disambiguation"): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g., marking abbreviations).

断句(也称“句子边界消歧”): 给定一段文本,找出句子的边界。句子边界通常用句号或其他标点符号标记,但这些符号也可能有其他用途(例如,标记缩写)。
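
下面的示意代码用 NLTK 的 Punkt 模型进行断句;示例文本故意包含缩写中的句点,以展示这项任务的难点所在(假设已安装 NLTK):

<syntaxhighlight lang="python">
import nltk
nltk.download('punkt')  # Punkt 断句模型

text = "Dr. Smith went to Washington D.C. yesterday. He met Mr. Brown."
# 缩写("Dr."、"D.C."、"Mr.")中的句点不应被当作句子边界
print(nltk.sent_tokenize(text))
</syntaxhighlight>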

Stemming: The process of reducing inflected (or sometimes derived) words to their root form (e.g., "close" will be the root for "closed", "closing", "close", "closer", etc.).

词干提取(Stemming): 将屈折变化(有时也包括派生)的词还原为其词根形式的过程(例如,“close”是“closed”、“closing”、“close”、“closer”等词的词根)。
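
下面用 NLTK 中经典的 Porter 词干提取器演示这一过程(仅为示意;不同词干提取算法的输出可能不同,且结果未必是词典中的词):

<syntaxhighlight lang="python">
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["closed", "closing", "close", "closer"]:
    # 词干提取基于启发式规则,输出未必与语言学意义上的词根完全一致
    print(word, "->", stemmer.stem(word))
</syntaxhighlight>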

Word segmentation: Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Sometimes this process is also used in cases like bag of words (BOW) creation in data mining.

分词: 把一段连续的文本切分成单独的词。对于英语之类的语言来说,这相当简单,因为单词通常由空格分隔。然而,汉语、日语和泰语等书面语言并不以这种方式标记词的边界;在这些语言中,文本切分是一项重要的任务,需要掌握该语言的词汇和词形知识。有时这一过程也用于数据挖掘中的词袋(bag of words,BOW)构建等场景。
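
对于中文这类不以空格分词的语言,通常使用专门的分词工具。下面以 jieba 库为例给出一个简单示意(假设已安装 jieba;示例句子为任意选取):

<syntaxhighlight lang="python">
import jieba  # 常用的中文分词库,需事先安装

sentence = "自然语言处理是人工智能的一个重要分支"
# jieba.cut 返回一个生成器,这里用"/"连接以便查看切分结果
print("/".join(jieba.cut(sentence)))
</syntaxhighlight>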

Terminology extraction: The goal of terminology extraction is to automatically extract relevant terms from a given corpus.

术语提取: 术语提取的目标是从给定的语料库中自动提取相关术语。
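
一种常见的简单做法是用 TF-IDF 权重对候选词排序。下面是基于 scikit-learn 的最简示意(语料与参数均为演示用的假设;真实的术语提取系统通常还会结合语言学过滤规则):

<syntaxhighlight lang="python">
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 演示用的小语料(假设数据)
docs = [
    "natural language processing studies interactions between computers and human language",
    "machine translation automatically translates text from one human language to another",
]

# 抽取一元和二元词组作为候选术语,并过滤英语停用词
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)

# 取第一篇文档中 TF-IDF 权重最高的 5 个候选术语
scores = X[0].toarray().ravel()
terms = vec.get_feature_names_out()  # 需要较新版本的 scikit-learn
top = np.argsort(scores)[::-1][:5]
print([terms[i] for i in top])
</syntaxhighlight>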


语义(Semantics)

Lexical semantics: What is the computational meaning of individual words in context?

词汇语义学: 每个词在上下文中的计算意义是什么?

Distributional semantics: How can we learn semantic representations from data?

分布语义学: 我们如何从数据中学习语义表示?
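
分布语义学的一个典型做法是从语料中训练词向量。下面用 gensim 的 Word2Vec 给出最简示意(玩具语料仅为演示用的假设;真实训练需要大规模语料才能得到有意义的相似度):

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# 玩具语料:每个句子是一个词列表(假设数据,仅为演示)
sentences = [
    ["king", "queen", "palace", "royal"],
    ["man", "woman", "person", "people"],
    ["king", "man", "queen", "woman"],
]

# 训练一个小型词向量模型;vector_size、window 等参数为演示取值
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# 查询与 "king" 在该玩具语料中分布最相似的词
print(model.wv.most_similar("king", topn=3))
</syntaxhighlight>

在大规模语料上,这类向量能够体现“出现在相似上下文中的词具有相似语义”这一分布假设。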

Machine translation: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) to solve properly.

机器翻译: 将文本自动从一种人类语言翻译成另一种。这是最困难的问题之一,属于一类被通俗地称为“人工智能完备”(AI-complete)的问题,即需要用到人类所拥有的所有类型的知识(语法、语义、关于现实世界的事实等)才能妥善解决。

Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient. For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives.

命名实体识别(Named entity recognition,NER): 给定一个文本流,确定文本中哪些部分对应专有名称(如人名或地名),以及每个名称的类型(例如人名、地名、组织名)。虽然大写有助于识别英语等语言中的命名实体,但这一信息无助于确定命名实体的类型,而且这种信息往往不准确或不充分。例如,句子的第一个字母也是大写的,而命名实体常常横跨多个单词,其中只有一部分是大写的。此外,许多使用非西方文字的语言(如汉语或阿拉伯语)根本没有大写,即使是有大写的语言也不一定会用大写来区分名称。例如,德语的所有名词都大写,无论它们是否是名称;而法语和西班牙语中用作形容词的名称不大写。
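
下面用 spaCy 的预训练英语模型做命名实体识别的简单示意(假设已安装 spaCy 及 en_core_web_sm 模型;示例句子取自常见的演示文本):

<syntaxhighlight lang="python">
import spacy

nlp = spacy.load("en_core_web_sm")  # 预训练英语模型,需事先安装
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# 打印识别出的实体及其类型(如 ORG、GPE、MONEY)
for ent in doc.ents:
    print(ent.text, ent.label_)
</syntaxhighlight>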

Natural language generation: Convert information from computer databases or semantic intents into readable human language.

自然语言生成: 将计算机数据库或语义意图中的信息转换为人类可读的语言。

Natural language understanding: Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural language expression which usually takes the form of organized notations of natural language concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural language semantics without confusions with implicit assumptions such as closed-world assumption (CWA) vs. open-world assumption, or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization.[15]

自然语言理解: 将文本块转换成更正式的表示形式(比如一阶逻辑结构),使计算机程序更容易处理。自然语言理解需要从多种可能的语义中识别出预期的语义;这些语义可以从自然语言表达中派生出来,而自然语言表达通常以自然语言概念的有组织符号的形式出现。引入和创建语言元模型和本体是高效但经验性的解决方案。要构建语义形式化的基础,需要对自然语言语义进行明确的形式化,而不能与封闭世界假设(CWA)与开放世界假设、主观的是/否与客观的真/假等隐含假设相混淆。

Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.

光学字符识别(Optical character recognition,OCR): 给定一幅表示印刷文字的图像,确定相应的文本。

Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). Recent works have looked at even more complex questions.[16]

问题回答: 给出一个用人类语言表述的问题,确定它的答案。典型的问题都有一个明确的正确答案(例如“加拿大的首都是哪里? ”),但有时候也需要考虑开放式的问题(比如“生命的意义是什么? ”)。最近的工作研究了更加复杂的问题。
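
基于预训练模型的抽取式问答如今可以用现成的工具搭建。下面用 Hugging Face transformers 的 pipeline 给出示意(假设已安装 transformers;首次运行会下载默认模型,示例中的上下文文本为演示用):

<syntaxhighlight lang="python">
from transformers import pipeline

# 构建默认的抽取式问答流水线(首次运行会下载预训练模型)
qa = pipeline("question-answering")

result = qa(
    question="What is the capital of Canada?",
    context="Ottawa is the capital city of Canada, located in Ontario.",
)
# 返回从上下文中抽取出的答案文本及置信度分数
print(result["answer"], result["score"])
</syntaxhighlight>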

Recognizing textual entailment: Given two text fragments, determine if one being true entails the other, entails the other's negation, or allows the other to be either true or false.[17]

文本蕴涵识别: 给定两个文本片段,判断其中一个为真时是否蕴涵另一个为真、是否蕴涵另一个的否定,或者是否允许另一个既可能为真也可能为假。

Relationship extraction: Given a chunk of text, identify the relationships among named entities (e.g. who is married to whom).

关系提取: 给定一个文本块,识别命名实体之间的关系(例如:谁嫁给了谁)。

Sentiment analysis (see also multimodal sentiment analysis): Extract subjective information usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in social media, for marketing.

情感分析(参见多模态情感分析): 通常从一组文档中提取主观信息,常常利用在线评论来确定人们对特定对象的“极性”(褒贬倾向)。它在识别社交媒体舆论趋势和市场营销方面特别有用。
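
下面用 NLTK 自带的 VADER 情感分析器(一种面向社交媒体文本的基于词典的方法)给出一个简单示意(假设已安装 NLTK):

<syntaxhighlight lang="python">
import nltk
nltk.download('vader_lexicon')  # VADER 情感词典
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# polarity_scores 返回正向、负向、中性以及综合(compound)极性分数
print(sia.polarity_scores("This product is absolutely wonderful!"))
print(sia.polarity_scores("The service was terrible and slow."))
</syntaxhighlight>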

Topic segmentation and recognition: Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.

主题分割和识别: 给定一个文本块,将其分成几个部分,每个部分都有一个主题,并确定各个部分的主题。

Word sense disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or an online resource such as WordNet.

词义消歧: 许多词有多个意思,我们必须选择最符合上下文的意思。对于这个问题,通常会给定一个单词列表及其相关的词义,例如来自词典或 WordNet 等在线资源。
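
经典的 Lesk 算法依据词典释义与上下文的重叠程度来消歧,NLTK 提供了现成实现。下面是一个简单示意(假设已安装 NLTK 并可下载 WordNet 数据;示例中的歧义词为 "bank",既可指银行,也可指河岸):

<syntaxhighlight lang="python">
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.wsd import lesk

# Lesk 根据上下文词与各候选释义的重叠度选择词义
sentence = nltk.word_tokenize("I went to the bank to deposit my money")
sense = lesk(sentence, "bank", "n")  # 限定为名词词义
if sense is not None:
    print(sense.name(), "-", sense.definition())
</syntaxhighlight>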


话语(Discourse)

Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of the text of a known type, such as research papers, articles in the financial section of a newspaper.

自动摘要: 生成一段文本的可读摘要。常用于提供已知类型文本的摘要,如研究论文、报纸财经版的文章等。

Coreference resolution: Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).

共指消解: 给定一个句子或更大的文本块,确定哪些单词(“指称”)指的是相同的对象(“实体”)。指代消解就是这项任务的一个具体实例,它专门研究代词与所指名词或名称的匹配问题。共指消解的一般任务还包括识别指称之间的“桥接关系”。例如,在“他从前门进入了约翰的房子”这句话中,“前门”是一种指称,需要确定的桥接关系是:所指的门是约翰的房子的前门(而不是其他一些也可以指称的结构)。

Discourse analysis: This rubric includes several related tasks. One task is identifying the discourse structure of a connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).

话语分析: 这一条目包括几个相关的任务。一个任务是识别连贯文本的语篇结构,即句子之间话语关系的性质(例如:详述、解释、对比)。另一个可能的任务是识别文本块中的言语行为并对其分类(例如:是-否问题、内容问题、陈述、断言等)。


语音(Speech)

Speech recognition: Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, given that words in the same language are spoken by people with different accents, the speech recognition software must be able to recognize the wide variety of input as being identical to each other in terms of its textual equivalent.

语音识别: 给定一个人或多人说话的声音片段,确定语音对应的文本。这是文本转语音的相反过程,也是被通俗地称为“人工智能完备”(见上文)的极其困难的问题之一。自然语音中连续的单词之间几乎没有停顿,因此语音切分是语音识别的一个必要子任务(见下文)。在大多数口语中,表示相邻字母的声音会在称为“协同发音”的过程中相互融合,因此将模拟信号转换为离散字符可能是一个非常困难的过程。此外,同一语言的单词会被带有不同口音的人说出,语音识别软件必须能够将文本上等同的各种不同输入识别为相同的内容。
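
下面用 SpeechRecognition 库把一段音频文件转写为文本的最简示意(假设已安装该库;文件名 speech.wav 为演示用的假设,该库实际调用外部识别服务或引擎):

<syntaxhighlight lang="python">
import speech_recognition as sr  # SpeechRecognition 库,需事先安装

r = sr.Recognizer()
# "speech.wav" 为演示用的假设文件,需替换为实际音频
with sr.AudioFile("speech.wav") as source:
    audio = r.record(source)

try:
    # 调用 Google Web Speech API 进行识别(需要联网)
    print(r.recognize_google(audio))
except sr.UnknownValueError:
    print("无法识别该语音片段")
</syntaxhighlight>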

Speech segmentation: Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.

语音切分: 给定一个人或多人说话的声音片段,将其切分成单词。这是语音识别的一个子任务,通常与语音识别一起进行。

Text-to-speech: Given a text, transform those units and produce a spoken representation. Text-to-speech can be used to aid the visually impaired.[18]

文本转语音(Text-to-speech): 给定一个文本,转换其中的单位并生成语音表示。文本转语音可以用来帮助视障人士。


对话(Dialogue)

The first published work by an artificial intelligence, 1 the Road, was published in 2018; marketed as a novel, it contains sixty million words.

第一部由人工智能创作并出版的作品《1 the Road》于2018年问世;它以小说的形式推出,包含六千万个单词。


参见


参考文献

  1. Kongthon, Alisa; Sangkeettrakarn, Chatchawal; Kongyoung, Sarawoot; Haruechaiyasak, Choochart (October 27–30, 2009). Implementing an online help desk system based on conversational agent. MEDES '09: The International Conference on Management of Emergent Digital EcoSystems. France: ACM. doi:10.1145/1643823.1643908.
  2. Hutchins, J. (2005). "The history of machine translation in a nutshell" (PDF).
  3. Chomskyan linguistics encourages the investigation of "corner cases" that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for natural language processing. In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "poverty of the stimulus" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.
  4. Goldberg, Yoav (2016). "A Primer on Neural Network Models for Natural Language Processing". Journal of Artificial Intelligence Research. 57: 345–420. arXiv:1807.10854. doi:10.1613/jair.4992.
  5. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org/. 
  6. Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui (2016). Exploring the Limits of Language Modeling. arXiv:1602.02410. Bibcode:2016arXiv160202410J.
  7. Choe, Do Kook; Charniak, Eugene. "Parsing as Language Modeling". EMNLP 2016.
  8. Vinyals, Oriol; Kaiser, Lukasz; et al. (2014). "Grammar as a Foreign Language" (PDF). NIPS 2015. arXiv:1412.7449. Bibcode:2014arXiv1412.7449V.
  9. Winograd, Terry (1971). Procedures as a Representation for Data in a Computer Program for Understanding Natural Language (Thesis).
  10. Schank, Roger C.; Abelson, Robert P. (1977). Scripts, Plans, Goals, and Understanding: An Inquiry Into Human Knowledge Structures. Hillsdale: Erlbaum. ISBN 0-470-99033-3. 
  11. Mark Johnson. How the statistical revolution changes (computational) linguistics. Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics.
  12. Philip Resnik. Four revolutions. Language Log, February 5, 2011.
  13. Klein, Dan; Manning, Christopher D. (2002). "Natural language grammar induction using a constituent-context model" (PDF). Advances in Neural Information Processing Systems.
  14. Kishorjit, N.; Vidya, Raj RK.; Nirmal, Y.; Sivaji, B. (2012). "Manipuri Morpheme Identification" (PDF). Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), COLING 2012, Mumbai, December 2012: 95–108.
  15. Duan, Yucong; Cruz, Christophe (2011). "Formalizing Semantic of Natural Language through Conceptualization from Existence". International Journal of Innovation, Management and Technology. 2 (1): 37–42. Archived from the original on 2011-10-09.
  16. Mittal (2011). "Versatile question answering systems: seeing in synthesis" (PDF). International Journal of Intelligent Information and Database Systems. 5 (2): 119–142. doi:10.1504/IJIIDS.2011.038968.
  17. PASCAL Recognizing Textual Entailment Challenge (RTE-7) https://tac.nist.gov//2011/RTE/
  18. Yi, Chucai; Tian, Yingli (2012), "Assistive Text Reading from Complex Background for Blind Persons", Camera-Based Document Analysis and Recognition (in English), Springer Berlin Heidelberg, pp. 15–28, CiteSeerX 10.1.1.668.869, doi:10.1007/978-3-642-29364-1_2, ISBN 9783642293634


扩展阅读




Category:Computational linguistics

类别: 计算语言学

Category:Speech recognition

类别: 语音识别

Category:Computational fields of study

类别: 计算性研究领域

Category:Artificial intelligence

类别: 人工智能


This page was moved from wikipedia:en:Natural language processing. Its edit history can be viewed at 自然语言处理/edithistory