Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
 
'''<font color=#ff8000>自然语言处理 Natural Language Processing</font>'''是'''<font color=#ff8000>语言学 Linguistics</font>'''、'''<font color=#ff8000>计算机科学 Computer Science</font>'''、'''<font color=#ff8000>信息工程 Information Engineering</font>'''和'''<font color=#ff8000>人工智能 Artificial Intelligence</font>'''等领域的分支学科。它涉及到计算机与人类语言(自然语言)之间的交互,特别是如何编写计算机程序来处理和分析大量的自然语言数据。
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
 
自然语言处理主要面临着'''<font color=#ff8000>语音识别 Speech Recognition</font>'''、'''<font color=#ff8000>自然语言理解 Natural Language Understanding</font>'''和'''<font color=#ff8000>自然语言生成 Natural Language Generation</font>'''三大挑战。
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
 
1950年,艾伦 · 图灵发表《计算机器与智能》一文,提出'''<font color=#ff8000>图灵测试 Turing Test</font>'''作为判断机器智能程度的标准。
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.  However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.  Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed.
 
1954年乔治敦大学成功将六十多个俄语句子自动翻译成了英语。作者声称在三到五年内将解决机器翻译问题,然而,事实上的进展要缓慢得多,1966年的ALPAC报告认为,长达10年的研究并未达到预期目标。自此之后,投入到机器翻译领域的资金急剧减少。直到20世纪80年代后期,当第一个'''<font color=#ff8000>统计机器翻译 Statistical Machine Translation</font>'''系统被开发出来以后,机器翻译的研究才得以进一步推进。
Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.  Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".
 
SHRDLU和ELIZA是于20世纪60年代开发的两款非常成功的自然语言处理系统。其中,SHRDLU是一个工作在词汇有限的“积木世界”中的自然语言系统;而ELIZA则是由约瑟夫·维森鲍姆在1964年至1966年之间编写的一款罗杰式模拟心理治疗师程序。ELIZA几乎没有使用任何有关人类思想或情感的信息,但有时却能做出一些令人吃惊的、类似人类的互动。当“病人”的问题超出了它有限的知识范围时,ELIZA可能会给出一般性的回复。例如,它可能会用“你为什么说你头疼?”来回答“我头疼”之类的问题。
During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data.  Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981).  During this time, many chatterbots were written including PARRY, Racter, and Jabberwacky.
 
20世纪70年代,程序员开始编写'''<font color=#ff8000>概念本体论 Conceptual Ontology</font>'''程序,将真实世界的信息结构化为计算机可理解的数据,如 MARGIE (Schank,1975)、 SAM (Cullingford,1978)、 PAM (Wilensky,1978)、 TaleSpin (Meehan,1976)、 QUALM (Lehnert,1977)、 Politics (Carbonell,1979)和 Plot Units (Lehnert,1981)。与此同时也出现了许多聊天机器人,如 PARRY,Racter 和 Jabberwacky。
    
Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.  Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of [[machine learning]] algorithms for language processing.  This was due to both the steady increase in computational power (see [[Moore's law]]) and the gradual lessening of the dominance of [[Noam Chomsky|Chomskyan]] theories of linguistics (e.g. [[transformational grammar]]), whose theoretical underpinnings discouraged the sort of [[corpus linguistics]] that underlies the machine-learning approach to language processing.<ref>Chomskyan linguistics encourages the investigation of "[[corner case]]s" that stress the limits of its theoretical models (comparable to [[pathological (mathematics)|pathological]] phenomena in mathematics), typically created using [[thought experiment]]s, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in [[corpus linguistics]].  The creation and use of such [[text corpus|corpora]] of real-world data is a fundamental part of machine-learning algorithms for natural language processing.  In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "[[poverty of the stimulus]]" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing.  As a result, the Chomskyan paradigm discouraged the application of such models to language processing.</ref> Some of the earliest-used machine learning algorithms, such as [[decision tree]]s, produced systems of hard if-then rules similar to existing hand-written rules.  
However, [[Part of speech tagging|part-of-speech tagging]] introduced the use of [[hidden Markov models]] to natural language processing, and increasingly, research has focused on [[statistical models]], which make soft, [[probabilistic]] decisions based on attaching [[real-valued]] weights to the features making up the input data. The [[cache language model]]s upon which many [[speech recognition]] systems now rely are examples of such statistical models.  Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
 
直到20世纪80年代,大多数自然语言处理系统仍依赖于复杂的、人工制定的规则。然而从20世纪80年代末开始,随着语言处理'''<font color=#ff8000>机器学习 Machine Learning</font>'''算法的引入,自然语言处理领域掀起了一场革命。这一方面是由于计算能力的稳步增长(参见'''<font color=#ff8000>摩尔定律 Moore's Law</font>'''),另一方面是由于'''<font color=#ff8000>乔姆斯基语言学理论 Chomskyan Theories of Linguistics</font>'''(如'''<font color=#ff8000>转换语法 Transformational Grammar</font>''')主导地位的逐渐削弱。乔姆斯基语言学理论并不认同'''<font color=#ff8000>语料库语言学 Corpus Linguistics</font>''',而语料库语言学正是语言处理机器学习方法的基础。一些最早被使用的机器学习算法,比如'''<font color=#ff8000>决策树 Decision Tree</font>''',产生的是“如果……那么……”(if-then)式的硬判决系统,类似于既有的人工制定规则。然而,对'''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>'''的研究将'''<font color=#ff8000>隐马尔可夫模型 Hidden Markov Models</font>'''引入了自然语言处理,研究重点也越来越多地转向统计模型。统计模型给构成输入数据的各个特征赋上实值权重,从而做出'''<font color=#ff8000>软判决 Soft Decision</font>'''式的'''<font color=#ff8000>概率决策 Probabilistic Decision</font>'''。许多语音识别系统现在所依赖的缓存语言模型就是这种统计模型的例子。这种模型在遇到不熟悉的输入,尤其是包含错误的输入(这在实际数据中非常常见)时,通常更加稳健;当被整合进包含多个子任务的较大系统中时,也能产生更可靠的结果。
    
   --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])"对词性标注的需求使得隐马尔可夫模型被引入到自然语言处理中"一句为意译
 
Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.  These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.  However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
 
许多早期引人瞩目的成功出现在'''<font color=#ff8000>机器翻译 Machine Translation</font>'''领域,尤其是IBM研究院的工作,他们先后开发了越来越复杂的统计模型。这些系统得以利用加拿大议会和欧盟编制的现成多语言文本语料库,因为当地法律要求把所有政府议事记录翻译成相应政府体系的全部官方语言。然而,其他大多数系统都依赖于专门为其所执行任务开发的语料库,这在过去乃至现在一直是这些系统取得成功的主要限制因素。因此,大量研究开始探索如何从有限的数据中更有效地学习的方法。
    
  --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])"这是并且通常一直是这些系统的一个主要限制"为省译
 
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms.  Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data.  Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data.  However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
 
近期研究更多地集中在'''<font color=#ff8000>无监督学习 Unsupervised Learning</font>'''和'''<font color=#ff8000>半监督学习 Semi-supervised Learning</font>'''算法上。这类算法可以从未经人工标注预期答案的数据中学习,或者同时利用已标注和未标注的数据学习。一般而言,这种任务比'''<font color=#ff8000>监督学习 Supervised Learning</font>'''困难得多,并且在给定同等数量的输入数据时,产生的结果通常不那么精确。然而,可用的未标注数据量极其巨大(其中包括整个万维网的内容),只要所用算法的'''<font color=#ff8000>时间复杂度 Time Complexity</font>'''低到足够实用,往往就可以弥补结果精度上的不足。
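半监督学习中一个常见的简单思路是“自训练”(self-training):先用少量标注数据训练初始模型,再把模型对未标注数据中置信度高的预测当作伪标签并入训练集,反复迭代。下面是一个纯 Python 的示意性草图:其中的分类器只是一个按平滑词频打分的玩具模型,函数名、阈值与数据均为虚构示例,并非任何特定库的 API。

```python
from collections import Counter

def train(labeled):
    # 按类别累计词频,得到一个最简单的打分模型
    counts = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def predict(counts, text):
    # 对每个类别按平滑词频打分,返回 (最佳类别, 与最差类别的分数差)
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum((c[w] + 1) / (total + 1) for w in text.split())
    best = max(scores, key=scores.get)
    return best, scores[best] - min(scores.values())

def self_train(labeled, unlabeled, min_margin=0.5, rounds=3):
    # 自训练:反复把模型置信度高(分数差大)的未标注样本并入训练集
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        scored = [(t, predict(model, t)) for t in pool]
        added = [(t, lab) for t, (lab, m) in scored if m >= min_margin]
        if not added:
            break
        labeled += added
        pool = [t for t, (lab, m) in scored if m < min_margin]
    return train(labeled)
```

例如,用两条标注句子和两条未标注句子调用 self_train,模型会先给未标注句子打上伪标签,再用扩充后的数据重新训练;真实系统中的模型和置信度度量当然要复杂得多。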
In the early days, many language-processing systems were designed by hand-coding a set of rules: such as by writing grammars or devising heuristic rules for stemming.  
 
在早期,许多语言处理系统是通过人工编码一组规则来设计的: 例如通过编写语法或设计用于词干提取的'''<font color=#ff8000>启发式 Heuristic</font>'''规则。
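上文提到的“设计用于词干提取的启发式规则”可以用一组后缀剥离规则来示意。下面是一个极简的草图(思路上接近 Porter 词干算法的第一步,但这张规则表纯属示例,远非完整算法):

```python
# 极简的启发式后缀剥离规则表:按顺序尝试,命中第一条即停止
RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "i"),    # ponies -> poni
    ("ing", ""),     # closing -> clos
    ("ed", ""),      # closed -> clos
    ("s", ""),       # books -> book
]

def stem(word):
    for suffix, replacement in RULES:
        # 只在剥离后仍保留至少两个字符时应用规则
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

这种人工规则难免有遗漏与误伤(例如 closing 被截成 clos 而非 close),这也正是后文所述统计方法兴起的动机之一。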
Since the so-called "statistical revolution" in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora (the plural form of corpus, is a set of documents, possibly with human or computer annotations) of typical real-world examples.
 
自20世纪80年代末和90年代中期的“统计革命”以来,许多自然语言处理研究都深度依赖机器学习。机器学习的范式要求通过分析大型语料库(corpora,语料库corpus的复数形式,是一组可能带有人或计算机标注的文档)使用统计学推论自动学习这些规则。
Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. These algorithms take as input a large set of "features" that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of handwritten rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
 
许多不同类型的机器学习算法已被应用于自然语言处理任务。这些算法以从输入数据中生成的大量“特征”作为输入。一些最早被使用的算法,比如'''<font color=#ff8000>决策树 Decision Tree</font>''',产生的是“如果……那么……”(if-then)式的硬判决系统,类似于当时常见的人工制定规则。然而后来人们越来越多地将研究重点放在统计模型上。统计模型给每个输入特征赋上实值权重,从而做出'''<font color=#ff8000>软判决 Soft Decision</font>'''式的'''<font color=#ff8000>概率决策 Probabilistic Decision</font>'''。这种模型的优点是,它们可以表达许多不同候选答案的相对确定性,而不是只给出一个答案;当这种模型作为更大系统的一个模块时,产生的结果也更可靠。
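“给每个输入特征赋实值权重、输出概率性软判决”的思路,可以用一个极简的逻辑斯蒂(logistic)打分函数来示意。特征名和权重都是虚构的示例,仅用来对比硬规则的是/否输出与统计模型的置信概率输出:

```python
import math

def soft_decision(features, weights, bias=0.0):
    # 特征加权求和后经 sigmoid 压缩为 (0, 1) 区间内的概率
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# 虚构示例:判断一个词是否为动词的几个特征权重
weights = {"ends_with_ing": 2.0, "preceded_by_to": 1.5, "is_capitalized": -1.0}

# 硬规则只能回答“是/否”;软判决给出可供下游模块组合的概率
p = soft_decision({"ends_with_ing": 1.0}, weights)  # sigmoid(2.0) ≈ 0.88
```

下游模块拿到的是一个概率而非二值结论,因此可以把多个这样的判决按置信度加权组合,这正是上文所说“整合进更大系统时结果更可靠”的原因。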
 
 
      
Systems based on machine-learning algorithms have many advantages over hand-produced rules:
 
基于机器学习算法的系统比人工制定的规则有许多优点:
    
*The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.
 
==主要评估及任务(Major evaluations and tasks)==
 
The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks.
    
以下列出了自然语言处理中一些最常被研究的任务。其中一些任务具有直接的实际应用,而其他任务则通常作为子任务,用于帮助解决更大的任务。
  Part-of-speech tagging: Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning.
 
'''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>''': 给定一个句子,确定每个词的词性(part of speech, POS)。许多单词,尤其是常见的单词,可以拥有多种词性。例如,“book”可以是名词(书本)(“ the book on the table”)或动词(预订)(“to book a flight”); “set”可以是名词、动词或形容词; “out”至少有五种不同的词性。有些语言比其他语言有更多的这种模糊性。像英语这样几乎没有屈折形态的语言尤其容易出现这种歧义。汉语是一种在动词化过程中会变音调的语言,所以容易出现歧义现象。这样的词形变化不容易通过正字法中使用的实体来传达预期的意思。
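词性标注最简单的基线是“最常见词性”标注器:对每个词,输出它在标注语料中出现最多的词性。下面用一个极小的虚构语料示意“book”的歧义,以及这种不看上下文的基线为何会在 “to book a flight” 上出错:

```python
from collections import Counter, defaultdict

# 极小的虚构标注语料:(词, 词性) 对
corpus = [
    ("the", "DET"), ("book", "NOUN"), ("on", "ADP"), ("the", "DET"), ("table", "NOUN"),
    ("i", "PRON"), ("read", "VERB"), ("a", "DET"), ("book", "NOUN"),
    ("to", "PART"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN"),
]

def most_frequent_tag_tagger(corpus):
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    # 每个词取出现次数最多的词性;未登录词一律猜成名词
    table = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return lambda word: table.get(word, "NOUN")

tag = most_frequent_tag_tagger(corpus)
# “book”在该语料中名词出现 2 次、动词 1 次,于是无论上下文如何都被标成 NOUN,
# 在 “to book a flight” 中即是错的;这正是需要隐马尔可夫模型等上下文模型的原因
```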
    
  --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])“‘out’至少有五种不同的词性”一句为意译
 
  Sentence breaking (also known as "sentence boundary disambiguation"): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g., marking abbreviations).
 
'''<font color=#ff8000>断句 Sentence breaking</font>'''(也被称为'''<font color=#ff8000>句子边界消歧 Sentence Boundary Disambiguation</font>''') : 给定一段文本,找到句子边界。句子的边界通常用句号或其他标点符号来标记,但是这些标点符号也会被用于其他目的(例如,标记缩写)。
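断句的一个常见启发式做法是:在句末标点后跟空白、且下一个词首字母大写处切分,并用缩写表排除 “Dr.” 之类的假边界。以下正则草图只收录了几个英文缩写作为示例,实际系统需要完备得多的缩写表:

```python
import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}  # 仅为示例,远非完备

def split_sentences(text):
    # 候选句界:.!? 之后是空白,且下一个字符为大写字母
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    merged = []
    for part in parts:
        # 若上一段结尾的词是已知缩写,说明刚才切错了,合并回去
        if merged and merged[-1].split()[-1].lower() in ABBREVIATIONS:
            merged[-1] += " " + part
        else:
            merged.append(part)
    return merged
```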
    
; [[Stemming]]: The process of reducing inflected (or sometimes derived) words to their root form. (''e.g.'', "close" will be the root for "closed", "closing", "close", "closer" etc.).
 
  Word segmentation: Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language. Sometimes this process is also used in cases like bag of words (BOW) creation in data mining.
 
'''<font color=#ff8000>分词 Word Segmentation</font>''': 把一段连续的文本分割成单独的词语。对于像英语这样的语言来说,这相当简单,因为单词通常由空格分隔。然而,汉语、日语和泰语等书面语言并没有这样的词语边界标记;在这些语言中,文本分词是一项重要的任务,要求掌握该语言中词汇和词形的知识。有时这个过程也被用于数据挖掘中创建<font color=#ff8000>词包</font>(bag of words,BOW)等场景。
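对于汉语这类没有空格分隔的文字,一个经典的启发式分词算法是“正向最大匹配”:从当前位置起,在词典中贪心匹配最长的词,匹配不到时退为单字。下面是一个示意性草图,词典内容为虚构示例:

```python
def max_match(text, dictionary, max_len=6):
    # 正向最大匹配:每一步优先尝试长度不超过 max_len 的最长词典词
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # 词典中存在该片段,或已退化为单字,则切出一个词
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

dictionary = {"自然", "语言", "自然语言", "自然语言处理", "处理", "研究"}
```

例如 max_match("自然语言处理研究", dictionary) 会优先切出最长的“自然语言处理”,而不是“自然”+“语言”+“处理”;真实系统通常在此基础上结合统计模型消解歧义。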
    
; [[Terminology extraction]]: The goal of terminology extraction is to automatically extract relevant terms from a given corpus.
 
  Machine translation: Automatically translate text from one human language to another.  This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) to solve properly.
 
'''<font color=#ff8000>机器翻译 Machine Translation</font>''': 将文本从一种语言自动翻译成另一种语言。这是最困难的问题之一,属于被通俗地称为“人工智能完备”(AI-complete)的一类问题,即需要人类所拥有的所有不同类型的知识(语法、语义、对现实世界事实的认知等)才能妥善解决。
    
; [[Named entity recognition]] (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although [[capitalization]] can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient.  For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.  Furthermore, many other languages in non-Western scripts (e.g. [[Chinese language|Chinese]] or [[Arabic language|Arabic]]) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, [[German language|German]] capitalizes all [[noun]]s, regardless of whether they are names, and [[French language|French]] and [[Spanish language|Spanish]] do not capitalize names that serve as [[adjective]]s.
 
'''<font color=#ff8000>命名实体识别 Named Entity Recognition, NER</font>''': 给定一个文本流,确定文本中的哪些词能映射到专有名称,如人或地点,以及这些名称的类型(例如:人名、地点名、组织名)。虽然大写有助于识别英语等语言中的命名实体,但这种信息无法用于确定命名实体的类型,而且在多数情况下是不准确、不充分的。比如,句子的第一个字母也是大写的,而且命名实体通常跨越几个单词,其中只有一部分是大写的。此外,许多使用非西方文字的语言(如汉语或阿拉伯语)根本没有大写,即使有大写的语言也不一定用它来区分名字。例如,德语中所有名词都大写,无论它们是否为专名;而法语和西班牙语中作形容词用的名称不大写。
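上文提到仅靠大写不足以识别命名实体。下面是一个假设性的极简示意(并非真实的命名实体识别方法,函数名为本文虚构),演示仅凭首字母大写的启发式为何会误判句首单词:

```python
def naive_ner(tokens):
    """仅凭首字母大写挑出候选"命名实体"——一个刻意简化的启发式。"""
    return [t for t in tokens if t[:1].isupper()]

tokens = "The capital of Canada is Ottawa".split()
# "The" 只是句首大写,却被当作候选实体,说明仅靠大写并不充分
print(naive_ner(tokens))  # ['The', 'Canada', 'Ottawa']
```

真实系统通常还要结合上下文、词典和统计模型,才能区分句首大写与真正的专名。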
    
; [[Natural language generation]]: Convert information from computer databases or semantic intents into readable human language.

'''<font color=#ff8000>自然语言生成 Natural Language Generation</font>''': 将计算机数据库中的信息或语义意图转换成可读的人类语言。
; [[Natural language understanding]]: Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural language expression which usually takes the form of organized notations of natural language concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural language semantics without confusions with implicit assumptions such as closed-world assumption (CWA) vs. open-world assumption, or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization.
 
'''<font color=#ff8000>自然语言理解 Natural Language Understanding</font>''': 将文本块转换成更加正式的表示形式,比如更易于计算机程序处理的'''<font color=#ff8000>一阶逻辑结构 First-order Logic Structure</font>'''。自然语言理解需要从一个自然语言表达可能派生出的多种语义中识别出预期的语义,而自然语言表达通常表现为自然语言概念的有组织的符号。引入和创建语言元模型和本体是有效但经验化的解决方案。对自然语言语义的<font color=#32cd32>形式化</font>应当清楚明了,而不与隐含的假设相混淆,如封闭世界假设与开放世界假设、主观的是/否与客观的真/假。
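作为示意,下面是一个玩具式的语义解析草图(假设性示例,函数名与"主语 动词 宾语"的句式限制均为本文虚构),把英文简单句映射为一阶逻辑风格的谓词表示:

```python
def svo_to_logic(sentence):
    """把 "主语 动词 宾语" 形式的简单句转换成 谓词(主语, 宾语) 的字符串表示。"""
    subj, verb, obj = sentence.rstrip('.').split()
    return f"{verb}({subj}, {obj})"

print(svo_to_logic("John loves Mary."))  # loves(John, Mary)
```

真实的自然语言理解系统要处理歧义、指代和复杂句法,远比这种逐词映射困难。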
    
; [[Optical character recognition]] (OCR): Given an image representing printed text, determine the corresponding text.
 
'''<font color=#ff8000>光学字符识别 Optical Character Recognition, OCR</font>''': 给定一幅印有文字的图像,识别出相应的文本。
    
; [[Question answering]]: Given a human-language question, determine its answer.  Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). Recent works have looked at even more complex questions.<ref>{{cite journal |title=Versatile question answering systems: seeing in synthesis |last=Mittal |journal= International Journal of Intelligent Information and Database Systems|volume=5 |issue=2 |pages=119–142 |year=2011 |doi=10.1504/IJIIDS.2011.038968 |url=https://hal.archives-ouvertes.fr/hal-01104648/file/Mittal_VersatileQA_IJIIDS.pdf }}</ref>
 
'''<font color=#ff8000>问答 Question Answering</font>''': 给出一个用人类语言表述的问题,确定它的答案。典型的问题有一个明确的正确答案(例如“加拿大的首都是哪里?”),但有时也需要考虑开放式的问题(比如“生命的意义是什么?”)。最近的一些工作研究了更复杂的问题。
    
; [[Textual entailment|Recognizing Textual entailment]]: Given two text fragments, determine if one being true entails the other, entails the other's negation, or allows the other to be either true or false.<ref name=rte:11>PASCAL Recognizing Textual Entailment Challenge (RTE-7) https://tac.nist.gov//2011/RTE/</ref>

'''<font color=#ff8000>文本蕴涵识别 Recognizing Textual Entailment</font>''': 给定两个文本片段,判断其中一个为真时能否推出另一个为真、推出另一个的否定,还是对另一个的真假不作限制。
; [[Word sense disambiguation]]: Many words have more than one meaning; we have to select the meaning which makes the most sense in context.  For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or an online resource such as [[WordNet]].
 
'''<font color=#ff8000>词义消歧 Word Sense Disambiguation</font>''': 许多词有多个意思,需要从中选出最符合上下文的那个意思。对于这个问题,我们通常会得到一个单词及其相关词义的列表,这些词义来自字典或 WordNet 等在线资源。
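下面用一个简化版的 Lesk 算法勾勒这一思路:统计每个词义的释义与上下文的重叠词数,取重叠最多的词义。其中的迷你词义表是为演示而虚构的,真实系统会使用 WordNet 等词义资源:

```python
def simplified_lesk(word, context, sense_inventory):
    """从 sense_inventory[word] 的各候选词义中,选出释义与上下文重叠词数最多的一个。"""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_inventory[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# 为演示虚构的迷你词义表:bank 的两个义项及其英文释义
inventory = {
    "bank": {
        "金融机构": "an institution that accepts deposits and lends money",
        "河岸": "sloping land beside a river or lake",
    }
}
# 上下文中出现了 river,与"河岸"的释义重叠,故选出该义项
print(simplified_lesk("bank", "he sat on the bank of the river", inventory))  # 河岸
```

这种做法对释义用词非常敏感,实际系统通常还会引入词形还原、扩展释义和统计模型。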
; [[Speech recognition]]: Given a sound clip of a person or people speaking, determine the textual representation of the speech.  This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above).  In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). In most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, given that words in the same language are spoken by people with different accents, the speech recognition software must be able to recognize the wide variety of input as being identical to each other in terms of its textual equivalent.
 
'''<font color=#ff8000>语音识别 Speech Recognition</font>''': 给定一个或多个人说话的声音片段,确定语音对应的文本内容。这是文本转语音的反过程,也是被通俗地称为“人工智能完备”(见上文)的极其困难的问题之一。自然语音中连续的单词之间几乎没有停顿,因此语音分割是语音识别的一个必要的子任务(见下文)。在大多数口语中,表示连续字母的声音在称为“协同发音”的过程中相互融合,因此将模拟信号转换为离散字符会是一个非常困难的过程。此外,由于说同一个词时不同人的口音不同,语音识别软件必须能够把文本上相同的各种不同输入识别为一致。
    
; [[Speech segmentation]]: Given a sound clip of a person or people speaking, separate it into words.  A subtask of [[speech recognition]] and typically grouped with it.
 
'''<font color=#ff8000>语音分割 Speech Segmentation</font>''': 给定一个或多个人说话的声音片段,将其切分为单词。这是语音识别的一个子任务,通常与其一起出现。
    
; [[Text-to-speech]]:Given a text, transform those units and produce a spoken representation. Text-to-speech can be used to aid the visually impaired.<ref>{{Citation|last=Yi|first=Chucai|title=Assistive Text Reading from Complex Background for Blind Persons|date=2012|work=Camera-Based Document Analysis and Recognition|pages=15–28|publisher=Springer Berlin Heidelberg|language=en|doi=10.1007/978-3-642-29364-1_2|isbn=9783642293634|last2=Tian|first2=Yingli|citeseerx=10.1.1.668.869}}</ref>
 
'''<font color=#ff8000>语音合成 Text-to-Speech</font>''': 给定一个文本,把这些文字单元转换为口语表达。语音合成可以用来帮助视力受损的人。
The first published work by an artificial intelligence, ''1 the Road'', marketed as a novel, was published in 2018 and contains sixty million words.
 
第一部由人工智能创作的作品于2018年出版,名为《路》(1 the Road) ,以小说的形式发售,包含6000万字。
*[[计算机辅助审查 Computer-assisted reviewing]]
 
*[[受控自然语言 Controlled natural language]]
    
*[[深度学习 Deep learning]]
 
*[[深度语言处理 Deep linguistic processing]]
    
*[[分布语义学 Distributional semantics]]
 