第67行: |
第67行: |
| | | |
| | | |
− | 到20世纪80年代,大多数自然语言处理系统仍都依赖于复杂的人制定的规则。然而从20世纪80年代末开始,随着语言处理'''<font color=#ff8000>机器学习 Machine Learning</font>'''算法的引入,自然语言处理领域掀起了一场革命。这是由于计算能力的稳步增长(参见'''<font color=#ff8000>摩尔定律 Moore's Law</font>''')和'''<font color=#ff8000>乔姆斯基语言学理论 Chomskyan Theories of Linguistics</font>的'''主导地位逐渐削弱(如'''<font color=#ff8000>转换语法 Transformational Grammar</font>''')。乔姆斯基语言学理论并不认同语料库语言学,而'''<font color=#ff8000>语料库语言学 Corpus Linguistic</font>'''却是语言处理机器学习方法的基础。一些最早被使用的机器学习算法,比如'''<font color=#ff8000>决策树Decision Tree</font>''',产生了使用“如果...那么..."(if-then)硬判决的系统,这种规则类似于之前人类制定的规则。然而,对'''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>'''的需求---[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])意译使得'''<font color=#ff8000>隐马尔可夫模型 Hidden Markov Models </font>'''被引入到自然语言处理中,并且人们越来越多地将研究重点放在了统计模型上。统计模型将输入数据的各个特征都赋上实值权重,从而做出'''<font color=#ff8000>软判决 Soft Decision</font>'''和'''<font color=#ff8000>概率决策 Probabilistic Decision</font>'''。许多语音识别系统现在所依赖的缓存语言模型就是这种统计模型的例子。这种模型在给定不熟悉的输入,特别是包含错误的输入(在实际数据中这是非常常见的)时,通常更加可靠,并且将多个子任务整合到较大系统中时,能产生更可靠的结果。 | + | 到20世纪80年代,大多数自然语言处理系统仍都依赖于复杂的人工制定的规则。然而从20世纪80年代末开始,随着语言处理'''<font color=#ff8000>机器学习 Machine Learning</font>'''算法的引入,自然语言处理领域掀起了一场革命。这得益于计算能力的稳步增长(参见'''<font color=#ff8000>摩尔定律 Moore's Law</font>''')以及'''<font color=#ff8000>乔姆斯基语言学理论 Chomskyan Theories of Linguistics</font>'''(如'''<font color=#ff8000>转换语法 Transformational Grammar</font>''')主导地位的逐渐削弱。乔姆斯基语言学理论并不认同'''<font color=#ff8000>语料库语言学 Corpus Linguistics</font>''',而后者正是语言处理机器学习方法的基础。一些最早被使用的机器学习算法,比如'''<font color=#ff8000>决策树 Decision Tree</font>''',产生了使用“如果……那么……”(if-then)硬判决的系统,这种规则类似于之前人工制定的规则。然而,对'''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>'''的需求使得'''<font color=#ff8000>隐马尔可夫模型 Hidden Markov Models</font>'''被引入到自然语言处理中,人们也越来越多地将研究重点放在统计模型上。统计模型给输入数据的各个特征赋以实值权重,从而做出'''<font color=#ff8000>软判决 Soft Decision</font>'''和'''<font color=#ff8000>概率决策 Probabilistic Decision</font>'''。许多语音识别系统现在所依赖的缓存语言模型就是这种统计模型的例子。这类模型在遇到不熟悉的输入、特别是包含错误的输入(这在实际数据中非常常见)时通常更加可靠,并且在将多个子任务整合进更大的系统时也能产生更可靠的结果。 |
| | | |
| + | --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])"对词性标注的需求使得隐马尔可夫模型被引入到自然语言处理中"一句为意译 |
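上文提到隐马尔可夫模型因词性标注的需求而被引入自然语言处理,统计模型通过实值权重做出概率性的“软判决”。下面给出一个极简的 HMM 词性标注示意(维特比解码)。注意:其中的转移概率和发射概率均为为演示而虚构的数值,并非从真实语料估计;这只是该思路的一个草图,而非任何实际系统的实现。

```python
# 玩具级 HMM 词性标注:用维特比算法求联合概率最大的词性序列。
# 所有概率均为虚构的示例数值。
from math import log

TAGS = ["NOUN", "VERB", "DET"]

# 转移概率 P(当前词性 | 前一词性),"<s>" 表示句首状态
TRANS = {
    ("<s>", "VERB"): 0.4, ("<s>", "DET"): 0.5, ("<s>", "NOUN"): 0.1,
    ("VERB", "DET"): 0.6, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.3,
}

# 发射概率 P(词 | 词性):"book" 在名词和动词之间存在歧义
EMIT = {
    ("book", "NOUN"): 0.3, ("book", "VERB"): 0.2,
    ("a", "DET"): 0.9,
    ("flight", "NOUN"): 0.2,
}

def viterbi(words):
    """返回联合概率最大的词性序列(概率性的软判决,而非 if-then 硬规则)。"""
    best = {"<s>": (0.0, [])}  # 词性 -> (最优路径的对数概率, 路径)
    for w in words:
        new = {}
        for tag in TAGS:
            e = EMIT.get((w, tag))
            if e is None:
                continue
            cands = [
                (lp + log(t) + log(e), path + [tag])
                for prev, (lp, path) in best.items()
                if (t := TRANS.get((prev, tag)))
            ]
            if cands:
                new[tag] = max(cands)
        best = new
    return max(best.values())[1]

# 同一个 "book":句首被解码为动词,冠词之后被解码为名词
print(viterbi("book a flight".split()))  # ['VERB', 'DET', 'NOUN']
print(viterbi("a book".split()))         # ['DET', 'NOUN']
```

可以看到,歧义词的标注取决于上下文的整体概率,而不是孤立的规则,这正是统计方法相对于早期硬判决系统的差别所在。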
| | | |
| Many of the notable early successes occurred in the field of [[machine translation]], due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual [[text corpus|textual corpora]] that had been produced by the [[Parliament of Canada]] and the [[European Union]] as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. | | Many of the notable early successes occurred in the field of [[machine translation]], due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual [[text corpus|textual corpora]] that had been produced by the [[Parliament of Canada]] and the [[European Union]] as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. |
第74行: |
第75行: |
| Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. | | Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. |
| | | |
− | 许多早期瞩目的成功出现在'''<font color=#ff8000>机器翻译 Machine Translation</font>'''领域,特别是IBM研究所的工作,他们先后开发了更复杂的统计模型。为了实现将所有行政诉讼翻译成相应政府系统的官方语言的法律要求,加拿大议会和欧盟编制了多语言文本语料库,IBM开发的一些系统能够利用这些语料库。然而大多数其他系统都依赖于专门为这些系统所执行任务开发的语料库,这是并且通常一直是这些系统的一个主要限制---[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])省译。因此,大量的研究开始探寻如何利用有限的数据更有效地学习的方法。 | + | 许多早期瞩目的成功出现在'''<font color=#ff8000>机器翻译 Machine Translation</font>'''领域,特别是IBM研究所的工作,他们相继开发了越来越复杂的统计模型。由于法律要求将所有政府议事记录翻译成相应政体的全部官方语言,加拿大议会和欧盟编制了多语言文本语料库,IBM开发的这些系统得以利用这些现成的语料库。然而,大多数其他系统依赖于专门为其所执行任务而开发的语料库,这在过去是、并且往往至今仍是这些系统取得成功的一个主要限制。因此,大量研究转向了如何利用有限数据更有效地进行学习的方法。 |
| | | |
| + | --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])"这是并且通常一直是这些系统的一个主要限制"为省译 |
| | | |
| Recent research has increasingly focused on [[unsupervised learning|unsupervised]] and [[semi-supervised learning]] algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than [[supervised learning]], and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the [[World Wide Web]]), which can often make up for the inferior results if the algorithm used has a low enough [[time complexity]] to be practical. | | Recent research has increasingly focused on [[unsupervised learning|unsupervised]] and [[semi-supervised learning]] algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than [[supervised learning]], and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the [[World Wide Web]]), which can often make up for the inferior results if the algorithm used has a low enough [[time complexity]] to be practical. |
第176行: |
第178行: |
| Part-of-speech tagging: Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning. | | Part-of-speech tagging: Given a sentence, determine the part of speech (POS) for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning. |
| | | |
− | '''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>''': 给定一个句子,确定每个词的词性(part of speech, POS)。许多单词,尤其是常见的单词,可以拥有多种词性。例如,“book”可以是名词(书本)(“ the book on the table”)或动词(预订)(“to book a flight”) ; “set”可以是名词、动词或形容词; “out”至少有五种不同的词性---[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])意译。有些语言比其他语言有更多的这种模糊性。像英语这样几乎没有屈折形态的语言尤其容易出现这种歧义。汉语是一种在动词化过程中会变音调的语言,所以容易出现歧义现象。这样的词形变化不容易通过正字法中使用的实体来传达预期的意思。 | + | '''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>''': 给定一个句子,确定其中每个词的词性(part of speech, POS)。许多单词,尤其是常见单词,可以拥有多种词性。例如,“book”可以是名词(书本)(“the book on the table”),也可以是动词(预订)(“to book a flight”);“set”可以是名词、动词或形容词;“out”至少有五种不同的词性。有些语言比其他语言有更多这样的歧义。像英语这样几乎没有屈折形态的语言尤其容易出现这种歧义;汉语在动词化过程中会发生声调变化,因此也容易出现歧义,而这类屈折变化并不容易通过正字法中的书写符号来传达所要表达的意思。 |
| + | |
| + | --[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])“‘out’至少有五种不同的词性”一句为意译 |
| | | |
| ; [[Parsing]]: Determine the [[parse tree]] (grammatical analysis) of a given sentence. The [[grammar]] for [[natural language|natural languages]] is [[ambiguous]] and typical sentences have multiple possible analyses: perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: ''dependency parsing'' and ''constituency parsing''. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a [[probabilistic context-free grammar]] (PCFG) (see also ''[[stochastic grammar]]''). | | ; [[Parsing]]: Determine the [[parse tree]] (grammatical analysis) of a given sentence. The [[grammar]] for [[natural language|natural languages]] is [[ambiguous]] and typical sentences have multiple possible analyses: perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human). There are two primary types of parsing: ''dependency parsing'' and ''constituency parsing''. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a [[probabilistic context-free grammar]] (PCFG) (see also ''[[stochastic grammar]]''). |
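上段指出,自然语言的文法存在歧义,一个典型句子可能有成百上千种分析。下面用一个玩具级的 CYK 动态规划示意这一点:文法为乔姆斯基范式(CNF)的最小虚构规则集,仅用于统计分析数;真实系统会进一步用概率上下文无关文法(PCFG)给各个分析打分排序,这里为简明起见只计数、不打分。

```python
# 玩具级 CYK:统计 CNF 文法下一个句子的全部可能句法分析数。
# 文法规则与词表均为为演示而虚构的最小集合。
from collections import defaultdict

# 二元规则:父节点 -> (左子节点, 右子节点)
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # 介词短语修饰动词短语
    ("NP", "NP", "PP"),   # 介词短语修饰名词短语
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# 词汇规则:词 -> 可能的成分类别
LEX = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words):
    """chart[i][j][A] = 非终结符 A 推导出 words[i:j] 的分析数。"""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for cat in LEX.get(w, []):
            chart[i][i + 1][cat] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # 枚举切分点
                for parent, left, right in BINARY:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n]["S"]

# 经典的介词短语挂靠歧义:"with the telescope" 既可修饰 "saw" 也可修饰 "the man"
print(count_parses("I saw the man with the telescope".split()))  # 2
print(count_parses("I saw the man".split()))                     # 1
```

规则集稍一扩大,分析数便会组合爆炸,这正是正文所说“大多数分析在人看来完全不合理”、需要依存分析或 PCFG 等手段排歧的原因。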
第349行: |
第353行: |
| *''[[1 the Road]]'' | | *''[[1 the Road]]'' |
| | | |
− | *[[Automated essay scoring]] | + | *[[自动作文评分 Automated essay scoring]] |
| | | |
− | *[[Biomedical text mining]] | + | *[[生物医学文本挖掘 Biomedical text mining]] |
| | | |
− | *[[Compound term processing]] | + | *[[复合词处理 Compound term processing]] |
| | | |
− | *[[Computational linguistics]] | + | *[[计算语言学 Computational linguistics]] |
| | | |
− | *[[Computer-assisted reviewing]] | + | *[[计算机辅助审查 Computer-assisted reviewing]] |
| | | |
− | *[[Controlled natural language]] | + | *[[受限自然语言 Controlled natural language]] |
| | | |
− | *[[Deep learning]] | + | *[[深度学习 Deep learning]] |
| | | |
− | *[[Deep linguistic processing]] | + | *[[深层语言处理 Deep linguistic processing]] |
| | | |
− | *[[Distributional semantics]] | + | *[[分布语义学 Distributional semantics]] |
| | | |
− | *[[Foreign language reading aid]] | + | *[[外语阅读助手 Foreign language reading aid]] |
| | | |
− | *[[Foreign language writing aid]] | + | *[[外语写作助手 Foreign language writing aid]] |
| | | |
− | *[[Information extraction]] | + | *[[信息抽取 Information extraction]] |
| | | |
− | *[[Information retrieval]] | + | *[[信息检索 Information retrieval]] |
| | | |
− | *[[Language and Communication Technologies]] | + | *[[语言交流技术 Language and Communication Technologies]] |
| | | |
− | *[[Language technology]] | + | *[[语言技术 Language technology]] |
| | | |
− | *[[Latent semantic indexing]] | + | *[[潜在语义索引 Latent semantic indexing]] |
| | | |
− | *[[Native-language identification]] | + | *[[母语识别 Native-language identification]] |
| | | |
− | *[[Natural language programming]] | + | *[[自然语言编程 Natural language programming]] |
| | | |
− | *[[Natural language user interface|Natural language search]] | + | *[[自然语言用户界面 Natural language user interface|自然语言搜索 Natural language search]]
| | | |
− | *[[Query expansion]] | + | *[[查询扩展 Query expansion]]
| | | |
− | *[[Reification (linguistics)]] | + | *[[具体化 Reification (语言学 linguistics)]]
| | | |
− | *[[Speech processing]] | + | *[[语音处理 Speech processing]] |
| | | |
− | *[[Spoken dialogue system]] | + | *[[语音对话系统 Spoken dialogue system]] |
| | | |
− | *[[Text-proofing]] | + | *[[文字校对 Text-proofing]] |
| | | |
− | *[[Text simplification]] | + | *[[文本简化 Text simplification]] |
| | | |
− | *[[Transformer (machine learning model)]] | + | *[[Transformer模型 Transformer (机器学习模型 machine learning model)]]
| | | |
− | *[[Truecasing]] | + | *[[真实大小写处理 Truecasing]] |
| | | |
− | *[[Question answering]] | + | *[[问答 Question answering]] |
| | | |
| *[[Word2vec]] | | *[[Word2vec]] |
− | --[[用户:趣木木|趣木木]]([[用户讨论:趣木木|讨论]])see also部分需要进行翻译
| + | |
| | | |
| {{div col end}} | | {{div col end}} |