第68行: |
第68行: |
| | | |
| | | |
− | 到20世纪80年代,大多数自然语言处理系统仍都依赖于复杂的人制定的规则。然而从20世纪80年代末开始,随着语言处理'''<font color=#ff8000>机器学习 Machine Learning</font>'''算法的引入,自然语言处理领域掀起了一场革命。这是由于计算能力的稳步增长(参见'''<font color=#ff8000>摩尔定律 Moore's Law</font>''')和'''<font color=#ff8000>乔姆斯基语言学理论 Chomskyan Theories of Linguistics</font>的'''主导地位逐渐削弱(如'''<font color=#ff8000>转换语法 Transformational Grammar</font>''')。乔姆斯基语言学理论并不认同语料库语言学,而'''<font color=#ff8000>语料库语言学 Corpus Linguistic</font>'''却是语言处理机器学习方法的基础。一些最早被使用的机器学习算法,比如'''<font color=#ff8000>决策树Decision Tree</font>''',产生了使用“如果...那么..."(if-then)硬判决的系统,这种规则类似于之前人类制定的规则。然而,对<font color=#ff8000>词性标注 Part-of-speech Tagging</font>的需求---[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])意译使得'''<font color=#ff8000>隐马尔可夫模型 Hidden Markov Models </font>'''被引入到自然语言处理中,并且人们越来越多地将研究重点放在了统计模型上。统计模型将输入数据的各个特征都赋上实值权重,从而做出'''<font color=#ff8000>软判决 Soft Decision</font>'''和'''<font color=#ff8000>概率决策 Probabilistic Decision</font>'''。许多语音识别系统现在所依赖的缓存语言模型就是这种统计模型的例子。这种模型在给定不熟悉的输入,特别是包含错误的输入(在实际数据中这是非常常见的)时,通常更加可靠,并且将多个子任务整合到较大系统中时,能产生更可靠的结果。 | + | 到20世纪80年代,大多数自然语言处理系统仍都依赖于复杂的人制定的规则。然而从20世纪80年代末开始,随着语言处理'''<font color=#ff8000>机器学习 Machine Learning</font>'''算法的引入,自然语言处理领域掀起了一场革命。这是由于计算能力的稳步增长(参见'''<font color=#ff8000>摩尔定律 Moore's Law</font>''')和'''<font color=#ff8000>乔姆斯基语言学理论 Chomskyan Theories of Linguistics</font>的'''主导地位逐渐削弱(如'''<font color=#ff8000>转换语法 Transformational Grammar</font>''')。乔姆斯基语言学理论并不认同语料库语言学,而'''<font color=#ff8000>语料库语言学 Corpus Linguistic</font>'''却是语言处理机器学习方法的基础。一些最早被使用的机器学习算法,比如'''<font color=#ff8000>决策树Decision Tree</font>''',产生了使用“如果...那么..."(if-then)硬判决的系统,这种规则类似于之前人类制定的规则。然而,对'''<font color=#ff8000>词性标注 Part-of-speech Tagging</font>'''的需求---[[用户:Thingamabob|Thingamabob]]([[用户讨论:Thingamabob|讨论]])意译使得'''<font color=#ff8000>隐马尔可夫模型 Hidden Markov Models </font>'''被引入到自然语言处理中,并且人们越来越多地将研究重点放在了统计模型上。统计模型将输入数据的各个特征都赋上实值权重,从而做出'''<font color=#ff8000>软判决 Soft Decision</font>'''和'''<font color=#ff8000>概率决策 Probabilistic 
Decision</font>'''。许多语音识别系统现在所依赖的缓存语言模型就是这种统计模型的例子。这种模型在给定不熟悉的输入,特别是包含错误的输入(在实际数据中这是非常常见的)时,通常更加可靠,并且将多个子任务整合到较大系统中时,能产生更可靠的结果。 |
| | | |
| | | |
第97行: |
第97行: |
| In the early days, many language-processing systems were designed by hand-coding a set of rules: such as by writing grammars or devising heuristic rules for stemming. | | In the early days, many language-processing systems were designed by hand-coding a set of rules: such as by writing grammars or devising heuristic rules for stemming. |
| | | |
− | 在早期,许多语言处理系统是通过人工编码一组规则来设计的: 例如通过编写语法或设计<font color=#ff8000>启发式</font>规则来提取词干。 | + | 在早期,许多语言处理系统是通过人工编码一组规则来设计的: 例如通过编写语法或设计'''<font color=#ff8000>启发式 Heuristic</font>'''规则来提取词干。 |
| | | |
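As an illustration of the hand-coded heuristic stemming rules described above, here is a minimal Python sketch. The rule table `SUFFIX_RULES`, the `MIN_STEM_LENGTH` guard, and the function name `heuristic_stem` are hypothetical choices made for this example; they are not taken from any particular stemmer.

```python
# Hypothetical rule table: suffixes are tried in order and the first match wins.
SUFFIX_RULES = ["ing", "ed", "es", "s"]

# Assumed guard: never strip a suffix if fewer than this many characters remain.
MIN_STEM_LENGTH = 3


def heuristic_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    lowered = word.lower()
    for suffix in SUFFIX_RULES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_STEM_LENGTH:
            return lowered[: len(lowered) - len(suffix)]
    return lowered


if __name__ == "__main__":
    for form in ["open", "opens", "opened", "opening", "running"]:
        print(form, "->", heuristic_stem(form))
```

The four forms of "open" all reduce to the same stem, but "running" becomes "runn". Each such exception needs another hand-written rule, which is the maintenance cost that motivated the later statistical approaches.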
| | | |
第171行: |
第171行: |
| Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g., "open, opens, opened, opening") as separate words. In languages such as Turkish or Meitei, a highly agglutinated Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms. | | Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g., "open, opens, opened, opening") as separate words. In languages such as Turkish or Meitei, a highly agglutinated Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms. |
| | | |
− | '''<font color=#ff8000>语素切分 Morphological Segmentation</font>''': 将单词分成独立的'''<font color=#ff8000>语素 Morpheme</font>''',并确定语素的类别。这项任务的难度很大程度上取决于所考虑的语言的形态(即词的结构)的复杂性。英语的形态相当简单,特别是<font color=#ff8000>屈折形态 Inflectional Morphology</font>,因此通常可以完全忽略这个任务,而简单地将一个单词的所有可能形式(例如,"open,opens,opened,opening")作为单独的单词。然而,在诸如土耳其语或曼尼普尔语这样的语言中,这种方法是不可取的,因为每个词典条目都有成千上万种可能的词形。 | + | '''<font color=#ff8000>语素切分 Morphological Segmentation</font>''': 将单词分成独立的'''<font color=#ff8000>语素 Morpheme</font>''',并确定语素的类别。这项任务的难度很大程度上取决于所考虑的语言的形态(即词的结构)的复杂性。英语的形态相当简单,特别是'''<font color=#ff8000>屈折形态 Inflectional Morphology</font>''',因此通常可以完全忽略这个任务,而简单地将一个单词的所有可能形式(例如,"open,opens,opened,opening")作为单独的单词。然而,在诸如土耳其语或曼尼普尔语这样的语言中,这种方法是不可取的,因为每个词典条目都有成千上万种可能的词形。 |
| | | |
| ; [[Part-of-speech tagging]]: Given a sentence, determine the [[part of speech]] (POS) for each word. Many words, especially common ones, can serve as multiple [[parts of speech]]. For example, "book" can be a [[noun]] ("the book on the table") or [[verb]] ("to book a flight"); "set" can be a [[noun]], [[verb]] or [[adjective]]; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others.{{dubious|date=June 2018}} Languages with little [[inflectional morphology]], such as [[English language|English]], are particularly prone to such ambiguity. [[Chinese language|Chinese]] is prone to such ambiguity because it is a [[tonal language]] during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning. | | ; [[Part-of-speech tagging]]: Given a sentence, determine the [[part of speech]] (POS) for each word. Many words, especially common ones, can serve as multiple [[parts of speech]]. For example, "book" can be a [[noun]] ("the book on the table") or [[verb]] ("to book a flight"); "set" can be a [[noun]], [[verb]] or [[adjective]]; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others.{{dubious|date=June 2018}} Languages with little [[inflectional morphology]], such as [[English language|English]], are particularly prone to such ambiguity. [[Chinese language|Chinese]] is prone to such ambiguity because it is a [[tonal language]] during verbalization. Such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning. |
第233行: |
第233行: |
| Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient. For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives. | | Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case, is often inaccurate or insufficient. For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives. |
| | | |
− | <font color=#ff8000>命名实体识别 Named entity Recognition, NER</font>: 给定一个文本流,确定文本中的哪些词能映射到专有名称,如人或地点,以及这些名称的类型(例如:人名、地点名、组织名)。虽然大写有助于识别英语等语言中的命名实体,但这种信息无助于确定命名实体的类型,而且无论如何,这种信息往往是不准确或不充分的。比如说,一个句子的第一个字母也是大写的,命名实体通常跨越几个单词,只有一些是大写的。此外,许多其他非西方文字的语言(比如汉语或阿拉伯语)根本没有大写,即使是有大写的语言也不一定能用它来区分名字。比如德语不管一个名词是不是名称都将其大写,法语和西班牙语中作为形容词的名称不大写。 | + | '''<font color=#ff8000>命名实体识别 Named entity Recognition, NER</font>''': 给定一个文本流,确定文本中的哪些词能映射到专有名称,如人或地点,以及这些名称的类型(例如:人名、地点名、组织名)。虽然大写有助于识别英语等语言中的命名实体,但这种信息无助于确定命名实体的类型,而且无论如何,这种信息往往是不准确或不充分的。比如说,一个句子的第一个字母也是大写的,命名实体通常跨越几个单词,只有一些是大写的。此外,许多其他非西方文字的语言(比如汉语或阿拉伯语)根本没有大写,即使是有大写的语言也不一定能用它来区分名字。比如德语不管一个名词是不是名称都将其大写,法语和西班牙语中作为形容词的名称不大写。 |
| | | |
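The limits of capitalization as a cue for named entities, as described above, can be seen in a small Python sketch. It assumes a naive baseline that treats every run of consecutive capitalized tokens as a candidate name. The function `capitalized_spans` and the sample sentences are hypothetical, written only for this illustration.

```python
import re
from typing import List

# Assumed tokenizer: runs of ASCII letters only; punctuation and digits are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+")


def capitalized_spans(sentence: str) -> List[str]:
    """Return maximal runs of consecutive capitalized tokens as candidate names."""
    spans: List[str] = []
    current: List[str] = []
    for token in TOKEN_PATTERN.findall(sentence):
        if token[0].isupper():
            current.append(token)
        elif current:
            # A lowercase token ends the current run of capitalized tokens.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


if __name__ == "__main__":
    print(capitalized_spans("Yesterday the Bank of England raised rates."))
    print(capitalized_spans("Der Hund sah die Katze."))
```

On the English sentence the baseline flags the sentence-initial "Yesterday" and splits "Bank of England" at the lowercase "of". On the German sentence it flags the ordinary nouns "Hund" and "Katze". In neither case does it say anything about the entity type.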
| ; [[Natural language generation]]: Convert information from computer databases or semantic intents into readable human language. | | ; [[Natural language generation]]: Convert information from computer databases or semantic intents into readable human language. |
第245行: |
第245行: |
| Natural language understanding: Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural language expression which usually takes the form of organized notations of natural language concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural language semantics without confusions with implicit assumptions such as closed-world assumption (CWA) vs. open-world assumption, or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization. | | Natural language understanding: Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantic from the multiple possible semantics which can be derived from a natural language expression which usually takes the form of organized notations of natural language concepts. Introduction and creation of language metamodel and ontology are efficient however empirical solutions. An explicit formalization of natural language semantics without confusions with implicit assumptions such as closed-world assumption (CWA) vs. open-world assumption, or subjective Yes/No vs. objective True/False is expected for the construction of a basis of semantics formalization. |
| | | |
− | '''<font color=#ff8000>自然语言理解 Natural Language Understanding</font>: 将文本块转换成更加有条理的表示形式,比如'''<font color=#ff8000>一阶逻辑结构 First-order Logic Structure</font>''',这样计算机程序就更容易处理。自然语言理解涉及到从多种可能的语义中选出预期的语义,这些语义可以由有序符号表现的自然语言表达中派生出来。引入和创建语言元模型和本体是有效但经验化的做法。自然语言语义<font color=#32cd32>形式化</font>要求清楚明了,而不能是混有隐含的猜测,如封闭世界假设与开放世界假设、主观的是 / 否与客观的真 / 假。 | + | '''<font color=#ff8000>自然语言理解 Natural Language Understanding</font>''': 将文本块转换成更加有条理的表示形式,比如'''<font color=#ff8000>一阶逻辑结构 First-order Logic Structure</font>''',这样计算机程序就更容易处理。自然语言理解涉及到从多种可能的语义中选出预期的语义,这些语义可以由有序符号表现的自然语言表达中派生出来。引入和创建语言元模型和本体是有效但经验化的做法。自然语言语义<font color=#32cd32>形式化</font>要求清楚明了,而不能是混有隐含的猜测,如封闭世界假设与开放世界假设、主观的是 / 否与客观的真 / 假。 |
| | | |
| ; [[Optical character recognition]] (OCR): Given an image representing printed text, determine the corresponding text. | | ; [[Optical character recognition]] (OCR): Given an image representing printed text, determine the corresponding text. |