{{Short description|Method in natural language processing}}
{{machine learning bar}}

In [[natural language processing]] (NLP), '''word embedding''' is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.<ref>{{cite book |last1=Jurafsky |first1=Daniel |last2=H. James |first2=Martin |title=Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition |date=2000 |publisher=Prentice Hall |location=Upper Saddle River, N.J. |isbn=978-0-13-095069-7 |url=https://web.stanford.edu/~jurafsky/slp3/}}</ref> Word embeddings can be obtained using a set of [[language model]]ing and [[feature learning]] techniques where words or phrases from the vocabulary are mapped to [[vector (mathematics)|vectors]] of [[real numbers]].

Methods to generate this mapping include [[neural net language model|neural networks]],<ref>{{cite arXiv|eprint=1310.4546|last1=Mikolov|first1=Tomas|title=Distributed Representations of Words and Phrases and their Compositionality|last2=Sutskever|first2=Ilya|last3=Chen|first3=Kai|last4=Corrado|first4=Greg|last5=Dean|first5=Jeffrey|class=cs.CL|year=2013}}</ref> [[dimensionality reduction]] on the word [[co-occurrence matrix]],<ref>{{Cite book|arxiv=1312.5542|last1=Lebret|first1=Rémi|chapter=Word Emdeddings through Hellinger PCA|title=Conference of the European Chapter of the Association for Computational Linguistics (EACL)|volume=2014|last2=Collobert|first2=Ronan|year=2013}}</ref><ref>{{Cite conference|url=http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf|title=Neural Word Embedding as Implicit Matrix Factorization|last1=Levy|first1=Omer|conference=NIPS|year=2014|last2=Goldberg|first2=Yoav}}</ref><ref>{{Cite conference|url=http://ijcai.org/papers15/Papers/IJCAI15-513.pdf|title=Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective|last1=Li|first1=Yitan|conference=Int'l J. Conf. on Artificial Intelligence (IJCAI)|year=2015|last2=Xu|first2=Linli}}</ref> probabilistic models,<ref>{{Cite journal|last=Globerson|first=Amir|date=2007|title=Euclidean Embedding of Co-occurrence Data|url=http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34951.pdf|journal=Journal of Machine Learning Research}}</ref> explainable knowledge base method,<ref>{{Cite journal|last1=Qureshi|first1=M. Atif|last2=Greene|first2=Derek|date=2018-06-04|title=EVE: explainable vector based embedding technique using Wikipedia|journal=Journal of Intelligent Information Systems|volume=53|pages=137–165|language=en|doi=10.1007/s10844-018-0511-x|issn=0925-9902|arxiv=1702.06891|s2cid=10656055}}</ref> and explicit representation in terms of the context in which words appear.<ref>{{cite conference|last1=Levy|first1=Omer|last2=Goldberg|first2=Yoav|title=Linguistic Regularities in Sparse and Explicit Word Representations|conference=CoNLL|pages=171–180|year=2014|url=https://levyomer.files.wordpress.com/2014/04/linguistic-regularities-in-sparse-and-explicit-word-representations-conll-2014.pdf}}</ref>
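
As a minimal illustration of the basic idea, the following sketch assigns hypothetical three-dimensional vectors to a toy vocabulary and measures similarity as the cosine of the angle between them; real embeddings are learned from data and typically have tens to hundreds of dimensions.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical toy embeddings; real vectors are learned from large corpora.
embeddings = {
    "king":  np.array([0.80, 0.45, 0.10]),
    "queen": np.array([0.78, 0.50, 0.12]),
    "apple": np.array([0.05, 0.10, 0.90]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: similar meaning
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
</syntaxhighlight>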

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as [[syntactic parsing]]<ref>{{cite conference|last1=Socher|first1=Richard|last2=Bauer|first2=John|last3=Manning|first3=Christopher|last4=Ng|first4=Andrew|title=Parsing with compositional vector grammars|conference=Proc. ACL Conf.|year=2013|url=http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf}}</ref> and [[sentiment analysis]].<ref>{{cite conference|last1=Socher|first1=Richard|last2=Perelygin|first2=Alex|last3=Wu|first3=Jean|last4=Chuang|first4=Jason|last5=Manning|first5=Chris|last6=Ng|first6=Andrew|last7=Potts|first7=Chris|title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank|conference=EMNLP|year=2013|url=http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf}}</ref>

==Development and history of the approach==
In [[distributional semantics]], a quantitative methodological approach to understanding meaning in observed language, word embeddings or semantic vector space models have been used as a knowledge representation for some time.<ref>{{cite web|url=https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren/|first=Magnus|last=Sahlgren|title=A brief history of word embeddings}}</ref> Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by [[J. R. Firth|John Rupert Firth]],<ref>{{cite journal|last=Firth|first=J.R.|year=1957|title=A synopsis of linguistic theory 1930–1955|journal=Studies in Linguistic Analysis|pages=1–32}} Reprinted in {{cite book|editor=F.R. Palmer|title=Selected Papers of J.R. Firth 1952–1959|publisher=London: Longman|year=1968}}</ref> but also has roots in the contemporaneous work on search systems<ref>{{cite journal|last=Luhn|first=H.P.|year=1953|title=A New Method of Recording and Searching Information|journal=American Documentation|pages=14–16|doi=10.1002/asi.5090040104}}</ref> and in cognitive psychology.<ref>{{cite book|title=The Measurement of Meaning.|year=1957|last1=Osgood|first1=C.E.|last2=Suci|first2=G.J.|last3=Tannenbaum|first3=P.H.|publisher=University of Illinois Press}}</ref>

The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings is based on the computational challenges of capturing distributional characteristics and using them for practical application to measure similarity between words, phrases, or entire documents. The first generation of semantic space models is the [[vector space model]] for information retrieval.<ref name="Salton original">{{cite journal |last1=Salton |first1=Gerard |title=Some experiments in the generation of word and document associations |journal=Proceeding AFIPS '62 (Fall) Proceedings of the December 4–6, 1962, Fall Joint Computer Conference |series=AFIPS '62 (Fall) |date=1962 |pages=234–250 |doi=10.1145/1461518.1461544 |isbn=9781450378796 |s2cid=9937095 |url=https://dl.acm.org/doi/10.1145/1461518.1461544 |access-date=18 October 2020}}</ref><ref name="SaltonEA CACM">{{cite journal |last1=Salton |first1=Gerard |last2=Wong |first2=A |last3=Yang |first3=C S |title=A Vector Space Model for Automatic Indexing |journal=Communications of the Association for Computing Machinery (CACM) |date=1975 |volume=18 |issue=11 |pages=613–620|doi=10.1145/361219.361220 |hdl=1813/6057 |s2cid=6473756 |hdl-access=free }}</ref><ref>{{cite web |last1=Dubin |first1=David |title=The most influential paper Gerard Salton never wrote. |url=https://www.thefreelibrary.com/The+most+influential+paper+Gerard+Salton+never+wrote.-a0125151308 |access-date=18 October 2020 |date=2004}}</ref> Such vector space models for words and their distributional data, implemented in their simplest form, result in a very sparse vector space of high dimensionality (cf. [[Curse of dimensionality]]). Reducing the number of dimensions using linear algebraic methods such as [[Singular-value decomposition|singular value decomposition]] then led to the introduction of [[latent semantic analysis]] in the late 1980s and the [[Random indexing]] approach for collecting word co-occurrence contexts.<ref>Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000): [https://cloudfront.escholarship.org/dist/prd/content/qt5644k0w6/qt5644k0w6.pdf Random Indexing of Text Samples for Latent Semantic Analysis], Proceedings of the 22nd Annual Conference of the Cognitive Science Society, p.&nbsp;1036. Mahwah, New Jersey: Erlbaum, 2000.</ref><ref>{{cite journal |last1=Karlgren |first1=Jussi |last2=Sahlgren |first2=Magnus |editor1-last=Uesaka |editor1-first=Yoshinori |editor2-last=Kanerva |editor2-first=Pentti |editor3-last=Asoh |editor3-first=Hideki |title=From words to understanding |journal=Foundations of Real-World Intelligence |date=2001 |pages=294–308 |publisher=CSLI Publications}}</ref><ref>Sahlgren, Magnus (2005) [http://eprints.sics.se/221/1/RI_intro.pdf An Introduction to Random Indexing], Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, August 16, Copenhagen, Denmark</ref><ref>Sahlgren, Magnus, Holst, Anders and Pentti Kanerva (2008) [http://eprints.sics.se/3436/01/permutationsCogSci08.pdf Permutations as a Means to Encode Order in Word Space], In Proceedings of the 30th Annual Conference of the Cognitive Science Society: 1300–1305.</ref> In 2000 [[Yoshua Bengio|Bengio]] et al.
provided in a series of papers the "Neural probabilistic language models" to reduce the high dimensionality of words representations in contexts by "learning a distributed representation for words".<ref>{{cite journal|last1=Bengio|first1=Yoshua|author-link1=Yoshua Bengio|last2=Ducharme|first2=Réjean|last3=Vincent|first3=Pascal|last4=Jauvin|first4=Christian|year=2003|title=A Neural Probabilistic Language Model|url=https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf|journal=Journal of Machine Learning Research|volume=3|pages=1137–1155}}</ref><ref>{{cite book|title=A Neural Probabilistic Language Model|doi=10.1007/3-540-33486-6_6|journal=Studies in Fuzziness and Soft Computing|volume=194|pages=137–186|year=2006|last1=Bengio|first1=Yoshua|last2=Schwenk|first2=Holger|last3=Senécal|first3=Jean-Sébastien|last4=Morin|first4=Fréderic|last5=Gauvain|first5=Jean-Luc|isbn=978-3-540-30609-2 }}</ref>
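
The count-based route described above can be sketched as follows: build a word–word co-occurrence matrix from a small illustrative corpus and reduce it with a truncated singular value decomposition, as in latent semantic analysis. The corpus, the context window (here, a whole sentence) and the number of retained dimensions are arbitrary choices for illustration.

<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts; the context window is the whole sentence.
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for w1, w2 in combinations(sentence.split(), 2):
        cooc[index[w1], index[w2]] += 1
        cooc[index[w2], index[w1]] += 1

# Truncated SVD: keep the top-k singular vectors as dense word embeddings.
k = 2
U, S, Vt = np.linalg.svd(cooc)
word_vectors = U[:, :k] * S[:k]          # one k-dimensional vector per word
print(dict(zip(vocab, word_vectors.round(2))))
</syntaxhighlight>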

A study published in NeurIPS (NIPS) 2002 introduced the use of both word and document embeddings applying the method of kernel CCA to bilingual (and multi-lingual) corpora, also providing an early example of [[self-supervised learning]] of word embeddings.<ref>{{cite conference|year=2002|last1=Vinokourov|first1=Alexei|last2=Cristianini|first2=Nello|last3=Shawe-Taylor|first3=John|title=Inferring a semantic representation of text via cross-language correlation analysis.|conference=Advances in neural information processing systems 15|url=https://proceedings.neurips.cc/paper/2002/file/d5e2fbef30a4eb668a203060ec8e5eef-Paper.pdf}}</ref>

Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of linguistic contexts in which the words occur; these different styles are studied in (Lavelli et al., 2004).<ref>{{cite conference|year=2004|last1=Lavelli|first1=Alberto|last2=Sebastiani|first2=Fabrizio|last3=Zanoli|first3=Roberto|title=Distributional term representations: an experimental comparison|conference=13th ACM International Conference on Information and Knowledge Management|pages=615–624|doi=10.1145/1031171.1031284 }}</ref> Roweis and Saul published in ''Science'' how to use "[[Nonlinear dimensionality reduction#Locally-linear embedding|locally linear embedding]]" (LLE) to discover representations of high dimensional data structures.<ref>{{cite journal|title=Nonlinear Dimensionality Reduction by Locally Linear Embedding|journal=Science|volume=290|issue=5500|pages=2323–6|bibcode=2000Sci...290.2323R|last1=Roweis|first1=Sam T.|last2=Saul|first2=Lawrence K.|year=2000|doi=10.1126/science.290.5500.2323|pmid=11125150|citeseerx=10.1.1.111.3313}}</ref> Most new word embedding techniques after about 2005 rely on a [[neural network]] architecture instead of more probabilistic and algebraic models, following foundational work by Yoshua Bengio and colleagues.<ref>{{cite book|last1=Morin|first1=Fredric|last2=Bengio|first2=Yoshua|chapter=Hierarchical probabilistic neural network language model|chapter-url=http://proceedings.mlr.press/r5/morin05a/morin05a.pdf|title=Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics|series=Proceedings of Machine Learning Research|volume=R5|pages=246–252|year=2005 |editor1 = Cowell, Robert G. |editor2=Ghahramani, Zoubin}}</ref><ref>{{cite journal|last1=Mnih|first1=Andriy|last2=Hinton|first2=Geoffrey|title=A Scalable Hierarchical Distributed Language Model|pages=1081–1088|url=http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model|journal=Advances in Neural Information Processing Systems 21 (NIPS 2008)|publisher=Curran Associates, Inc.|year=2009|volume=21}}</ref>
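
Locally linear embedding itself is implemented in the scikit-learn library; the sketch below applies it to random data standing in for a real high-dimensional representation, so the shapes and parameter values are purely illustrative.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in for a high-dimensional representation (e.g. sparse context counts).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # 100 items, 50 dimensions each

# Reduce to 2 dimensions while preserving local neighbourhood structure.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_2d = lle.fit_transform(X)
print(X_2d.shape)                # (100, 2)
</syntaxhighlight>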

The approach has been adopted by many research groups after advances were made around 2010 in theoretical work on the quality of vectors and the training speed of the model, and after hardware advances allowed a broader parameter space to be explored profitably. In 2013, a team at [[Google]] led by [[Tomas Mikolov]] created [[word2vec]], a word embedding toolkit that can train vector space models faster than the previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest for word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.<ref>{{cite web |title=word2vec |url=https://code.google.com/archive/p/word2vec/|website=Google Code Archive |access-date=23 July 2021}}</ref>
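
In practice, word2vec-style models are often trained through the reimplementation in the Gensim library; the sketch below assumes Gensim 4.x (where the dimensionality parameter is named <code>vector_size</code>) and uses a tiny tokenised corpus in place of real training data.

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Tokenised toy corpus; real training uses many millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["cat"]                      # the learned 50-dimensional vector
print(model.wv.most_similar("cat", topn=2))   # nearest neighbours in the vector space
</syntaxhighlight>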

==Limitations==
Traditionally, one of the main limitations of word embeddings (word [[vector space model]]s in general) is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, [[polysemy]] and [[homonym|homonymy]] are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term ''club'' is related to the word sense of a ''[[club sandwich]]'', ''[[baseball|baseball club]]'', ''[[Meeting house|clubhouse]]'', ''[[golf club]]'', or any other sense that ''club'' might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.<ref>{{Cite book|url=https://www.aclweb.org/anthology/N10-1013/|title=Multi-Prototype Vector-Space Models of Word Meaning|last1=Reisinger|first1=Joseph|last2=Mooney|first2=Raymond J.|date=2010|publisher=Association for Computational Linguistics|isbn=978-1-932432-65-7|volume=Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics|location=Los Angeles, California|pages=109–117|access-date=October 25, 2019}}</ref><ref>{{Cite book|title=Improving word representations via global context and multiple word prototypes|last=Huang, Eric.|date=2012|oclc=857900050}}</ref>

Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., unsupervised and knowledge-based.<ref>{{cite arXiv|last1=Camacho-Collados|first1=Jose|last2=Pilehvar|first2=Mohammad Taher|year=2018|title=From Word to Sense Embeddings: A Survey on Vector Representations of Meaning|class=cs.CL|eprint=1805.04032}}</ref> Based on [[word2vec]] skip-gram, Multi-Sense Skip-Gram (MSSG)<ref>{{Cite journal|last1=Neelakantan|first1=Arvind|last2=Shankar|first2=Jeevan|last3=Passos|first3=Alexandre|last4=McCallum|first4=Andrew|date=2014|title=Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space|journal=Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)|pages=1059–1069|location=Stroudsburg, PA, USA|publisher=Association for Computational Linguistics|doi=10.3115/v1/d14-1113|arxiv=1504.06654|s2cid=15251438}}</ref> performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., [[WordNet]], [[Open Mind Common Sense|ConceptNet]], [[BabelNet]]), word embeddings and [[word sense disambiguation]], Most Suitable Sense Annotation (MSSA)<ref>{{Cite journal|last1=Ruas|first1=Terry|last2=Grosky|first2=William|last3=Aizawa|first3=Akiko|date=2019-12-01|title=Multi-sense embeddings through a word sense disambiguation process|journal=Expert Systems with Applications|volume=136|pages=288–303|doi=10.1016/j.eswa.2019.06.026|arxiv=2101.08700|issn=0957-4174|hdl=2027.42/145475|s2cid=52225306|hdl-access=free}}</ref> labels word-senses through an unsupervised and knowledge-based approach considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embeddings technique, so multi-sense embeddings are produced. MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.

The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as [[part-of-speech tagging]], semantic relation identification, [[semantic relatedness]], [[named entity recognition]] and sentiment analysis.<ref name=":1">{{Cite journal|last1=Akbik|first1=Alan|last2=Blythe|first2=Duncan|last3=Vollgraf|first3=Roland|date=2018|title=Contextual String Embeddings for Sequence Labeling|url=https://www.aclweb.org/anthology/C18-1139|journal=Proceedings of the 27th International Conference on Computational Linguistics|location=Santa Fe, New Mexico, USA|publisher=Association for Computational Linguistics|pages=1638–1649}}</ref><ref>{{Cite journal|last1=Li|first1=Jiwei|last2=Jurafsky|first2=Dan|date=2015|title=Do Multi-Sense Embeddings Improve Natural Language Understanding?|journal=Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing|pages=1722–1732|location=Stroudsburg, PA, USA|publisher=Association for Computational Linguistics|doi=10.18653/v1/d15-1200|arxiv=1506.01070|s2cid=6222768}}</ref>

Recently{{When|date=June 2022}}, contextually-meaningful embeddings such as [[ELMo]] and [[BERT (language model)|BERT]] have been developed. These embeddings use a word's context to disambiguate polysemes. They do so using [[Long short-term memory|LSTM]] and [[Transformer (machine learning model)|Transformer]] neural network architectures.
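
The sketch below, using the Hugging Face Transformers library, shows how a contextual model such as BERT assigns different vectors to the same surface word in different sentences; the model name and the simple token lookup are illustrative and assume the word is not split into sub-word pieces.

<syntaxhighlight lang="python">
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence, word):
    """Return the contextual vector of the first occurrence of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embed_word("the club I tried yesterday was great", "club")
v2 = embed_word("he hit the ball with a golf club", "club")
# Unlike a static embedding, the two vectors for "club" differ with context.
print(torch.cosine_similarity(v1, v2, dim=0))
</syntaxhighlight>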

==For biological sequences: BioVectors==
Word embeddings for ''n''-grams in biological sequences (e.g. DNA, RNA, and proteins) for [[bioinformatics]] applications have been proposed by Asgari and Mofrad.<ref name=":0">{{cite journal|last1=Asgari|first1=Ehsaneddin|last2=Mofrad|first2=Mohammad R.K.|title=Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics|journal=PLOS ONE|date=2015|volume=10|issue=11|page=e0141287|doi=10.1371/journal.pone.0141287|pmid=26555596|pmc=4640716|bibcode=2015PLoSO..1041287A|arxiv=1503.05140|doi-access=free}}</ref> Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in [[proteomics]] and [[genomics]]. The results presented by Asgari and Mofrad<ref name=":0"/> suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
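
The idea can be sketched by splitting each sequence into overlapping 3-grams of residues and treating those 3-grams as the "words" of a standard embedding method; the original ProtVec setup instead generates three shifted, non-overlapping readings per sequence, and the fragments and parameters below are purely illustrative.

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

def to_ngrams(sequence, n=3):
    """Split a biological sequence into overlapping n-grams ("biological words")."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Illustrative amino-acid fragments; real ProtVec training uses Swiss-Prot-scale data.
proteins = ["MKTAYIAKQR", "MKTLLLTLVV", "GAVLIPFYWS"]
sentences = [to_ngrams(p) for p in proteins]

model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, sg=1, epochs=100)
print(model.wv["MKT"][:5])   # first components of the embedding of the 3-gram "MKT"
</syntaxhighlight>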

==Sentence embeddings==
{{Main|Sentence embedding}}
The idea has been extended to embeddings of entire sentences or even documents, e.g. in form of the [[thought vector]]s concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of [[machine translation]].<ref>{{cite arXiv|title=skip-thought vectors|eprint=1506.06726|last1=Kiros|first1=Ryan|last2=Zhu|first2=Yukun|last3=Salakhutdinov|first3=Ruslan|last4=Zemel|first4=Richard S.|last5=Torralba|first5=Antonio|last6=Urtasun|first6=Raquel|last7=Fidler|first7=Sanja|class=cs.CL|year=2015}}</ref>
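
Skip-thought vectors require training a dedicated encoder; a much simpler baseline (not the skip-thought method itself) embeds a sentence by averaging the vectors of its words, as sketched below with hypothetical pre-trained word vectors.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical pre-trained word vectors (in practice loaded from word2vec, GloVe, fastText, ...).
word_vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.8, 0.4, 0.1]),
    "sat": np.array([0.3, 0.7, 0.2]),
}

def sentence_embedding(tokens, vectors):
    """Average the vectors of the known tokens: a simple bag-of-words sentence embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else None

print(sentence_embedding(["the", "cat", "sat"], word_vectors))
</syntaxhighlight>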


==Software==
Software for training and using word embeddings includes Tomas Mikolov's [[Word2vec]], Stanford University's [[GloVe (machine learning)|GloVe]],<ref>{{cite web|url=http://nlp.stanford.edu/projects/glove/|title=GloVe}}</ref> GN-GloVe,<ref name="gn-glove">{{cite arXiv|last=Zhao|first=Jieyu|collaboration=2018|title=Learning Gender-Neutral Word Embeddings|year=2018|eprint=1809.01496|class=cs.CL}}</ref> Flair embeddings,<ref name=":1" /> AllenNLP's [[ELMo]],<ref>{{cite web|url=https://allennlp.org/elmo|title=Elmo}}</ref> [[BERT (language model)|BERT]],<ref>{{cite arXiv|last1=Pires|first1=Telmo|last2=Schlinger|first2=Eva|last3=Garrette|first3=Dan|date=2019-06-04|title=How multilingual is Multilingual BERT?|class=cs.CL|eprint=1906.01502}}</ref> [[fastText]], [[Gensim]],<ref>{{cite web|url=http://radimrehurek.com/gensim/|title=Gensim}}</ref> Indra<ref>{{cite web|url=https://github.com/Lambda-3/Indra|title=Indra|website=[[GitHub]]|date=2018-10-25}}</ref> and [[Deeplearning4j]]. [[Principal Component Analysis]] (PCA) and [[T-Distributed Stochastic Neighbour Embedding]] (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and [[Word-sense induction|clusters]].<ref>{{Cite journal|last1=Ghassemi|first1=Mohammad|last2=Mark|first2=Roger|last3=Nemati|first3=Shamim|date=2015|title=A Visualization of Evolving Clinical Sentiment Using Vector Representations of Clinical Notes|url=http://www.cinc.org/archives/2015/pdf/0629.pdf|journal=Computing in Cardiology}}</ref>
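
The visualisation step can be sketched with scikit-learn and matplotlib; the random matrix below stands in for a real set of word vectors, and for t-SNE the perplexity must be smaller than the number of points.

<syntaxhighlight lang="python">
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(30)]
vectors = rng.normal(size=(30, 100))        # stand-in for 100-dimensional word embeddings

coords = PCA(n_components=2).fit_transform(vectors)
# Alternatively: coords = TSNE(n_components=2, perplexity=5).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)
plt.show()
</syntaxhighlight>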

===Examples of application===
For instance, fastText is also used to calculate word embeddings for [[text corpora]] in [[Sketch Engine]] that are available online.<ref>{{cite web|url=https://embeddings.sketchengine.co.uk/|title=Embedding Viewer|author=<!--Not stated-->|website=Embedding Viewer|publisher=Lexical Computing|access-date=7 Feb 2018}}</ref>

==See also==
* [[Brown clustering]]
* [[Distributional–relational database]]

==References==
{{Reflist}}

{{Natural Language Processing}}

[[Category:Language modeling]]
[[Category:Artificial neural networks]]
[[Category:Natural language processing]]
[[Category:Computational linguistics]]
[[Category:Semantic relations]]

<noinclude>

<small>This page was moved from [[wikipedia:en:Word embedding]]. Its edit history can be viewed at [[词嵌入/edithistory]]</small></noinclude>

[[Category:待整理页面]]