Word embedding


In natural language processing (NLP), word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.[1] Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.

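To make the geometric intuition concrete, the short sketch below loads a set of pre-trained vectors with the Gensim library (listed under Software below) and compares words by cosine similarity. It is a minimal illustration only: the pre-trained dataset name is one of Gensim's downloadable models and is an assumption here; any vectors in word2vec format would serve equally well.

import gensim.downloader as api

# Download a small set of pre-trained vectors (assumed dataset name).
wv = api.load("glove-wiki-gigaword-50")

# Words with related meanings end up close together in the vector space.
print(wv.similarity("coffee", "tea"))      # high cosine similarity
print(wv.similarity("coffee", "granite"))  # lower cosine similarity
print(wv.most_similar("king", topn=3))     # nearest neighbours in the space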

Methods to generate this mapping include neural networks,[2] dimensionality reduction on the word co-occurrence matrix,[3][4][5] probabilistic models,[6] explainable knowledge base method,[7] and explicit representation in terms of the context in which words appear.[8]

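As a hedged sketch of the count-based route, the following code builds a word–word co-occurrence matrix from a toy corpus and reduces its dimensionality with truncated SVD; the corpus, window size, and output dimension are arbitrary choices for illustration, not a recommended configuration.

import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "cats and dogs are animals",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Count symmetric co-occurrences within a +/-2 word window.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[index[word], index[sentence[j]]] += 1

# Reduce the sparse, high-dimensional counts to dense 5-dimensional word vectors.
svd = TruncatedSVD(n_components=5, random_state=0)
embeddings = svd.fit_transform(cooc)
print(embeddings[index["cat"]])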

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing[9] and sentiment analysis.[10]


Development and history of the approach

In distributional semantics, a quantitative methodological approach to understanding meaning in observed language, word embeddings or semantic vector space models have been used as a knowledge representation for some time.[11] Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by John Rupert Firth,[12] but also has roots in contemporaneous work on search systems[13] and in cognitive psychology.[14]


The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings is based on the computational challenge of capturing distributional characteristics and using them for practical application to measure similarity between words, phrases, or entire documents. The first generation of semantic space models is the vector space model for information retrieval.[15][16][17] Such vector space models for words and their distributional data, implemented in their simplest form, result in a very sparse vector space of high dimensionality (cf. curse of dimensionality). Reducing the number of dimensions using linear-algebraic methods such as singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s and the random indexing approach for collecting word co-occurrence contexts.[18][19][20][21] In 2000, Bengio et al. provided in a series of papers the "neural probabilistic language models" to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words".[22][23]


A study published in NeurIPS (NIPS) 2002 introduced the use of both word and document embeddings, applying the method of kernel CCA to bilingual (and multilingual) corpora, and also provided an early example of self-supervised learning of word embeddings.[24]


Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of the linguistic contexts in which the words occur; these different styles are studied by Lavelli et al. (2004).[25] Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high-dimensional data structures.[26] Since foundational work by Yoshua Bengio and colleagues, most new word embedding techniques after about 2005 have relied on neural network architectures rather than on probabilistic and algebraic models.[27][28]


The approach was adopted by many research groups after advances around 2010 in theoretical work on the quality of vectors and on the training speed of models, together with hardware advances, allowed a broader parameter space to be explored profitably. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest in word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.[29]

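A minimal sketch of training skip-gram vectors with Gensim's implementation of word2vec is shown below; the toy corpus and hyperparameters are placeholders, and the parameter names follow recent Gensim releases (older releases use size instead of vector_size).

from gensim.models import Word2Vec

# Each training example is a tokenized sentence; a real corpus would be far larger.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 selects the skip-gram variant; sg=0 would select CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])                # first dimensions of a learned vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words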

Limitations

Traditionally, one of the main limitations of word embeddings (word vector space models in general) is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, polysemy and homonymy are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term club is related to the word sense of a club sandwich, baseball club, clubhouse, golf club, or any other sense that club might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.[30][31]


Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., unsupervised and knowledge-based.[32] Based on word2vec skip-gram, Multi-Sense Skip-Gram (MSSG)[33] performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., WordNet, ConceptNet, BabelNet), word embeddings and word sense disambiguation, Most Suitable Sense Annotation (MSSA)[34] labels word-senses through an unsupervised and knowledge-based approach considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embeddings technique, so multi-sense embeddings are produced. MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.


The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as part-of-speech tagging, semantic relation identification, semantic relatedness, named entity recognition and sentiment analysis.[35][36]


More recently, contextually meaningful embeddings such as ELMo and BERT have been developed. These embeddings use a word's context to disambiguate polysemes. They do so using LSTM (ELMo) and Transformer (BERT) neural network architectures.

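To illustrate the difference from static embeddings, the sketch below uses the Hugging Face Transformers library to extract contextual BERT vectors for the word "club" (the ambiguous example from the Limitations section) in two sentences: the same surface word receives different vectors depending on its context. The model name, the example sentences, and the assumption that "club" survives tokenization as a single token are illustrative choices, not part of any reference implementation.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    # Return the hidden state of the (assumed single) token for `word`.
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]  # (sequence_length, 768)
    pieces = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    return hidden[pieces.index(word)]

a = contextual_vector("the golf club was too heavy to swing", "club")
b = contextual_vector("she joined the chess club at school", "club")
print(torch.cosine_similarity(a, b, dim=0))  # below 1.0: same word, different senses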

For biological sequences: BioVectors

Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad.[37] Named bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences), and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad[37] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.

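In the spirit of this approach, the sketch below splits toy amino-acid sequences into overlapping 3-grams and feeds them to a standard word2vec model via Gensim; the published ProtVec setup (its n-gram scheme, corpus, and hyperparameters) differs in detail, so this is only a simplified illustration.

from gensim.models import Word2Vec

def to_ngrams(sequence, n=3):
    # Split an amino-acid sequence into overlapping n-grams ("biological words").
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# Toy protein sequences; a real corpus would contain many thousands of proteins.
proteins = ["MKTAYIAKQR", "MKTAYLAKQL", "GAVLIPFYWS"]
corpus = [to_ngrams(p) for p in proteins]

model = Word2Vec(corpus, vector_size=20, window=5, min_count=1, sg=1, epochs=100)
print(model.wv["MKT"][:5])  # embedding of the 3-gram "MKT"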

Sentence embeddings

The idea has been extended to embeddings of entire sentences or even documents, e.g. in the form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of machine translation.[38]

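Training skip-thought vectors is beyond a short example, but a much simpler baseline conveys the step from word to sentence embeddings: average the vectors of the words in a sentence. The sketch below does this with hand-made toy vectors standing in for learned ones; it is a bag-of-vectors baseline, not the skip-thought method.

import numpy as np

# Toy word vectors; in practice these would come from a trained embedding model.
word_vectors = {
    "good": np.array([0.9, 0.1]),
    "movie": np.array([0.2, 0.8]),
    "great": np.array([0.8, 0.2]),
    "film": np.array([0.3, 0.7]),
}

def sentence_vector(tokens):
    # Average the vectors of in-vocabulary words to get one vector per sentence.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(known, axis=0)

a = sentence_vector("good movie".split())
b = sentence_vector("great film".split())
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity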

Software

Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe,[39] GN-GloVe,[40] Flair embeddings,[35] AllenNLP's ELMo,[41] BERT,[42] fastText, Gensim,[43] Indra[44] and Deeplearning4j. Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.[45]

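Both reductions are available in scikit-learn. The sketch below projects a handful of word vectors to two dimensions with PCA and labels the points; the vectors here are random stand-ins for vectors taken from any of the models above, and t-SNE would be used the same way via sklearn.manifold.TSNE (usually on larger vocabularies).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random stand-ins; replace with vectors from a trained embedding model.
rng = np.random.default_rng(0)
words = ["king", "queen", "apple", "banana", "car", "truck"]
vectors = rng.normal(size=(len(words), 100))

coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2-D with PCA")
plt.savefig("embeddings_pca.png")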

Examples of application

For instance, fastText is also used to calculate word embeddings for the text corpora in Sketch Engine that are available online.[46]


See also


References

  1. 1.0 1.1 Jurafsky, Daniel; H. James, Martin (2000). Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J.: Prentice Hall. ISBN 978-0-13-095069-7. https://web.stanford.edu/~jurafsky/slp3/. 
  2. 2.0 2.1 Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Distributed Representations of Words and Phrases and their Compositionality". arXiv:1310.4546 [cs.CL].
  3. 3.0 3.1 Lebret, Rémi; Collobert, Ronan (2013). "Word Emdeddings through Hellinger PCA". Conference of the European Chapter of the Association for Computational Linguistics (EACL). 2014. arXiv:1312.5542. 
  4. 4.0 4.1 Levy, Omer; Goldberg, Yoav (2014). Neural Word Embedding as Implicit Matrix Factorization (PDF). NIPS.
  5. 5.0 5.1 Li, Yitan; Xu, Linli (2015). Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective (PDF). Int'l J. Conf. on Artificial Intelligence (IJCAI).
  6. 6.0 6.1 Globerson, Amir (2007). "Euclidean Embedding of Co-occurrence Data" (PDF). Journal of Machine Learning Research.
  7. 7.0 7.1 Qureshi, M. Atif; Greene, Derek (2018-06-04). "EVE: explainable vector based embedding technique using Wikipedia". Journal of Intelligent Information Systems (in English). 53: 137–165. arXiv:1702.06891. doi:10.1007/s10844-018-0511-x. ISSN 0925-9902. S2CID 10656055.
  8. 8.0 8.1 Levy, Omer; Goldberg, Yoav (2014). Linguistic Regularities in Sparse and Explicit Word Representations (PDF). CoNLL. pp. 171–180.
  9. 9.0 9.1 Socher, Richard; Bauer, John; Manning, Christopher; Ng, Andrew (2013). Parsing with compositional vector grammars (PDF). Proc. ACL Conf.
  10. 10.0 10.1 Socher, Richard; Perelygin, Alex; Wu, Jean; Chuang, Jason; Manning, Chris; Ng, Andrew; Potts, Chris (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (PDF). EMNLP.
  11. 11.0 11.1 Sahlgren, Magnus. "A brief history of word embeddings".
  12. 12.0 12.1 Firth, J.R. (1957). "A synopsis of linguistic theory 1930–1955". Studies in Linguistic Analysis: 1–32. Reprinted in F.R. Palmer, ed. (1968). Selected Papers of J.R. Firth 1952–1959. London: Longman. 
  13. 13.0 13.1 Luhn, H.P. (1953). "A New Method of Recording and Searching Information". American Documentation: 14–16. doi:10.1002/asi.5090040104.
  14. 14.0 14.1 Osgood, C.E.; Suci, G.J.; Tannenbaum, P.H. (1957). The Measurement of Meaning.. University of Illinois Press. 
  15. 15.0 15.1 Salton, Gerard (1962). "Some experiments in the generation of word and document associations". Proceeding AFIPS '62 (Fall) Proceedings of the December 4–6, 1962, Fall Joint Computer Conference. AFIPS '62 (Fall): 234–250. doi:10.1145/1461518.1461544. ISBN 9781450378796. S2CID 9937095. Retrieved 18 October 2020.
  16. 16.0 16.1 Salton, Gerard; Wong, A; Yang, C S (1975). "A Vector Space Model for Automatic Indexing". Communications of the Association for Computing Machinery (CACM). 18 (11): 613–620. doi:10.1145/361219.361220. hdl:1813/6057. S2CID 6473756.
  17. 17.0 17.1 Dubin, David (2004). "The most influential paper Gerard Salton never wrote". Retrieved 18 October 2020.
  18. 18.0 18.1 Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000): Random Indexing of Text Samples for Latent Semantic Analysis, Proceedings of the 22nd Annual Conference of the Cognitive Science Society, p. 1036. Mahwah, New Jersey: Erlbaum, 2000.
  19. 19.0 19.1 Karlgren, Jussi; Sahlgren, Magnus (2001). Uesaka, Yoshinori; Kanerva, Pentti; Asoh, Hideki (eds.). "From words to understanding". Foundations of Real-World Intelligence. CSLI Publications: 294–308.
  20. 20.0 20.1 Sahlgren, Magnus (2005) An Introduction to Random Indexing, Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, August 16, Copenhagen, Denmark
  21. 21.0 21.1 Sahlgren, Magnus, Holst, Anders and Pentti Kanerva (2008) Permutations as a Means to Encode Order in Word Space, In Proceedings of the 30th Annual Conference of the Cognitive Science Society: 1300–1305.
  22. 22.0 22.1 Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian (2003). "A Neural Probabilistic Language Model" (PDF). Journal of Machine Learning Research. 3: 1137–1155.
  23. 23.0 23.1 Bengio, Yoshua; Schwenk, Holger; Senécal, Jean-Sébastien; Morin, Fréderic; Gauvain, Jean-Luc (2006). A Neural Probabilistic Language Model. 194. pp. 137–186. doi:10.1007/3-540-33486-6_6. ISBN 978-3-540-30609-2. 
  24. 24.0 24.1 Vinokourov, Alexei; Cristianini, Nello; Shawe-Taylor, John (2002). Inferring a semantic representation of text via cross-language correlation analysis (PDF). Advances in neural information processing systems 15.
  25. 25.0 25.1 Lavelli, Alberto; Sebastiani, Fabrizio; Zanoli, Roberto (2004). Distributional term representations: an experimental comparison. 13th ACM International Conference on Information and Knowledge Management. pp. 615–624. doi:10.1145/1031171.1031284.
  26. 26.0 26.1 Roweis, Sam T.; Saul, Lawrence K. (2000). "Nonlinear Dimensionality Reduction by Locally Linear Embedding". Science. 290 (5500): 2323–6. Bibcode:2000Sci...290.2323R. CiteSeerX 10.1.1.111.3313. doi:10.1126/science.290.5500.2323. PMID 11125150.
  27. 27.0 27.1 Morin, Fredric; Bengio, Yoshua (2005). "Hierarchical probabilistic neural network language model". Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. R5. pp. 246–252. http://proceedings.mlr.press/r5/morin05a/morin05a.pdf. 
  28. 28.0 28.1 Mnih, Andriy; Hinton, Geoffrey (2009). "A Scalable Hierarchical Distributed Language Model". Advances in Neural Information Processing Systems 21 (NIPS 2008). Curran Associates, Inc. 21: 1081–1088.
  29. 29.0 29.1 "word2vec". Google Code Archive. Retrieved 23 July 2021.
  30. 30.0 30.1 Reisinger, Joseph; Mooney, Raymond J. (2010). Multi-Prototype Vector-Space Models of Word Meaning. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California: Association for Computational Linguistics. pp. 109–117. ISBN 978-1-932432-65-7. https://www.aclweb.org/anthology/N10-1013/. 
  31. 31.0 31.1 Huang, Eric. (2012). Improving word representations via global context and multiple word prototypes. OCLC 857900050. 
  32. 32.0 32.1 Camacho-Collados, Jose; Pilehvar, Mohammad Taher (2018). "From Word to Sense Embeddings: A Survey on Vector Representations of Meaning". arXiv:1805.04032 [cs.CL].
  33. 33.0 33.1 Neelakantan, Arvind; Shankar, Jeevan; Passos, Alexandre; McCallum, Andrew (2014). "Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics: 1059–1069. arXiv:1504.06654. doi:10.3115/v1/d14-1113. S2CID 15251438.
  34. 34.0 34.1 Ruas, Terry; Grosky, William; Aizawa, Akiko (2019-12-01). "Multi-sense embeddings through a word sense disambiguation process". Expert Systems with Applications. 136: 288–303. arXiv:2101.08700. doi:10.1016/j.eswa.2019.06.026. hdl:2027.42/145475. ISSN 0957-4174. S2CID 52225306.
  35. 35.0 35.1 35.2 35.3 Akbik, Alan; Blythe, Duncan; Vollgraf, Roland (2018). "Contextual String Embeddings for Sequence Labeling". Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics: 1638–1649.
  36. 36.0 36.1 Li, Jiwei; Jurafsky, Dan (2015). "Do Multi-Sense Embeddings Improve Natural Language Understanding?". Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics: 1722–1732. arXiv:1506.01070. doi:10.18653/v1/d15-1200. S2CID 6222768.
  37. 37.0 37.1 37.2 37.3 Asgari, Ehsaneddin; Mofrad, Mohammad R.K. (2015). "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLOS ONE. 10 (11): e0141287. arXiv:1503.05140. Bibcode:2015PLoSO..1041287A. doi:10.1371/journal.pone.0141287. PMC 4640716. PMID 26555596.
  38. 38.0 38.1 Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard S.; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015). "skip-thought vectors". arXiv:1506.06726 [cs.CL].
  39. 39.0 39.1 "GloVe".
  40. 40.0 40.1 Zhao, Jieyu; et al. (2018). "Learning Gender-Neutral Word Embeddings". arXiv:1809.01496 [cs.CL].
  41. 41.0 41.1 "Elmo".
  42. 42.0 42.1 Pires, Telmo; Schlinger, Eva; Garrette, Dan (2019-06-04). "How multilingual is Multilingual BERT?". arXiv:1906.01502 [cs.CL].
  43. 43.0 43.1 "Gensim".
  44. 44.0 44.1 "Indra". GitHub. 2018-10-25.
  45. Ghassemi, Mohammad; Mark, Roger; Nemati, Shamim (2015). "A Visualization of Evolving Clinical Sentiment Using Vector Representations of Clinical Notes" (PDF). Computing in Cardiology.
  46. 46.0 46.1 "Embedding Viewer". Embedding Viewer. Lexical Computing. Retrieved 7 Feb 2018.


Category:Language modeling Category:Artificial neural networks Category:Natural language processing Category:Computational linguistics Category:Semantic relations



This page was moved from wikipedia:en:Word embedding. Its edit history can be viewed at 词嵌入/edithistory