Word2Vec


Word2vec is a technique for natural language processing published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.

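The following minimal sketch (illustrative only; the toy vectors are invented and far smaller than real embeddings) shows the cosine-similarity computation that word2vec relies on to compare word vectors:

```python
# Cosine similarity between toy "word vectors" (illustrative stand-ins for learned embeddings).
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v_king   = np.array([0.8, 0.3, 0.1, 0.5])
v_queen  = np.array([0.7, 0.4, 0.2, 0.5])
v_banana = np.array([0.1, 0.9, 0.8, 0.0])

print(cosine_similarity(v_king, v_queen))   # high: related words get nearby vectors
print(cosine_similarity(v_king, v_banana))  # lower: unrelated words lie further apart
```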

Approach

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.[1]



History

Word2vec was created, patented,[2] and published in 2013 by a team of researchers led by Tomas Mikolov at Google over two papers.[3][4] Other researchers helped analyse and explain the algorithm.[5] Embedding vectors created using the Word2vec algorithm have some advantages compared to earlier algorithms[1] such as latent semantic analysis.



CBOW and skip grams

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words.[1][6] According to the authors' note,[7] CBOW is faster while skip-gram does a better job for infrequent words.

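The sketch below shows how the choice between the two architectures is typically made in practice with the Gensim library (a third-party implementation; the two-sentence corpus and parameter values are illustrative only):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "barks", "at", "the", "fox"],
]

# sg=0 selects CBOW (predict the current word from its context window).
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# sg=1 selects continuous skip-gram (predict the context window from the current word).
skip_gram_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(skip_gram_model.wv["fox"][:5])  # first five components of the learned vector for "fox"
```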

Parameterization

Results of word2vec training can be sensitive to parametrization. The following are some important parameters in word2vec training.


Training algorithm

A Word2vec model can be trained with hierarchical softmax and/or negative sampling. To approximate the conditional log-likelihood a model seeks to maximize, the hierarchical softmax method uses a Huffman tree to reduce calculation. The negative sampling method, on the other hand, approaches the maximization problem by minimizing the log-likelihood of sampled negative instances. According to the authors, hierarchical softmax works better for infrequent words while negative sampling works better for frequent words and better with low dimensional vectors.[7] As training epochs increase, hierarchical softmax stops being useful.[8]

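In Gensim's implementation, the same choice is exposed through the hs and negative parameters; the sketch below is illustrative only (toy corpus, default values otherwise):

```python
from gensim.models import Word2Vec

sentences = [["this", "is", "a", "tiny", "corpus"],
             ["word2vec", "usually", "needs", "much", "more", "text"]]

# Hierarchical softmax only: hs=1, and negative=0 switches negative sampling off.
model_hs = Word2Vec(sentences, min_count=1, hs=1, negative=0)

# Negative sampling: hs=0, with five "noise" words drawn for each positive example.
model_neg = Word2Vec(sentences, min_count=1, hs=0, negative=5)
```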


Sub-sampling

High-frequency words often provide little information. Words with a frequency above a certain threshold may be subsampled to speed up training.[9]

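In the original word2vec papers, each occurrence of a word with relative frequency f is discarded with probability 1 − √(t/f), where t is a small threshold (on the order of 10⁻⁵); Gensim exposes this threshold as the sample parameter. The helper below is an illustrative sketch of that rule (the function name and example frequencies are invented):

```python
import math
import random

def keep_token(frequency, threshold=1e-5):
    """Return True if a token with the given relative corpus frequency should be kept."""
    discard_prob = max(0.0, 1.0 - math.sqrt(threshold / frequency))
    return random.random() >= discard_prob

print(keep_token(0.05))   # very frequent word (like "the"): discarded most of the time
print(keep_token(1e-6))   # rare word below the threshold: always kept
```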

Dimensionality

Quality of word embedding increases with higher dimensionality. But after reaching some point, marginal gain diminishes.[1] Typically, the dimensionality of the vectors is set to be between 100 and 1,000.


Context window

The size of the context window determines how many words before and after a given word are included as context words of the given word. According to the authors' note, the recommended value is 10 for skip-gram and 5 for CBOW.[7]


Extensions

An extension of word2vec to construct embeddings from entire documents (rather than the individual words) has been proposed.[10] This extension is called paragraph2vec or doc2vec and has been implemented in the C, Python[11][12] and Java/Scala[13] tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.

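A minimal sketch using Gensim's Doc2Vec class, one of the implementations referenced above (the toy documents and parameter values are illustrative only):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["word2vec", "embeds", "individual", "words"], tags=["doc0"]),
    TaggedDocument(words=["doc2vec", "embeds", "whole", "documents"], tags=["doc1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=20)

# Infer an embedding for a new, unseen document (the inference step mentioned above).
new_vector = model.infer_vector(["embedding", "a", "new", "document"])
print(new_vector[:5])
```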

Word vectors for bioinformatics: BioVectors

An extension of word vectors for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications has been proposed by Asgari and Mofrad.[14] Named bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences), and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics. The results suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.[14] A similar variant, dna2vec, has shown that there is correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec word vectors.[15]

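The core idea of treating overlapping n-grams of a sequence as "words" can be sketched as follows (an illustration of the general approach, not the authors' released code; the sequences, the choice of k=3, and the parameters are invented):

```python
from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    """Split a sequence into overlapping k-mer "words", e.g. ATGCG -> ATG, TGC, GCG."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequences = ["ATGCGTACGTTAG", "ATGCCGTACGATG"]   # toy DNA sequences
corpus = [to_kmers(seq) for seq in sequences]    # each sequence becomes one "sentence" of k-mers

model = Word2Vec(corpus, vector_size=32, window=5, min_count=1)
print(model.wv.most_similar("ATG", topn=3))      # k-mers whose vectors are closest to "ATG"
```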

Word vectors for radiology: Intelligent word embedding (IWE)

An extension of word vectors for creating a dense vector representation of unstructured radiology reports has been proposed by Banerjee et al.[16] One of the biggest challenges with Word2vec is how to handle unknown or out-of-vocabulary (OOV) words and morphologically similar words. If the Word2vec model has not encountered a particular word before, it will be forced to use a random vector, which is generally far from its ideal representation. This can particularly be an issue in domains like medicine, where synonyms and related words can be used depending on the preferred style of the radiologist, and words may have been used infrequently in a large corpus.


IWE combines Word2vec with a semantic dictionary mapping technique to tackle the major challenges of information extraction from clinical texts, which include ambiguity of free-text narrative style, lexical variations, use of ungrammatical and telegraphic phrases, arbitrary ordering of words, and frequent appearance of abbreviations and acronyms. Of particular interest, the IWE model (trained on one institutional dataset) successfully translated to a different institutional dataset, which demonstrates good generalizability of the approach across institutions.


Analysis

The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. However, they note that this explanation is "very hand-wavy" and argue that a more formal explanation would be preferable.[5]


Levy et al. (2015)[17] show that much of the superior performance of word2vec or similar embeddings in downstream tasks is not a result of the models per se, but of the choice of specific hyperparameters. Transferring these hyperparameters to more 'traditional' approaches yields similar performances in downstream tasks. Arora et al. (2016)[18] explain word2vec and related algorithms as performing inference for a simple generative model for text, which involves a random-walk generation process based on a log-linear topic model. They use this to explain some properties of word embeddings, including their use to solve analogies.


Preservation of semantic and syntactic relationships

The word embedding approach is able to capture multiple different degrees of similarity between words. Mikolov et al. (2013)[19] found that semantic and syntactic patterns can be reproduced using vector arithmetic. Patterns such as "Man is to Woman as Brother is to Sister" can be generated through algebraic operations on the vector representations of these words such that the vector representation of "Brother" - "Man" + "Woman" produces a result which is closest to the vector representation of "Sister" in the model. Such relationships can be generated for a range of semantic relations (such as Country–Capital) as well as syntactic relations (e.g. present tense–past tense).

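With a pretrained model, this analogy arithmetic can be reproduced through Gensim's most_similar method; the sketch below assumes a pretrained word2vec-format vector file is available locally (the file name is a placeholder, e.g. the publicly distributed Google News vectors):

```python
from gensim.models import KeyedVectors

# Placeholder path: any pretrained vectors in word2vec binary format would do.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vector("brother") - vector("man") + vector("woman") should land nearest "sister".
print(wv.most_similar(positive=["brother", "woman"], negative=["man"], topn=1))
```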

Assessing the quality of a model

Mikolov et al. (2013)[1] develop an approach to assessing the quality of a word2vec model which draws on the semantic and syntactic patterns discussed above. They developed a set of 8,869 semantic relations and 10,675 syntactic relations which they use as a benchmark to test the accuracy of a model. When assessing the quality of a vector model, a user may draw on this accuracy test which is implemented in word2vec,[20] or develop their own test set which is meaningful to the corpora which make up the model. This approach offers a more challenging test than simply arguing that the words most similar to a given test word are intuitively plausible.[1]

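Gensim bundles the Mikolov et al. analogy questions ("questions-words.txt") and can score a trained model against them; the sketch below assumes a model trained on a reasonably large corpus (the corpus file name is a placeholder, and a toy corpus would leave most questions out of vocabulary):

```python
from gensim.models import Word2Vec
from gensim.test.utils import datapath

# Placeholder corpus file in LineSentence format (one sentence per line, whitespace-tokenized).
model = Word2Vec(corpus_file="large_corpus.txt", vector_size=300, window=5, min_count=5)

score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"overall analogy accuracy: {score:.3f}")
```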

Parameters and model quality

The use of different model parameters and different corpus sizes can greatly affect the quality of a word2vec model. Accuracy can be improved in a number of ways, including the choice of model architecture (CBOW or Skip-Gram), increasing the training data set, increasing the number of vector dimensions, and increasing the window size of words considered by the algorithm. Each of these improvements comes with the cost of increased computational complexity and therefore increased model generation time.[1]


In models using large corpora and a high number of dimensions, the skip-gram model yields the highest overall accuracy, and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases. However, CBOW is less computationally expensive and yields similar accuracy results.[1]


Overall, accuracy increases with the number of words used and the number of dimensions. Mikolov et al.[1] report that doubling the amount of training data results in an increase in computational complexity equivalent to doubling the number of vector dimensions.


Altszyler and coauthors (2017) studied Word2vec performance in two semantic tests for different corpus sizes.[21] They found that Word2vec has a steep learning curve, outperforming another word-embedding technique (LSA) when it is trained with a medium to large corpus size (more than 10 million words). However, with a small training corpus, LSA showed better performance. Additionally, they show that the best parameter setting depends on the task and the training corpus. Nevertheless, for skip-gram models trained on medium-sized corpora, 50 dimensions, a window size of 15, and 10 negative samples seem to be a good parameter setting.

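Expressed in Gensim's parameter names, that reported setting would look roughly like the following (the mapping of names is an assumption, and the corpus file is a placeholder for a medium-sized corpus):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="medium_corpus.txt",  # placeholder: LineSentence-format text file
    sg=1,             # skip-gram
    vector_size=50,   # 50 dimensions
    window=15,        # window size of 15
    negative=10,      # 10 negative samples
)
```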

See also


  • Autoencoder
  • Document-term matrix
  • Feature extraction
  • Feature learning
  • Neural network language models
  • Vector space model
  • Thought vector
  • fastText
  • GloVe
  • Normalized compression distance

References

  1. Mikolov, Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
  2. Mikolov, Tomas; et al. "Computing numeric representations of words in a high-dimensional space". US Patent 9,037,464, assigned to Google Inc.
  3. Mikolov, Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
  4. Mikolov, Tomas (2013). "Distributed representations of words and phrases and their compositionality". Advances in Neural Information Processing Systems. arXiv:1310.4546.
  5. Goldberg, Yoav; Levy, Omer (2014). "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method". arXiv:1402.3722 [cs.CL].
  6. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). "Distributed representations of words and phrases and their compositionality". Advances in Neural Information Processing Systems. arXiv:1310.4546. Bibcode:2013arXiv1310.4546M.
  7. "Google Code Archive - Long-term storage for Google Code Project Hosting". code.google.com. Retrieved 13 June 2016.
  8. "Parameter (hs & negative)". Google Groups. Retrieved 13 June 2016.
  9. "Visualizing Data using t-SNE" (PDF). Journal of Machine Learning Research, 2008. Vol. 9, pg. 2595. Retrieved 18 March 2017.
  10. Le, Quoc; et al. (2014). "Distributed Representations of Sentences and Documents". arXiv:1405.4053 [cs.CL].
  11. "Doc2Vec tutorial using Gensim". Retrieved 2 August 2015.
  12. "Doc2vec for IMDB sentiment analysis". GitHub. Retrieved 18 February 2016.
  13. "Doc2Vec and Paragraph Vectors for Classification". Retrieved 13 January 2016.
  14. Asgari, Ehsaneddin; Mofrad, Mohammad R.K. (2015). "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLOS ONE. 10 (11): e0141287. arXiv:1503.05140. Bibcode:2015PLoSO..1041287A. doi:10.1371/journal.pone.0141287. PMC 4640716. PMID 26555596.
  15. Ng, Patrick (2017). "dna2vec: Consistent vector representations of variable-length k-mers". arXiv:1701.06279 [q-bio.QM].
  16. Banerjee, Imon; Chen, Matthew C.; Lungren, Matthew P.; Rubin, Daniel L. (2018). "Radiology report annotation using intelligent word embeddings: Applied to multi-institutional chest CT cohort". Journal of Biomedical Informatics. 77: 11–20. doi:10.1016/j.jbi.2017.11.012. PMC 5771955. PMID 29175548.
  17. Levy, Omer; Goldberg, Yoav; Dagan, Ido (2015). "Improving Distributional Similarity with Lessons Learned from Word Embeddings". Transactions of the Association for Computational Linguistics. 3: 211–225. doi:10.1162/tacl_a_00134.
  18. Arora, S.; et al. (Summer 2016). "A Latent Variable Model Approach to PMI-based Word Embeddings". Transactions of the Association for Computational Linguistics. 4: 385–399. doi:10.1162/tacl_a_00106 – via ACLWEB.
  19. Mikolov, Tomas; Yih, Wen-tau; Zweig, Geoffrey (2013). "Linguistic Regularities in Continuous Space Word Representations". HLT-NAACL: 746–751.
  20. "Gensim - Deep learning with word2vec". Retrieved 10 June 2016.
  21. Altszyler, E.; Ribeiro, S.; Sigman, M.; Fernández Slezak, D. (2017). "The interpretation of dream meaning: Resolving ambiguity using Latent Semantic Analysis in a small corpus of text". Consciousness and Cognition. 56: 178–187. arXiv:1610.01520. doi:10.1016/j.concog.2017.09.004. PMID 28943127. S2CID 195347873.

External links

  • Wikipedia2Vec (introduction)


Implementations

  • C
  • C#
  • Python (Spark)
  • Python (TensorFlow)
  • Python (Gensim)
  • Java/Scala
  • R



Category:Free science software Category:Natural language processing toolkits Category:Artificial neural networks Category:Machine learning Category:Semantic relations



This page was moved from wikipedia:en:Word2vec. Its edit history can be viewed at Word2Vec/edithistory