为什么要用NCE?
总结是最好的学习方法
在词向量的生成过程中,用的loss函数是NCE或negative sampling,而不是常规的softmax。在《learning tensorflow》这本书中,作者这样说道:but it is sufficient to think of it (NCE) as a sort of efficient approximation to the ordinary softmax function used in classification tasks。由此看来,NCE是softmax的一种近似,但是为什么要做这种近似,而不直接用softmax呢?下边是一个网友的回答,把他的答案贴在下边。
There are some issues with learning the word vectors using an "standard" neural network. In this way, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).
Predicting the next word is like predicting the class. That is, such a network is just a "standard" multinomial (multi-class) classifier. And this network must have as many output neurons as classes there are. When classes are actual words, the number of neurons is, well, huge.
A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.
The (a) solution
In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.
The basic idea is to convert a multinomial classification problem (as it is the problem of predicting the next word) to a binary classification problem. That is, instead of using softmax to estimate a true probability distribution of the output word, a binary logistic regression (binary classification) is used instead.
For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and a number of kk randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from corrupted ones, the classifier will ultimately learn the word vectors.
This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.
Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution, which favours less frequent words to be drawn more often.
References
[A] (2005) - Contrastive estimation: Training log-linear models on unlabeled data
[D] (2012) - A fast and simple algorithm for training neural probabilistic language models.
[E] (2013) - Learning word embeddings efficiently with noise-contrastive estimation.
热门话题 · · · · · · ( 去话题广场 )
- 我的童年阅读记忆 126.9万次浏览
- 童年涂鸦拾遗 144.3万次浏览
- 给童年时期的自己推荐一本书 112.1万次浏览
- 看展记 1.3亿次浏览
- 现代人的“卡夫卡时刻” 827次浏览
- 一人一杯一口入魂的夏日特饮 新话题 · 8246次浏览