Noise Contrastive Estimation (NCE) is a learning procedure regularly used to train neural language models, since it avoids the computational bottleneck caused by the output softmax. In this paper, we attempt to explain some of the weaknesses of this objective function and to outline directions for further development. Experiments on a small task illustrate the issues raised by the unigram noise distribution, and show that a context-dependent noise distribution, such as the bigram distribution, can resolve these issues and provide stable and data-efficient learning.
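For concreteness, the NCE objective discussed above can be sketched as follows: each target word is classified against k words drawn from a noise distribution, using the log-odds between the model and the (scaled) noise probabilities. This is a minimal, self-contained sketch of the standard per-example NCE loss; the function name and inputs are illustrative, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nce_loss(logp_model_target, logp_noise_target,
             logp_model_noise, logp_noise_noise, k):
    """Per-example NCE loss.

    logp_model_target: model log-prob of the observed (data) word
    logp_noise_target: noise-distribution log-prob of that same word
    logp_model_noise:  model log-probs of the k sampled noise words
    logp_noise_noise:  noise-distribution log-probs of those k words
    k:                 number of noise samples
    """
    log_k = math.log(k)
    # Posterior log-odds that the observed word came from the data,
    # not from the noise distribution scaled by k.
    delta_target = logp_model_target - (log_k + logp_noise_target)
    loss = -math.log(sigmoid(delta_target))
    # Each noise sample should be classified as noise.
    for lm, ln in zip(logp_model_noise, logp_noise_noise):
        delta = lm - (log_k + ln)
        loss -= math.log(1.0 - sigmoid(delta))
    return loss

# Example with k=2 noise samples; all probabilities are made up.
# A unigram noise model would supply context-independent
# logp_noise_* values; a bigram noise model would condition them
# on the preceding word.
loss = nce_loss(math.log(0.5), math.log(0.1),
                [math.log(0.01)] * 2, [math.log(0.3)] * 2, k=2)
```

The choice of noise distribution enters only through the `logp_noise_*` terms, which is why swapping the unigram for a context-dependent (e.g. bigram) distribution changes the learning dynamics without changing the form of the objective.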