Towards Understanding Linear Word Analogies

Abstract

A surprising property of word vectors is that word analogies can often be solved with vector arithmetic. However, it is unclear why arithmetic operators correspond to non-linear embedding models such as skip-gram with negative sampling (SGNS). We provide a formal explanation of this phenomenon without making the strong assumptions that past theories have made about the vector space and word distribution. Our theory has several implications. Past work has conjectured that linear substructures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information theoretic interpretation of Euclidean distance in vector spaces, justifying its use in capturing word dissimilarity.
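
To make the opening claim concrete, here is a minimal sketch of solving a word analogy with vector arithmetic. The embeddings are hypothetical, hand-built toy vectors (not trained SGNS vectors), and `solve_analogy` is an illustrative helper name, not something from the paper. It picks the candidate word whose vector has the highest cosine similarity to `b - a + c`, a common way of operationalizing the arithmetic.

```python
import numpy as np

# Hypothetical toy embeddings, hand-built for illustration only.
# The three dimensions loosely stand for: royalty, male, female.
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.2, 0.1, 0.1]),
}


def solve_analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" by vector arithmetic.

    Returns the word (other than a, b, c) whose vector has the highest
    cosine similarity to the target vector b - a + c.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are excluded as answers
        score = np.dot(vec, target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


print(solve_analogy("man", "king", "woman", vectors))
```

With these toy vectors, `king - man + woman` lands exactly on the vector for `queen`. Real embeddings only approximate this, which is why the nearest neighbor is taken rather than an exact match.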
