Compressing Word Embeddings

Abstract

Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using large-scale unlabelled text analysis. However, these representations typically consist of dense vectors that require a great deal of storage and cause the internal structure of the vector space to be opaque. A more 'idealized' representation of a vocabulary would be both compact and readily interpretable. With this goal, this paper first shows that Lloyd's algorithm can compress the standard dense vector representation by a factor of 10 without much loss in performance. Then, using that compressed size as a 'storage budget', we describe a new GPU-friendly factorization procedure to obtain a representation which gains interpretability as a side-effect of being sparse and non-negative in each encoding dimension. Word similarity and word-analogy tests are used to demonstrate the effectiveness of the compressed representations obtained.
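The first result in the abstract uses Lloyd's algorithm (classical k-means) to quantize dense embeddings at roughly 10x compression. The sketch below illustrates the general idea, not the paper's exact procedure: each embedding dimension is quantized independently to 8 levels, so every 32-bit float is replaced by a 3-bit code plus a small per-dimension codebook. All names, shapes, and parameters here are illustrative assumptions.

```python
import numpy as np

def lloyd_1d(values, k=8, iters=20, seed=0):
    """1-D Lloyd's algorithm (k-means): map each value to one of k centroids."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assignment step: nearest centroid for each value.
        codes = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            if np.any(codes == j):
                centroids[j] = values[codes == j].mean()
    return codes, centroids

# Hypothetical embedding matrix: 1000 words x 50 dims of float32.
emb = np.random.default_rng(1).normal(size=(1000, 50)).astype(np.float32)

codes = np.empty(emb.shape, dtype=np.uint8)        # 3 bits of information per entry
books = np.empty((emb.shape[1], 8), dtype=np.float32)
for d in range(emb.shape[1]):
    codes[:, d], books[d] = lloyd_1d(emb[:, d])

# Reconstruction: look up each code in its dimension's codebook.
recon = books[np.arange(emb.shape[1])[None, :], codes]
```

At 3 bits per entry versus 32, the codes alone give about a 10.7x reduction, with only the small codebooks as overhead; the quantization error is the price paid for the "without much loss in performance" claim.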
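The second contribution is a factorization whose per-word codes are sparse and non-negative in each encoding dimension, which is what yields interpretability. A minimal sketch of one such scheme is below: alternating least squares for an unconstrained dictionary with an ISTA-style projected gradient step (L1 shrinkage clipped at zero) for the codes. The paper's actual GPU-friendly procedure differs; all variable names and hyperparameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(200, 20)).astype(np.float32)  # toy embedding matrix (vocab x dims)

V, d = E.shape
k = 40            # over-complete code width (assumed)
lam = 0.05        # L1 sparsity weight (assumed)
A = np.abs(rng.normal(size=(V, k))) * 0.1          # non-negative codes, one row per word
W = rng.normal(size=(k, d))                        # unconstrained dictionary

for _ in range(200):
    # Dictionary update: exact least squares given the current codes A.
    W = np.linalg.lstsq(A, E, rcond=None)[0]
    # Code update: one projected gradient step on ||E - A W||^2 + lam*||A||_1,
    # with step size 1/L from the gradient's Lipschitz constant, clipped at 0.
    G = (A @ W - E) @ W.T
    step = 1.0 / (np.linalg.norm(W @ W.T, 2) + 1e-6)
    A = np.maximum(A - step * (G + lam), 0.0)
```

The clip-at-zero shrinkage drives many entries of `A` exactly to zero, so each word ends up described by a handful of non-negative weights over dictionary rows, which is the property the abstract credits for interpretability.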
