European Conference on IR Research

Aggregating Neural Word Embeddings for Document Representation



Abstract

Recent advances in natural language processing (NLP) have shown that semantically meaningful representations of words can be efficiently acquired by distributed models. In this view, a text document can be treated as a bag-of-word-embeddings (BoWE), and the remaining question is how to obtain a fixed-length vector representation of the document for efficient document processing. Beyond heuristic aggregation methods, recent work has shown that the Fisher kernel (FK) framework can be leveraged to generate document representations from BoWE in a principled way. In that work, words are embedded into a Euclidean space by latent semantic indexing (LSI), and a Gaussian mixture model (GMM) serves as the generative model for nonlinear FK-based aggregation. In this work, we propose an alternative FK-based aggregation method for document representation built on neural word embeddings. Neural embedding models have been shown to produce significantly better word representations than LSI, and semantic relations between neural word embeddings are typically measured by cosine similarity rather than Euclidean distance. We therefore introduce a mixture of von Mises-Fisher distributions (moVMF) as the generative model of neural word embeddings and derive a new FK-based aggregation method for document representation based on BoWE. We report document classification, clustering and retrieval experiments and demonstrate that our model produces state-of-the-art performance compared with existing baseline methods.
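The aggregation step described in the abstract can be pictured with a short sketch. Assuming pre-trained neural word embeddings and an already-fitted moVMF (mixture weights, unit mean directions, concentrations), the snippet below computes a Fisher-vector-style document representation: each L2-normalized word vector is softly assigned to mixture components via posterior responsibilities, and gradients of the log-likelihood with respect to the component mean directions are accumulated into one fixed-length vector. The helper names, the random stand-in parameters, and the power/L2 post-normalization are illustrative conventions from the Fisher-vector literature, not the paper's exact derivation.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v

def log_vmf_norm_const(kappa, d):
    """log C_d(kappa) of the von Mises-Fisher density on the unit sphere in R^d."""
    v = d / 2.0 - 1.0
    log_iv = np.log(ive(v, kappa)) + kappa  # log I_v(kappa), computed stably
    return v * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_iv

def movmf_responsibilities(X, weights, mus, kappas):
    """Posterior P(component k | word vector x) under a fitted moVMF.
    X: (n, d) unit-normalized word embeddings; mus: (K, d) unit mean directions."""
    d = X.shape[1]
    # log alpha_k + log C_d(kappa_k) + kappa_k * mu_k^T x  for every word/component pair
    log_p = (np.log(weights)[None, :]
             + log_vmf_norm_const(kappas, d)[None, :]
             + X @ (mus * kappas[:, None]).T)           # (n, K)
    log_p -= log_p.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def fisher_vector(doc_embeddings, weights, mus, kappas):
    """Aggregate one document's word embeddings into a fixed-length vector via
    responsibility-weighted gradients of the moVMF log-likelihood w.r.t. the means."""
    X = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    gamma = movmf_responsibilities(X, weights, mus, kappas)      # (n, K)
    # gradient of the log component density w.r.t. mu_k (unit-norm constraint ignored)
    # is kappa_k * x; weight by responsibilities and sum over the document's words
    G = (gamma.T @ X) * kappas[:, None]                          # (K, d)
    fv = G.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                       # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                     # L2 normalization

# Toy usage: random parameters stand in for a fitted moVMF and trained embeddings.
rng = np.random.default_rng(0)
d, K = 50, 4
mus = rng.normal(size=(K, d)); mus /= np.linalg.norm(mus, axis=1, keepdims=True)
kappas = np.full(K, 20.0)
weights = np.full(K, 1.0 / K)
doc = rng.normal(size=(120, d))                  # 120 word vectors for one document
print(fisher_vector(doc, weights, mus, kappas).shape)   # (K * d,) = (200,)
```

The resulting K*d-dimensional vector plays the role of the fixed-length document representation that the FK framework produces from a BoWE; the cosine-based geometry enters through the vMF components, which score words by the inner product with unit mean directions rather than by Euclidean distance.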
