Journal: International Journal of Data Science and Analytics

A document representation framework with interpretable features using pre-trained word embeddings



Abstract

We propose an improved framework for document representation using word embeddings. Existing models represent a document as a position vector in the same space as the word embeddings. As a result, they cannot capture the multiple aspects of a document or its broad context. Their low representational power also leads to poor performance at document classification, and the document vectors they produce have uninterpretable features. In this paper, we propose a document representation framework that captures multiple aspects of a document with interpretable features. In this framework, a document is represented in a different feature space in which each dimension corresponds to a potential feature word with relatively high discriminating power. A given document is modeled as the distances between the feature words and the document. To build this representation, we propose two criteria for selecting potential feature words and a distance function that measures the distance between a feature word and the document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and leaves scope for developing new approaches that improve document representation and its applications.
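The core idea, one dimension per feature word with the value given by a word-to-document distance, can be illustrated with a minimal sketch. The abstract does not specify the two selection criteria or the distance function, so everything concrete here is an assumption made for illustration: the feature words are supplied by hand, the distance is taken to be the minimum cosine distance between the feature word and any word in the document, and the names `cosine_distance`, `represent_document`, and the toy `embeddings` table are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def represent_document(tokens, feature_words, embeddings):
    """Map a tokenized document to one value per feature word.

    Assumption: the word-to-document distance is the minimum cosine
    distance between the feature word and any in-vocabulary document word.
    The paper's actual distance function may differ.
    """
    doc_vecs = [embeddings[t] for t in tokens if t in embeddings]
    features = []
    for fw in feature_words:
        if not doc_vecs:
            # Arbitrary fallback for documents with no known words.
            features.append(1.0)
            continue
        features.append(min(cosine_distance(embeddings[fw], v) for v in doc_vecs))
    return np.array(features)

# Toy 3-dimensional embeddings standing in for pre-trained vectors.
embeddings = {
    "goal": np.array([1.0, 0.0, 0.0]),
    "match": np.array([0.9, 0.1, 0.0]),
    "election": np.array([0.0, 1.0, 0.0]),
    "vote": np.array([0.1, 0.9, 0.0]),
    "the": np.array([0.0, 0.0, 1.0]),
}

# Hand-picked feature words; the paper selects them with two criteria
# that are not described in the abstract.
feature_words = ["goal", "election"]

doc = ["the", "match", "the"]
vec = represent_document(doc, feature_words, embeddings)
print(dict(zip(feature_words, vec.round(3))))
```

Each entry of the resulting vector is tied to a named feature word, which is what makes the representation interpretable: in this toy case the document is much closer to "goal" than to "election". The vector can then be passed to any standard classifier.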
