A document representation framework with interpretable features using pre-trained word embeddings

Narendra Babu Unnam; P. Krishna Reddy


Abstract

We propose an improved framework for document representation using word embeddings. Existing models represent a document as a position vector in the same space as the word embeddings. As a result, they cannot capture the multiple aspects or the broad context of the document. Also, owing to their low representational power, existing approaches perform poorly at document classification. Furthermore, the document vectors obtained with such methods have uninterpretable features. In this paper, we propose an improved document representation framework that captures multiple aspects of the document with interpretable features. In this framework, a document is represented in a different feature space, in which each dimension corresponds to a potential feature word with relatively high discriminating power. A given document is modeled as the set of distances between the feature words and the document. To build this representation, we propose two criteria for selecting potential feature words and a distance function that measures the distance between a feature word and the document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification. The proposed approach is simple and represents the document with interpretable word features. Overall, the proposed model provides an alternative framework for representing larger text units with word embeddings and opens scope for new approaches that improve document representation and its applications.
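The core idea in the abstract, one dimension per feature word with the value being a word-to-document distance, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy embedding values are invented, the feature words are hand-picked (the abstract does not state the two selection criteria), and the distance used, the smallest cosine distance between the feature word and any in-vocabulary token of the document, is a stand-in and not necessarily the paper's distance function.

```python
import math

# Toy 3-dimensional "pre-trained" embeddings. The values are invented for
# illustration only; a real setup would load vectors such as word2vec or GloVe.
EMBEDDINGS = {
    "goal":     [0.9, 0.1, 0.0],
    "match":    [0.8, 0.2, 0.1],
    "election": [0.1, 0.9, 0.1],
    "vote":     [0.0, 0.8, 0.2],
    "sports":   [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def word_to_document_distance(feature_word, tokens, embeddings):
    """Assumed distance (not taken from the paper): the smallest cosine
    distance between the feature word and any in-vocabulary document token.
    Returns 1.0 when no token of the document has an embedding."""
    feature_vec = embeddings[feature_word]
    distances = [
        cosine_distance(feature_vec, embeddings[t])
        for t in tokens
        if t in embeddings
    ]
    return min(distances) if distances else 1.0


def represent(tokens, feature_words, embeddings):
    """Map a tokenized document to {feature word: distance}, so each
    dimension of the representation is named by a readable word."""
    return {
        w: word_to_document_distance(w, tokens, embeddings)
        for w in feature_words
    }


# Hand-picked feature words (hypothetical; the paper selects them by criteria).
FEATURE_WORDS = ["sports", "politics"]

doc = ["goal", "match", "referee"]  # "referee" is out of vocabulary and skipped
vector = represent(doc, FEATURE_WORDS, EMBEDDINGS)
print(vector)  # the "sports" distance is smaller than the "politics" distance
```

Because every dimension is labeled by a word, the resulting vector can be read directly: a small value for "sports" says the document lies close to that feature word. The dictionary values, taken in a fixed feature-word order, could then be passed to any standard classifier.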


Bibliographic details

Source
International Journal of Data Science and Analytics, 2020, Issue 1, pp. 49-64 (16 pages)
Authors
Narendra Babu Unnam; P. Krishna Reddy
Affiliations

Kohli Centre on Intelligent Systems, IIIT Hyderabad, Hyderabad, India (both authors)
Original format: PDF
Language: English
Keywords
Text mining; Feature engineering; Document representation; Document classification; Word embeddings

