Turkish Journal of Electrical Engineering and Computer Sciences

Selective word encoding for effective text representation

Abstract

Determining the category of a text document from its semantic content is a well-motivated problem that has been studied extensively across many applications. A compact representation of the text is a fundamental step toward accurate results in these applications, and much research has concentrated on improving it. In particular, methods that aggregate word-level representations are the mainstream approach to the problem. In this paper, we address text representation with the aim of achieving high performance across different text classification tasks. We present three contributions. First, to encode the word-level representations of each text, we adapt a trainable orderless aggregation algorithm that transforms word vectors into a more discriminative text-level representation. Second, we propose an effective term-weighting scheme that computes the relative importance of words from their context, based on their relevance to the task, in an end-to-end learning manner. Third, we present a weighted loss function to mitigate class imbalance between the categories. To evaluate performance, we collect two distinct datasets whose data is available on the web: Turkish parliament records (i.e. written speeches of four major political parties, with 30731/7683 train/test documents) and newspaper articles (i.e. daily articles of columnists, with 16000/3200 train/test documents). The proposed method significantly improves on the baseline techniques (i.e. VLAD and Fisher Vector) and achieves prediction accuracies of 0.823 and 0.878 for party membership and for article category estimation, respectively. These results validate that the proposed contributions (i.e. the trainable word-encoding model, the trainable term-weighting scheme and the weighted loss function) significantly outperform the baselines.
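
The abstract names three ingredients without giving their formulas, so the following is only a minimal sketch of how such pieces are commonly built, not the authors' implementation. It assumes a VLAD-style soft-assignment aggregation, a softmax over per-word scores as the term weighting, and inverse-frequency class weights for the loss. All function names, the `alpha` sharpness parameter and the weighting heuristic are hypothetical choices made for illustration. In an end-to-end model the centers and word scores would be learned; here they are plain inputs.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(word_vecs, centers, word_scores, alpha=1.0):
    """Orderless, weighted soft-assignment aggregation (VLAD-style sketch).

    word_vecs   : N word vectors, each of length D
    centers     : K cluster centers, each of length D (assumed trainable)
    word_scores : N raw importance scores (assumed to come from a learned scorer)
    Returns an L2-normalised vector of length K * D.
    """
    K, D = len(centers), len(centers[0])
    weights = softmax(word_scores)  # relative importance of each word
    out = [[0.0] * D for _ in range(K)]
    for x, w in zip(word_vecs, weights):
        # Soft assignment: nearer centers receive a larger share of the word.
        logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                  for c in centers]
        assign = softmax(logits)
        for k, c in enumerate(centers):
            for d in range(D):
                # Accumulate the weighted residual between word and center.
                out[k][d] += w * assign[k] * (x[d] - c[d])
    flat = [v for row in out for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # avoid division by zero
    return [v / norm for v in flat]

def inverse_frequency_weights(class_counts):
    """Class weights inversely proportional to class frequency (a common heuristic)."""
    total, k = sum(class_counts), len(class_counts)
    return [total / (k * n) for n in class_counts]

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -class_weights[label] * math.log(probs[label])
```

Because the aggregation is a sum over words, shuffling the words of a document (together with their scores) leaves the output unchanged up to floating-point rounding, which is what "orderless" refers to. The class weights make errors on rare categories cost more than errors on frequent ones.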
