Journal: Information Sciences: An International Journal

Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec

Abstract

The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency-inverse document frequency (TF-IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions. (C) 2018 Elsevier Inc. All rights reserved.
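
The co-training idea summarized in the abstract can be illustrated with a minimal sketch. This is a generic co-training loop over several row-aligned feature views, not the authors' MCT algorithm: the nearest-centroid classifier, the distance-margin confidence score, the one-document-per-view-per-round schedule, and the names `centroids`, `predict`, and `co_train` are all assumptions made for illustration. The toy vectors stand in for TF-IDF, LDA, and Doc2Vec representations and are invented, not taken from the paper.

```python
from math import dist


def centroids(X, y):
    """Return the mean feature vector for each class label."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(cents, x):
    """Return (label, confidence) for one document.

    Confidence is the gap between the distances to the second-nearest
    and the nearest class centroid (an assumed, illustrative score).
    """
    ranked = sorted((dist(c, x), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin


def co_train(views, labels, rounds=10):
    """Generic co-training over several views of the same documents.

    views:  list of feature matrices, one per representation, row-aligned.
    labels: one entry per document, a class label or None if unlabeled.
    Each view in turn pseudo-labels the unlabeled document it is most
    confident about; the new label is shared with all other views.
    """
    labels = list(labels)
    for _ in range(rounds):
        for X in views:
            known = [i for i, lab in enumerate(labels) if lab is not None]
            unknown = [i for i, lab in enumerate(labels) if lab is None]
            if not unknown:
                return labels
            cents = centroids([X[i] for i in known],
                              [labels[i] for i in known])
            scored = [(predict(cents, X[i]), i) for i in unknown]
            (label, _margin), best = max(scored, key=lambda s: s[0][1])
            labels[best] = label
    return labels


# Toy data: six documents, three hypothetical views, two labeled examples.
view_tfidf = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
              [1.0, 0.9], [0.9, 1.0], [0.8, 0.8]]
view_lda = [[0.1], [0.2], [0.3], [0.9], [0.8], [0.7]]
view_doc2vec = [[0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.3, 0.2, 0.2],
                [1.0, 1.0, 0.9], [0.9, 0.8, 1.0], [0.7, 0.8, 0.9]]
initial_labels = ["a", None, None, "b", None, None]

result = co_train([view_tfidf, view_lda, view_doc2vec], initial_labels)
print(result)
```

In practice each view would be produced by a real feature extractor and a stronger base classifier, and pseudo-labels would usually be accepted only above a confidence threshold; those choices are left out here to keep the sketch short.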
