Journal: Expert Systems with Applications

Ensemble multi-label text categorization based on rotation forest and latent semantic indexing



Abstract

Text categorization has gained increasing popularity in recent years due to the explosive growth of multimedia documents. As a document can be associated with multiple non-exclusive categories simultaneously (e.g., Virus, Health, Sports, and Olympic Games), text categorization provides many opportunities for developing novel multi-label learning approaches devoted specifically to textual data. In this paper, we propose an ensemble multi-label classification method for text categorization based on four key ideas: (1) performing Latent Semantic Indexing based on distinct orthogonal projections onto lower-dimensional concept spaces; (2) random splitting of the vocabulary; (3) document bootstrapping; and (4) the use of BoosTexter as a powerful multi-label base learner for text categorization, so as to simultaneously encourage diversity and individual accuracy in the committee. Diversity of the ensemble is promoted through random splits of the vocabulary, which lead to different orthogonal projections onto lower-dimensional latent concept spaces. Accuracy of the committee members is promoted through the underlying latent semantic structure uncovered in the text. The combination of rotation-based ensemble construction and Latent Semantic Indexing projection is shown to bring about significant improvements in terms of Average Precision, Coverage, Ranking loss and One-error compared to five state-of-the-art approaches across 14 real-world textual data sets covering a wide variety of topics including health, education, business, science and arts. (C) 2016 Elsevier Ltd. All rights reserved.
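
The four ideas listed in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names and parameters (`fit_rotation_lsi_ensemble`, `predict_scores`, `n_subsets`, `n_concepts`, `ridge`) are hypothetical, a per-label ridge regression stands in for BoosTexter, and fitting each member on all documents after rotation is an assumption borrowed from the usual Rotation Forest recipe, since the abstract does not specify these details.

```python
import numpy as np


def fit_rotation_lsi_ensemble(X, Y, n_members=5, n_subsets=3, n_concepts=2,
                              ridge=1e-3, seed=0):
    """Fit a rotation-style committee with one LSI projection per term subset.

    X: (n_docs, n_terms) term-weight matrix; Y: (n_docs, n_labels) 0/1 matrix.
    Returns a list of (rotation, weights) pairs, one per committee member.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    members = []
    for _ in range(n_members):
        # Idea (2): random split of the vocabulary into disjoint term subsets.
        subsets = np.array_split(rng.permutation(n_terms), n_subsets)
        # Idea (3): bootstrap the documents (sampling with replacement).
        boot = rng.integers(0, n_docs, size=n_docs)
        # Idea (1): LSI per subset via SVD of the bootstrapped sub-matrix;
        # the top right singular vectors span that subset's concept space.
        rotation = np.zeros((n_terms, n_subsets * n_concepts))
        for j, terms in enumerate(subsets):
            _, _, vt = np.linalg.svd(X[np.ix_(boot, terms)], full_matrices=False)
            k = min(n_concepts, vt.shape[0])
            rotation[terms, j * n_concepts:j * n_concepts + k] = vt[:k].T
        # Idea (4), simplified: a ridge-regression scorer replaces BoosTexter.
        # It is fitted on all documents projected into the concept space.
        Z = np.hstack([X @ rotation, np.ones((n_docs, 1))])  # add bias column
        W = np.linalg.solve(Z.T @ Z + ridge * np.eye(Z.shape[1]), Z.T @ Y)
        members.append((rotation, W))
    return members


def predict_scores(members, X):
    """Average the per-label scores of all committee members."""
    scores = []
    for rotation, W in members:
        Z = np.hstack([X @ rotation, np.ones((X.shape[0], 1))])
        scores.append(Z @ W)
    return np.mean(scores, axis=0)


if __name__ == "__main__":
    # Toy data only: random term weights and random label assignments.
    toy_rng = np.random.default_rng(42)
    X_toy = toy_rng.random((30, 12))
    Y_toy = (toy_rng.random((30, 4)) > 0.6).astype(float)
    committee = fit_rotation_lsi_ensemble(X_toy, Y_toy)
    ranking = np.argsort(-predict_scores(committee, X_toy), axis=1)
    print(ranking[:3])  # label indices per document, best-scored first
```

Because each member draws its own vocabulary split and bootstrap sample, the members see different concept spaces, which is the source of diversity the abstract describes. Ranking-based measures such as One-error or Coverage would then be computed from the averaged score matrix.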
