
Word co-occurrence features for text classification

Abstract

In this article we propose a data treatment strategy to generate new discriminative features, called compound-features (or c-features), for text classification. These c-features are composed of terms that co-occur in documents, with no restriction on the order of the terms or the distance between them within a document. The strategy is applied before the classification task, in order to enrich documents with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent to the bag-of-words representation are reduced. We use c-features composed of two terms in order to make their usage computationally feasible while improving classifier effectiveness. We test this approach with several classification algorithms and single-label multi-class text collections. Experimental results show gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-average F1 in the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-average F1 in the OHSUMED collection).
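The idea of two-term c-features can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how discriminative pairs are selected, so the `min_df` document-frequency threshold below is a hypothetical stand-in for that step, and the function names and the `a+b` naming of pair features are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def pair_features(tokens):
    """Return the set of unordered term pairs that co-occur in one document."""
    vocab = sorted(set(tokens))  # term order and distance are ignored
    return {f"{a}+{b}" for a, b in combinations(vocab, 2)}


def augment_with_c_features(docs, min_df=2):
    """Append pair features seen in at least `min_df` documents to each document.

    `min_df` is a hypothetical placeholder for a real discriminative-power filter.
    """
    per_doc = [pair_features(doc) for doc in docs]
    # Document frequency of each pair across the collection.
    df = Counter(pair for pairs in per_doc for pair in pairs)
    kept = {pair for pair, count in df.items() if count >= min_df}
    # Single-term features are kept; selected pairs are added alongside them.
    return [list(doc) + sorted(pairs & kept) for doc, pairs in zip(docs, per_doc)]


docs = [
    ["apple", "pie", "recipe"],
    ["apple", "stock", "price"],
    ["recipe", "apple", "pie"],
]
augmented = augment_with_c_features(docs, min_df=2)
```

In this toy collection the first and third documents gain the pairs `apple+pie`, `apple+recipe` and `pie+recipe`, while the second document is left unchanged because none of its pairs recurs. Restricting pairs to two terms keeps the number of candidates quadratic in the per-document vocabulary, which is in line with the feasibility argument made in the abstract.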

Bibliographic record

  • Source
    Information Systems, 2011, No. 5, pp. 843-858 (16 pages)
  • Author affiliations

    EconoInfo Research, Belo Horizonte, Brazil; Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;

    Universidade Federal de São João del-Rei, Computer Science Department, São João del-Rei, Brazil;

    Universidade Federal de Goiás, Institute of Informatics, Goiânia, Brazil;

    Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;

    Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;

    Universidade Federal de Minas Gerais, Computer Science Department, Belo Horizonte, Brazil;

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    classification; text mining; feature extraction;

  • Added to database: 2022-08-18 02:47:59
