
Exploiting probabilistic topic models to improve text categorization under class imbalance



Abstract

In text categorization, the number of documents often differs across categories, i.e., the class distribution is imbalanced. We propose an approach that improves text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples for rare classes (categories with a relatively small amount of training data) using the global semantic information of each class, as represented by probabilistic topic models. The numbers of samples in different categories thereby become more balanced, and categorization performance can be improved by training on the transformed data set. The proposed approach differs from traditional re-sampling methods, which try to balance the number of documents across classes by re-sampling the documents in rare classes and can therefore cause overfitting. Another benefit of our approach is its effective handling of noisy samples: because all new samples are generated by topic models, the impact of noisy samples is greatly reduced. Finally, as the experimental results demonstrate, the proposed approach achieves better performance under class imbalance and is more tolerant of noisy samples.
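
The sample-generation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an LDA-style topic model has already been fitted on the rare class, and the names `topic_word`, `alpha`, `generate_document`, and `balance_with_topic_model` are hypothetical. Here the synthetic documents are simply appended to the original rare-class documents; the abstract does not specify how the paper combines the two.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one topic mixture from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one synthetic document with an LDA-style generative process.

    topic_word[k][w] is the (assumed, pre-fitted) probability of word w
    under topic k for the rare class.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topics = range(len(topic_word))
    words = []
    for _ in range(length):
        k = rng.choices(topics, weights=theta)[0]                    # pick a topic
        w = rng.choices(range(len(vocab)), weights=topic_word[k])[0]  # pick a word
        words.append(vocab[w])
    return words


def balance_with_topic_model(rare_docs, target_count, topic_word, vocab, alpha, rng):
    """Top up a rare class with synthetic documents until it has target_count."""
    avg_len = max(1, round(sum(len(d) for d in rare_docs) / len(rare_docs)))
    missing = max(0, target_count - len(rare_docs))
    synthetic = [
        generate_document(topic_word, vocab, alpha, avg_len, rng)
        for _ in range(missing)
    ]
    return rare_docs + synthetic


# Toy usage with made-up parameters (illustrative only, not from the paper).
rng = random.Random(0)
vocab = ["goal", "match", "team", "vote", "party", "law"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # hypothetical "politics" topic
]
rare_docs = [["goal", "team"], ["match", "goal", "team", "team"]]
balanced = balance_with_topic_model(rare_docs, 5, topic_word, vocab, [0.5, 0.5], rng)
print(len(balanced))
```

Because each synthetic document is drawn from class-level word distributions instead of being copied from an individual training document, a single noisy document has limited influence on any generated sample, which is the intuition the abstract gives for noise tolerance.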

Bibliographic record

  • Source
    Information Processing & Management | 2011, Issue 2 | pp. 202-214 | 13 pages
  • Author affiliations

    School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

    School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

    Department of Management Science and Information Systems, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901-8554, USA;

    School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

    School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

  • Indexing information
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords

    class imbalance; rare class analysis; text categorization; probabilistic topic model; noisy data;

  • Date indexed: 2022-08-17 23:20:18
