Exploiting probabilistic topic models to improve text categorization under class imbalance

Enhong Chen; Yanggang Lin; Hui Xiong; Qiming Luo; Haiping Ma


Abstract

In text categorization, the numbers of documents in different categories often differ, i.e., the class distribution is imbalanced. We propose an approach that improves text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples for rare classes (categories with a relatively small amount of training data) by using the global semantic information of each class, as represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved by training on the transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the existing documents in rare classes and can therefore cause overfitting. Another benefit of our approach is its effective handling of noisy samples: since all the new samples are generated by topic models, the impact of noisy samples is substantially reduced. Finally, as demonstrated by the experimental results, the proposed method achieves better performance under class imbalance and is more tolerant of noisy samples.
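The abstract does not spell out the generation procedure, so the following is only a minimal sketch of the general idea, assuming an LDA-style generative process. The names `generate_document` and `balance_rare_class`, the class-level topic mixture `theta`, the topic-word distributions `phi`, and all numeric values are hypothetical and hand-written for illustration; in practice they would come from a topic model fitted on the rare class, and the authors' actual method may differ.

```python
import random


def generate_document(theta, phi, length, rng):
    """Sample one synthetic bag-of-words document from an LDA-style model.

    theta: topic probabilities for the rare class (assumed to sum to 1).
    phi:   one dict per topic, mapping word -> probability within that topic.
    """
    topic_ids = list(range(len(theta)))
    words = []
    for _ in range(length):
        # Draw a topic for this token, then draw a word from that topic.
        k = rng.choices(topic_ids, weights=theta, k=1)[0]
        vocab = list(phi[k].keys())
        probs = list(phi[k].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words


def balance_rare_class(n_rare, n_target, theta, phi, length=20, seed=0):
    """Generate enough synthetic documents to raise the rare class to n_target."""
    rng = random.Random(seed)
    n_new = max(0, n_target - n_rare)
    return [generate_document(theta, phi, length, rng) for _ in range(n_new)]


# Hypothetical topic model for one rare class (illustrative values only).
theta = [0.7, 0.3]
phi = [
    {"grain": 0.5, "wheat": 0.3, "export": 0.2},
    {"price": 0.6, "market": 0.4},
]

# The rare class has 5 real documents; the target class size is 50.
synthetic = balance_rare_class(n_rare=5, n_target=50, theta=theta, phi=phi)
print(len(synthetic), synthetic[0][:5])
```

Because each synthetic document is drawn from class-level distributions rather than copied from an individual training document, a single mislabeled or noisy document has limited influence on any one generated sample, which is the intuition behind the noise tolerance claimed in the abstract.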

Bibliographic details

Source
Information Processing & Management, 2011, Issue 2, pp. 202-214 (13 pages)
Authors
Enhong Chen; Yanggang Lin; Hui Xiong; Qiming Luo; Haiping Ma
Author affiliations

School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

Department of Management Science and Information Systems, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901-8554, USA;

School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

School of Computer Science and Technology, P.O. Box 4, Hefei, Anhui 230027, PR China;

Original format: PDF
Language: English
Keywords
class imbalance; rare class analysis; text categorization; probabilistic topic model; noisy data;

Date added to database: 2022-08-17 23:20:18

Similar literature

1. Using scatterplots to understand and improve probabilistic models for text categorization and retrieval [J]. Giorgio Maria Di Nunzio. International Journal of Approximate Reasoning, 2009, Issue 7.
2. Improving Text Categorization By Using A Topic Model [J]. Wongkot Sriurai. Advanced Computing: an International Journal, 2011, Issue 6.
3. Classification of Text Documents Based on a Probabilistic Topic Model [J]. Scientific & Technical Information Processing, 2019, Issue 5.
4. An Improved Native Bayes Classifier for Imbalanced Text Categorization Based on K-Means and Chi-Square Feature Selection [C]. Fanbo Meng, Linying Xu. International Conference on Instrumentation and Measurement, Computer, Communication and Control, 2018.
5. Probabilistic Topic Modeling and Classification Probabilistic PCA for Text Corpora. [D]. Cheng, Chi Wa. 2011.
6. Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization [O]. Jieming Yang, Zhaoyang Qu, Zhiying Liu.
7. Using Scatterplots to Understand and Improve Probabilistic Models for Text Categorization and Retrieval [O]. Di Nunzio, G.M. 2009.
