International Conference on Electrical Engineering and Informatics

Handling imbalanced dataset in multi-label text categorization using Bagging and Adaptive Boosting



Abstract

Imbalanced datasets occur because data in the real world are unevenly distributed, as in the disposition of complaints to government offices in Bandung. Consequently, multi-label text categorization algorithms may not deliver their best performance, because classifiers tend to be dominated by the majority of the data and to ignore the minority. In this paper, the Bagging and Adaptive Boosting algorithms are employed to handle this issue and improve the performance of text categorization. The results are evaluated with four metrics: hamming loss, subset accuracy, example-based accuracy, and micro-averaged f-measure. Bagging.ML-LP with an SMO weak classifier performs best in terms of subset accuracy and example-based accuracy. Bagging.ML-BR with an SMO weak classifier has the best micro-averaged f-measure overall. On the other hand, AdaBoost.MH with a J48 weak classifier achieves the lowest hamming loss. Thus, both algorithms have high potential to boost the performance of text categorization, but only with certain weak classifiers. However, bagging shows more potential than adaptive boosting in increasing the accuracy of minority labels.
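The abstract names four multi-label evaluation metrics. A minimal sketch of their standard definitions, assuming labels are encoded as binary indicator vectors (one list per example); the implementation below follows the textbook formulas, not code from the paper:

```python
# Multi-label metrics over binary indicator vectors.
# Y_true / Y_pred: list of examples, each a list of 0/1 label flags.

def hamming_loss(Y_true, Y_pred):
    # Fraction of label positions predicted incorrectly,
    # averaged over all examples and labels (lower is better).
    n, q = len(Y_true), len(Y_true[0])
    wrong = sum(t != p
                for yt, yp in zip(Y_true, Y_pred)
                for t, p in zip(yt, yp))
    return wrong / (n * q)

def subset_accuracy(Y_true, Y_pred):
    # Fraction of examples whose predicted label set matches exactly.
    return sum(yt == yp for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)

def example_based_accuracy(Y_true, Y_pred):
    # Jaccard overlap |T ∩ P| / |T ∪ P|, averaged over examples.
    total = 0.0
    for yt, yp in zip(Y_true, Y_pred):
        inter = sum(t and p for t, p in zip(yt, yp))
        union = sum(t or p for t, p in zip(yt, yp))
        total += inter / union if union else 1.0
    return total / len(Y_true)

def micro_f1(Y_true, Y_pred):
    # F-measure from true/false positive and negative counts
    # pooled globally across all examples and labels.
    pairs = [(t, p)
             for yt, yp in zip(Y_true, Y_pred)
             for t, p in zip(yt, yp)]
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

For example, with `Y_true = [[1, 0, 1], [0, 1, 0]]` and `Y_pred = [[1, 0, 0], [0, 1, 0]]` (toy data, not from the paper), one label position out of six is wrong, so the hamming loss is 1/6, while only the second example matches exactly, giving a subset accuracy of 0.5.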
