Conference: International Conference on Knowledge Science, Engineering and Management

Imbalanced Web Spam Classification Using Self-labeled Techniques and Multi-classifier Models


Abstract

Web spam has become a critical problem in web search. Highly imbalanced class distributions and large numbers of unlabeled instances degrade classifier performance. In this paper, we address the severe class imbalance of web spam within a semi-supervised learning framework. First, we introduce self-labeled techniques and the multi-classifier model. Second, we describe the imbalance of web spam data sets and propose five combination methods. In particular, we propose several improved self-labeled methods that apply the classic over-sampling technique SMOTE in the pre-processing stage to balance the uneven labeled sets. Further, given the severe imbalance of web spam, we introduce the AUC value into semi-supervised classification. Experiments on WEBSPAM-UK2007 indicate that our methods achieve better recall and AUC values.
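The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea it describes: balance the labeled set with SMOTE-style over-sampling, run a simple self-labeling loop over the unlabeled instances, and evaluate with AUC. The nearest-centroid classifier is a stand-in chosen for brevity, not the paper's multi-classifier model, and all function names and parameters (`smote`, `self_train`, `rounds`, `per_round`) are hypothetical.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def spam_score(x, spam_c, ham_c):
    """Positive when x is closer to the spam centroid than to the ham one."""
    d_spam = sum((a - b) ** 2 for a, b in zip(x, spam_c))
    d_ham = sum((a - b) ** 2 for a, b in zip(x, ham_c))
    return d_ham - d_spam

def self_train(spam, ham, unlabeled, rounds=3, per_round=2):
    """Assumed pipeline: over-sample the spam class first, then repeatedly
    move the most confidently scored unlabeled points into the labeled sets."""
    spam, ham, pool = list(spam), list(ham), list(unlabeled)
    if 2 <= len(spam) < len(ham):
        spam += smote(spam, len(ham) - len(spam))  # pre-processing balance step
    for _ in range(rounds):
        if not pool:
            break
        sc, hc = centroid(spam), centroid(ham)
        pool.sort(key=lambda x: abs(spam_score(x, sc, hc)), reverse=True)
        confident, pool = pool[:per_round], pool[per_round:]
        for x in confident:
            (spam if spam_score(x, sc, hc) > 0 else ham).append(x)
    return centroid(spam), centroid(ham)

def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

AUC suits this setting because it depends only on how spam pages rank relative to non-spam pages, so a classifier cannot score well simply by predicting the majority class, as it can with plain accuracy.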
