Conference: 2010 International Conference on Web Information Systems and Mining

Improving Anti-spam Engine with Large Imbalanced Dataset Using Information Retrieval Technology

Abstract

Anti-spam technology typically relies on machine learning to identify spam emails. Unfortunately, the email samples used to build machine learning models are rarely ideal: spam emails far outnumber normal ones, which may lead to biased models and unsatisfactory prediction performance. In addition, the sheer number of email samples makes the training process prohibitively resource-intensive and makes the samples difficult for human engineers to sort. In this paper, we propose an approach based on information retrieval technology to compress and balance the training data set. The key idea is to shrink and balance the training data set by removing similar data using information retrieval technology. Experiments show that an anti-spam classifier can achieve better performance with a much smaller and more balanced training data set when this approach is applied.
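The abstract does not say which similarity measure or retrieval engine the authors use, so the following is only a minimal sketch of the general idea of shrinking a training set by dropping near-duplicate emails. It assumes TF-IDF vectors with cosine similarity, a greedy pass within each class, and a threshold of 0.8. All of these choices, and every function name, are illustrative assumptions rather than the paper's method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF; this weighting is an assumed, common variant.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def drop_near_duplicates(emails, labels, threshold=0.8):
    """Greedy pass: keep an email only if no already-kept email of the
    same class has cosine similarity >= threshold (hypothetical value)."""
    vectors = tfidf_vectors(emails)
    kept = []
    for i, vec in enumerate(vectors):
        if all(labels[j] != labels[i] or cosine(vec, vectors[j]) < threshold
               for j in kept):
            kept.append(i)
    return kept


# Toy data, invented for illustration only.
emails = [
    "Win a free prize now, click here",
    "Win a free prize now, click here!!!",
    "win a FREE prize now click here",
    "Meeting moved to 3pm tomorrow",
    "Cheap loans approved instantly, apply today",
]
labels = ["spam", "spam", "spam", "ham", "spam"]

kept = drop_near_duplicates(emails, labels)
# kept == [0, 3, 4]: the two re-worded copies of the first spam are dropped.
```

Because the pass runs within each class, a class with many near-identical samples shrinks more than one with varied samples. If spam is the more redundant class, as this sketch assumes, the reduced set is both smaller and closer to balanced. The pairwise comparison is quadratic in the number of kept samples, so a large corpus would need an inverted index or similar retrieval structure instead.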
