Journal: Journal of Experimental & Theoretical Artificial Intelligence

Training SVM email classifiers using very large imbalanced dataset


Abstract

The Internet has been flooded with spam emails, and during the last decade there has been an increasing demand for reliable anti-spam email filters. The problem of filtering emails can be considered as a classification problem in the field of supervised learning. Theoretically, many mature technologies, for example, support vector machines (SVM), can be used to solve this problem. However, in real enterprise applications, the training data are typically collected via honeypots and thus are always of huge amounts and highly biased towards spam emails. This challenges both efficiency and effectiveness of conventional technologies. In this article, we propose an undersampling method to compress and balance the training set used for the conventional SVM classifier with minimal information loss. The key observation is that we can make a trade-off between training set size and information loss by carefully defining a similarity measure between data samples. Our experiments show that the SVM classifier provides a better performance by applying our compressing and balancing approach.
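The abstract does not say which similarity measure or selection procedure the authors use, so the following is only a minimal sketch of the general idea of similarity-based undersampling. It assumes cosine similarity over toy bag-of-words vectors and a single greedy pass; the names `cosine`, `compress`, and `compress_and_balance`, the threshold values, and the data are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress(samples, threshold):
    """Greedy pass: keep a sample only if its similarity to every
    already-kept sample is below `threshold` (an assumed rule)."""
    kept = []
    for sample in samples:
        if all(cosine(sample, k) < threshold for k in kept):
            kept.append(sample)
    return kept


def compress_and_balance(spam, ham, threshold=0.95):
    """Undersample only the majority (spam) class; ham is left untouched."""
    return compress(spam, threshold), list(ham)


# Made-up term-count vectors: spam heavily outnumbers ham.
spam = [[3, 0, 1], [6, 0, 2], [3, 0, 1], [0, 4, 1], [0, 8, 2]]
ham = [[1, 1, 5], [0, 2, 4]]

kept_spam, kept_ham = compress_and_balance(spam, ham, threshold=0.95)
print(len(spam), "spam samples reduced to", len(kept_spam))
print(len(kept_ham), "ham samples kept")
```

In this sketch the threshold plays the role of the trade-off mentioned in the abstract: lowering it discards more near-duplicates and shrinks the training set at the cost of more information loss, while raising it keeps more samples. The reduced set could then be passed to any conventional SVM trainer.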
