Journal: Pattern Recognition Letters

Spam detection using Random Boost

Abstract

This paper proposes two alternative methods of random projections and compares their performance for robust and efficient spam detection when trained using a small number of examples. Robustness refers to learning and adaptation leading to a high level of performance despite data variability, while efficiency is concerned with (i) the complexity of the detection method employed and (ii) the amount of resources used for training and retraining. The first method, Random Project, employs a random projection matrix to produce linear combinations of input features, while the second method, Random Boost, employs random feature selection to enhance the performance of the Logit Boost algorithm. Random Boost is, in effect, a combination of Logit Boost and Random Forest. Experimental results, using TREC and CEAS as challenging spam benchmark sets, show that the Random Boost method significantly improves the performance of the spam filter compared to the Logit Boost algorithm (e.g., a 5% increase in AUC, the area under the Receiver Operating Characteristic curve), and yields classification accuracy similar to that of the Random Forest method while using only one-fourth of the runtime complexity of the Random Forest algorithm. Random Boost also reduces training time by two orders of magnitude compared to Logit Boost, which becomes important during retraining on ever-changing data streams, including adapting to adversarial tactics and "noise" injected by spammers.
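
The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes a Gaussian projection matrix for Random Project, and for Random Boost it assumes two-class LogitBoost with regression stumps where each round searches only a random subset of features. The function names (`random_projection`, `random_boost`, `score`, `auc`), the stump weak learner, and the clipping constants are illustrative choices, not taken from the paper.

```python
import math
import random

def random_projection(rows, k, rng):
    """Map each d-dimensional row to k random linear combinations of its features."""
    d = len(rows[0])
    # Assumption: Gaussian entries with variance 1/k (a common choice).
    R = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
    return [[sum(r[j] * x[j] for j in range(d)) for r in R] for x in rows]

def prob(f):
    """LogitBoost link p = 1 / (1 + exp(-2F)); F is clamped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-2.0 * max(-30.0, min(30.0, f))))

def wmean(idx, z, w):
    """Weighted mean of z over the given indices (0.0 if the set is empty)."""
    sw = sum(w[i] for i in idx)
    return sum(w[i] * z[i] for i in idx) / sw if sw > 0 else 0.0

def fit_stump(X, z, w, features):
    """Weighted least-squares regression stump, searched over `features` only."""
    best = None
    for j in features:
        for t in sorted({x[j] for x in X}):
            left = [i for i, x in enumerate(X) if x[j] <= t]
            right = [i for i, x in enumerate(X) if x[j] > t]
            a, b = wmean(left, z, w), wmean(right, z, w)
            err = sum(w[i] * (z[i] - a) ** 2 for i in left)
            err += sum(w[i] * (z[i] - b) ** 2 for i in right)
            if best is None or err < best[0]:
                best = (err, j, t, a, b)
    return best[1:]

def random_boost(X, y, rounds, n_sub, rng):
    """LogitBoost with stumps; each round sees a random subset of n_sub features."""
    d = len(X[0])
    F = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        p = [prob(f) for f in F]
        w = [max(pi * (1.0 - pi), 1e-10) for pi in p]
        # Working response, clipped to [-4, 4] for numerical stability.
        z = [max(-4.0, min(4.0, (yi - pi) / wi)) for yi, pi, wi in zip(y, p, w)]
        feats = rng.sample(range(d), n_sub)  # random feature selection
        j, t, a, b = fit_stump(X, z, w, feats)
        stumps.append((j, t, a, b))
        F = [f + 0.5 * (a if x[j] <= t else b) for f, x in zip(F, X)]
    return stumps

def score(stumps, x):
    """Additive score F(x); a positive value predicts the positive (spam) class."""
    return sum(0.5 * (a if x[j] <= t else b) for j, t, a, b in stumps)

def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered pos/neg pairs."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Restricting each round to `n_sub` features shrinks the stump search from all `d` features to a small subset, which is one plausible source of the training-time savings the abstract reports. The exact subset size and weak learner used in the paper are not stated here.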