Journal: Knowledge and Information Systems

Online active multi-field learning for efficient email spam filtering

Abstract

Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach based on three ideas: (1) email spam filtering is an online application, which suggests online learning; (2) an email document has a multi-field text structure, which suggests multi-field learning; and (3) obtaining a label is costly for a real-world email spam filter, which suggests active learning. The online learner treats email spam filtering as incremental, supervised, binary classification of streaming text. The multi-field learner combines the results predicted by the field classifiers using a novel compound weight schema, and each field classifier computes the arithmetic mean of multiple conditional probabilities, which are calculated from feature strings using a string-frequency index data structure. The active learner compares the current variance of the field classification results with the historical variance to evaluate classification confidence, and treats the more uncertain emails as the more informative samples for which to request a label. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements and very low space-time costs. Measured by the standard (1-ROCA)% metric, the performance of our online active multi-field learning even exceeds the full-feedback performance of some advanced individual text classification algorithms.
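
The three components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the character 4-gram features, the Laplace smoothing, the equal field weights (a stand-in for the paper's compound weight schema, which the abstract does not specify), and the rule "query when the current variance is at least the running mean variance" are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import pvariance


def char_ngrams(text, n=4):
    """Feature strings: overlapping character n-grams (an assumed feature choice)."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 0))]


class FieldClassifier:
    """Per-field classifier backed by a string-frequency index (hypothetical design)."""

    def __init__(self):
        # feature string -> [ham count, spam count]
        self.index = defaultdict(lambda: [0, 0])

    def score(self, features):
        """Arithmetic mean of smoothed P(spam | feature) over the feature strings."""
        if not features:
            return 0.5  # no evidence: stay neutral
        probs = []
        for f in features:
            ham, spam = self.index.get(f, (0, 0))
            probs.append((spam + 1) / (ham + spam + 2))  # Laplace smoothing (assumption)
        return sum(probs) / len(probs)

    def update(self, features, is_spam):
        """Incremental (online) update of the index with one labeled example."""
        for f in features:
            self.index[f][int(is_spam)] += 1


class OnlineActiveMultiFieldFilter:
    """Combines field classifiers and requests labels only for uncertain emails."""

    def __init__(self, fields, threshold=0.5):
        self.classifiers = {name: FieldClassifier() for name in fields}
        self.threshold = threshold
        self.variance_sum = 0.0  # running sum of per-email score variances
        self.seen = 0

    def process(self, email, oracle):
        """Classify one email; call oracle() for its true label only when uncertain."""
        feats = {name: char_ngrams(email.get(name, "")) for name in self.classifiers}
        scores = [clf.score(feats[name]) for name, clf in self.classifiers.items()]
        # Equal weights here; the paper's compound weight schema is not reproduced.
        combined = sum(scores) / len(scores)
        variance = pvariance(scores)
        historical = self.variance_sum / self.seen if self.seen else 0.0
        self.variance_sum += variance
        self.seen += 1
        # Fields disagree at least as much as usual -> treat the email as uncertain.
        queried = variance >= historical
        if queried:
            label = oracle()
            for name, clf in self.classifiers.items():
                clf.update(feats[name], label)
        return combined > self.threshold, queried
```

In use, each incoming message would be passed as a dictionary of field texts, for example `flt.process({"subject": "...", "body": "..."}, ask_user)`, where `ask_user` is a hypothetical callback that returns `True` for spam. The callback is invoked only for emails the filter considers uncertain, which is how the label cost is reduced.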
