Artificial Intelligence Review: An International Science and Engineering Journal

Textual case-based reasoning for spam filtering: a comparison of feature-based and feature-free approaches

Abstract

Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach.
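
The feature-free idea can be illustrated with a small sketch. The abstract does not give the exact formula, so the code below assumes one common compression-based measure, the normalised compression distance (NCD), and uses Python's zlib (DEFLATE, the same Lempel-Ziv family as GZip) as the compressor. The function names, the toy case base, and the value of k are hypothetical and are not taken from the ECUE system.

```python
import zlib
from collections import Counter


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance (assumed measure, see lead-in).

    If y shares many substrings with x, compressing x + y costs little
    more than compressing x alone, so the distance is small.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """Majority vote among the k cases nearest to query under ncd.

    case_base holds (email_text, label) pairs. No features are
    extracted; the raw text is the case representation.
    """
    neighbours = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy case base, for illustration only.
CASES = [
    ("Win a free prize now! Click here to claim your free prize money today.", "spam"),
    ("The project meeting is moved to Thursday; please send the quarterly report beforehand.", "ham"),
]

if __name__ == "__main__":
    incoming = "Click here to claim your free prize money today, win now!"
    print(classify(incoming, CASES, k=1))
```

Every classification compresses the query together with each stored case, which is consistent with the abstract's remark that the feature-free systems take much longer to classify emails and that removing cases from the case base reduces classification time.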