...

WORDS VERSUS CHARACTER N-GRAMS FOR ANTI-SPAM FILTERING



Abstract

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
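As a rough illustration of why a character n-gram representation can tolerate token-level obfuscation, the sketch below counts overlapping character n-grams without any tokenizer and compares them with a naive whitespace split. The function names, the fixed n = 3, the lowercasing step, and the two sample strings are assumptions made for illustration only; the paper's own feature extraction, including its variable-length n-gram method, is not reproduced here.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count all overlapping character n-grams in text (no tokenizer involved)."""
    text = text.lower()  # assumption: case folding, chosen for this sketch only
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_tokens(text):
    """Naive whitespace tokenizer, used here only as a point of comparison."""
    return Counter(text.lower().split())


# Hypothetical messages: the second one disguises a single word.
plain = "cheap viagra now"
obfuscated = "cheap v1agra now"

# Features that both messages have in common under each representation.
shared_words = set(word_tokens(plain)) & set(word_tokens(obfuscated))
shared_ngrams = set(char_ngrams(plain)) & set(char_ngrams(obfuscated))

print(sorted(shared_words))
print(sorted(shared_ngrams))
```

In this toy case the disguised word no longer matches as a word token, while some of its character trigrams still overlap with the undisguised spelling. The larger number of distinct n-grams also hints at the increased dimensionality the abstract mentions.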
