WORDS VERSUS CHARACTER N-GRAMS FOR ANTI-SPAM FILTERING

IOANNIS KANARIS; KONSTANTINOS KANARIS; IOANNIS HOUVARDAS; EFSTATHIOS STAMATATOS

Abstract

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.
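
The contrast between the two representations can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the fixed n = 3, and the example strings are all assumptions chosen for illustration, and the variable-length n-gram method mentioned in the abstract is not reproduced here. The sketch only shows why punctuation inserted inside words breaks word tokens while leaving some character n-grams intact.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping substring of length n; no tokenizer is involved."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_tokens(text: str) -> Counter:
    """Naive whitespace tokenizer (an assumption), used only for comparison."""
    return Counter(text.lower().split())


def overlap(a: Counter, b: Counter) -> int:
    """Number of feature occurrences shared by two bags of features."""
    return sum((a & b).values())


# Hypothetical example strings: the second mimics a tokenizer-confusing message.
clean = "free money now"
obfuscated = "fr.ee mo-ney n_ow"

shared_words = overlap(word_tokens(clean), word_tokens(obfuscated))
shared_trigrams = overlap(char_ngrams(clean, 3), char_ngrams(obfuscated, 3))

print("shared word tokens:", shared_words)
print("shared character 3-grams:", shared_trigrams)
```

In this toy case the obfuscated message shares no word tokens with the clean one, but several character 3-grams survive, at the cost of a larger feature space.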

Bibliographic details

Source
International Journal of Artificial Intelligence Tools: Architectures, Languages, Algorithms, 2007, Issue 6, 21 pages
Authors
IOANNIS KANARIS; KONSTANTINOS KANARIS; IOANNIS HOUVARDAS; EFSTATHIOS STAMATATOS
Original format: PDF
Language of text: English
Chinese Library Classification: Artificial intelligence theory
Keywords
Anti-spam filtering; Machine learning; n-grams

Similar literature

1. WORDS VERSUS CHARACTER N-GRAMS FOR ANTI-SPAM FILTERING [J]. IOANNIS KANARIS, KONSTANTINOS KANARIS, IOANNIS HOUVARDAS. International Journal of Artificial Intelligence Tools: Architectures, Languages, Algorithms. 2007, Issue 6
2. Measuring similarity between Karel programs using character and word n-grams [J]. Sidorov G., Ibarra Romero M., Markov I. Programming and Computer Software. 2017, Issue 1
3. Automatic Word Spacing Using Probabilistic Models Based on Character n-grams [J]. Do-Gil Lee, Hae-Chang Rim, Dongsuk Yook. IEEE Intelligent Systems. 2007
4. A Comparative Study of Likelihood Ratio Based Forensic Text Comparison Procedures: Multivariate Kernel Density with Lexical Features vs. Word N-grams vs. Character N-grams [C]. Ishihara Shunichi. Cybercrime and Trustworthy Computing Workshop. 2015
5. Adaptive anti-spam e-mail filtering using Huffman coding and statistical learning [D]. Nerellapalli, Praveen R. 2005
6. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents [O]. Deepak Agnihotri, Kesari Verma, Priyanka Tripathi
7. An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages [O]. Androutsopoulos, Ion; Koutsias, John; Chandrinos, Konstantinos V. 2000
