The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures and, at the same time, investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a keyword-pattern filter that is part of a widely used e-mail reader.
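To make the approach concrete, the following is a minimal sketch of a Naive Bayesian spam filter with a cost-sensitive decision threshold. It is an illustration under stated assumptions, not the implementation evaluated here: the word-presence (Bernoulli) model, add-one smoothing, whitespace tokenization, the function names, and the `cost_ratio` parameter are all hypothetical choices made for the example. The threshold follows from standard decision theory: if blocking a legitimate message is taken to be `cost_ratio` times as costly as letting a spam message through, a message should be blocked only when its spam probability exceeds `cost_ratio / (1 + cost_ratio)`.

```python
import math
from collections import Counter


def tokenize(text):
    """Return the set of lowercase whitespace-separated tokens (assumed tokenizer)."""
    return set(text.lower().split())


def train(messages):
    """Count, per class, how many messages contain each word.

    messages: list of (text, label) pairs with label 'spam' or 'legit'.
    Both classes must appear at least once in the training data.
    """
    doc_counts = Counter()
    word_counts = {"spam": Counter(), "legit": Counter()}
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["legit"])
    return doc_counts, word_counts, vocab


def spam_probability(model, text):
    """Estimate P(spam | message) under the naive independence assumption."""
    doc_counts, word_counts, vocab = model
    total = sum(doc_counts.values())
    present = tokenize(text)
    log_scores = {}
    for label in ("spam", "legit"):
        score = math.log(doc_counts[label] / total)  # class prior
        for word in vocab:
            # Add-one smoothed probability that a message of this class contains the word.
            p = (word_counts[label][word] + 1) / (doc_counts[label] + 2)
            score += math.log(p if word in present else 1.0 - p)
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long vocabularies.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["legit"])


def is_spam(model, text, cost_ratio=1.0):
    """Block a message only if its spam probability exceeds the cost-sensitive threshold."""
    threshold = cost_ratio / (1.0 + cost_ratio)
    return spam_probability(model, text) > threshold
```

With `cost_ratio=1.0` the filter blocks any message that is more likely spam than not; larger values make it progressively more reluctant to block, which reflects the asymmetry between losing a legitimate message and receiving a spam message.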