Machine learning for image spam detection: From server to client solution.

Abstract

Spam has become a public hazard for email users around the world. While spammers earn significant amounts of money by sending spam emails in bulk, globally they cause substantial economic loss to both individual and enterprise users by wasting valuable network resources. Although spam filtering technologies have advanced significantly, malicious spammers are constantly creating sophisticated new weapons in their arms race with anti-spam technologies, the latest of which is image spam.

Image spam is a type of email spam that embeds text content into graphical images to bypass traditional spam filters based on statistics of text characters. While ensuring that the embedded text remains readable, image spammers leverage a set of image processing technologies to vary the visual content of individual messages, e.g., by changing foreground colors, backgrounds, or font types, or even by rotating the images and adding artifacts to them. They thus pose great challenges to conventional spam filters, since detecting them requires partly solving visual recognition problems, which are in general difficult to address.

To effectively detect spam images, it is desirable to apply image content analysis technologies to identify them on both the server side and the client side. Because the behavior of image spammers is fundamentally adversarial, we extensively employ various machine learning technologies, ranging from unsupervised cluster analysis and semi-supervised or supervised classification to more interactive active learning algorithms, to effectively analyze the statistics of visual features. We are thus able to achieve a comprehensive spam filtering solution that meets different kinds of system and usage requirements. Compared to previous works, which mostly filter spam images on the client side, we present a more desirable comprehensive solution that embraces both server-side filtering and client-side detection to effectively mitigate image spam.

On the server side, we design and investigate several different image spam systems, depending on how much human labor can be expended to collect labeled data. In particular, when no manual labeling effort is available, we propose a nonnegative sparsity induced similarity metric for cluster analysis of spam images. When only a limited amount of labeled data is available, we propose a spam filtering system based on a novel semi-supervised algorithm, regularized discriminant EM (RDEM), which effectively utilizes the scarce labeled image data and the manifold structure of the unlabeled data for classification analysis. Finally, once enough labeled data has been accumulated, we can further leverage supervised machine learning algorithms such as the probabilistic boosting tree (PBT) to build a fully automated classifier for identifying spam images.

On the client side, we employ the principle of active learning, in which the learning machinery guides users to label as few images as possible while maximizing classification accuracy. We systematically present our study of two active learning algorithms, based on an SVM and a Gaussian process classifier respectively. The semi-supervised algorithm RDEM and the supervised algorithm PBT can also be applied on the client side when more labeled data, or a large amount of labeled data, can be collected.

The server-side filtering identifies suspicious spam sources, and further analysis can be performed to identify the real sources and block them from the beginning. For spam images that survive the server-side filtering, our active learner on the client side further guides users to filter them out interactively and efficiently. Our experiments on an image spam data set collected from the email server of our department demonstrate the efficacy of the proposed comprehensive solution.
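
The client-side idea of asking the user to label only the most informative images can be illustrated with margin-based uncertainty sampling. The sketch below is not the algorithm from the thesis. It assumes a linear SVM trained with Pegasos-style subgradient steps, toy 2-D vectors standing in for visual features, and a hypothetical `oracle` callback standing in for the user who supplies labels. All function and variable names are illustrative.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) with Pegasos-style subgradient steps.
    A constant 1.0 is appended to every input so the last weight acts as the bias."""
    rng = random.Random(seed)
    Xa = [x + [1.0] for x in X]
    w = [0.0] * len(Xa[0])
    order = list(range(len(Xa)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, Xa[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (L2 penalty)
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, Xa[i])]
    return w

def decision_value(w, x):
    """Signed score of x; its absolute value grows with distance from the boundary."""
    return sum(wj * xj for wj, xj in zip(w, x + [1.0]))

def most_uncertain(w, pool, candidates):
    """Pick the candidate index whose score is closest to zero."""
    return min(candidates, key=lambda i: abs(decision_value(w, pool[i])))

def active_learn(pool, oracle, seed_idx, budget):
    """Query `budget` labels one at a time, retraining after each answer."""
    labels = {i: oracle(i) for i in seed_idx}
    queried = list(seed_idx)
    for _ in range(budget):
        w = train_linear_svm([pool[i] for i in queried], [labels[i] for i in queried])
        candidates = [i for i in range(len(pool)) if i not in labels]
        if not candidates:
            break
        q = most_uncertain(w, pool, candidates)
        labels[q] = oracle(q)  # hypothetical user interaction
        queried.append(q)
    w = train_linear_svm([pool[i] for i in queried], [labels[i] for i in queried])
    return w, queried

# Toy data (assumption): 20 "spam" points near (2, 2) and 20 "ham" points near (-2, -2).
rng = random.Random(1)
pool = [[rng.gauss(2, 0.7), rng.gauss(2, 0.7)] for _ in range(20)]
pool += [[rng.gauss(-2, 0.7), rng.gauss(-2, 0.7)] for _ in range(20)]
truth = [1] * 20 + [-1] * 20

w, queried = active_learn(pool, lambda i: truth[i], seed_idx=[0, 20], budget=5)
print("labels requested:", len(queried), "of", len(pool))
```

Each round retrains on the labels gathered so far and asks about the unlabeled point the current model is least sure of, so the labeling effort concentrates near the decision boundary. A Gaussian process classifier could be substituted by replacing the margin with its predictive uncertainty.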

Bibliographic record

  • Author: Gao, Yan
  • Author affiliation: Northwestern University
  • Degree-granting institution: Northwestern University
  • Subject: Engineering Computer; Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 105
  • Original format: PDF
  • Language: English
