Spam E-Mail Classification by Utilizing N-Gram Features of Hyperlink Texts


Abstract

With the advent of the Internet and the falling cost of digital communication, spam has become a key problem in several types of media (e.g., email, social media and microblogs). In recent years, email spam in particular has become an exponentially growing threat that affects both individuals and businesses. Hence, a large number of studies have been proposed to combat spam emails. In this study, instead of the subject or body components of emails, the exclusive use of hyperlink texts together with a word-level n-gram indexing schema is proposed for the first time to generate features for a spam/ham email classifier. Since the length of link texts in emails does not exceed sentence level, we limit the n-gram indexing to the trigram schema at most. Throughout the study, a novel large-scale dataset provided by COMODO Inc, covering 50,000 link texts belonging to spam and ham emails, has been used for feature extraction and performance evaluation. To generate the required vocabularies, unigram, bigram and trigram models have been built. Next, three different machine learning methods (Support Vector Machines, SVM-Pegasos and Naive Bayes), including one active learner, have been employed to classify each link. According to the experimental results, classification using the trigram-based bag-of-words representation reaches up to 98.75% accuracy, outperforming the unigram and bigram schemas. Apart from its high accuracy, the proposed approach also preserves the privacy of customers, since it does not require any analysis of the body contents of emails.
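The feature pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`word_ngrams`, `bag_of_ngrams`, `NaiveBayes`), the whitespace tokenization, the add-one smoothing and the toy link texts are all assumptions made for illustration. Only the word-level n-grams capped at trigrams, the bag-of-words representation and the use of Naive Bayes come from the abstract.

```python
from collections import Counter
import math


def word_ngrams(text, n):
    """Return the word-level n-grams of a link text as space-joined strings."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=3):
    """Count all 1..max_n-grams of a link text (trigram cap as in the abstract)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over n-gram counts."""

    def fit(self, texts, labels, max_n=3):
        self.max_n = max_n
        self.class_docs = Counter(labels)
        self.feature_counts = {label: Counter() for label in self.class_docs}
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(bag_of_ngrams(text, max_n))
        # The vocabulary is the union of n-grams seen in any class.
        self.vocabulary = set()
        for counts in self.feature_counts.values():
            self.vocabulary.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            score = math.log(self.class_docs[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocabulary)
            for gram, freq in bag_of_ngrams(text, self.max_n).items():
                if gram in self.vocabulary:  # skip n-grams unseen in training
                    score += freq * math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up link texts; they are not taken from the COMODO dataset.
train_texts = ["claim your free prize now", "win free money today",
               "view the meeting agenda", "read the project report"]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_texts, train_labels)
print(word_ngrams("claim your free prize", 2))
print(model.predict("free prize today"))
```

Because only the anchor text of each link is tokenized, a classifier of this kind never reads the message body, which is the privacy property the abstract points to.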


Bibliographic record

Source
International Conference on Application of Information and Communication Technologies | 2017 | 441p | 5 pages
Authors
A. Selman Bozkir; Esra Sahin; Murat Aydos; Ebru Akcapinar Sezer; Fatih Orhan
Original format: PDF
Chinese Library Classification: TN91-53
Keywords
Bayes methods; data mining; electronic mail; learning (artificial intelligence); pattern classification; support vector machines; text analysis; unsolicited e-mail; Web sites


