Towards Web Spam Filtering Using a Classifier Based on the Minimum Description Length Principle

Abstract

The steady growth and popularization of the Web has led spammers to develop techniques that circumvent search engines in order to gain good visibility for their web pages in search results. These techniques are responsible for serious problems such as user dissatisfaction, irritation, exposure to unpleasant or malicious content, and financial loss. Although different machine learning approaches have been used to detect web spam, many of them suffer from the curse of dimensionality or require a very high computational cost, which impedes their use in real scenarios. Consequently, considerable effort is still being devoted to developing more advanced methods that both prevent overfitting and learn quickly. To fill this gap, we present MDLClass, a classification technique based on the minimum description length principle, applied to web spam filtering. The proposed method is very efficient, lightweight, multi-class, and fast. We also evaluate a new approach to detecting web spam that combines the predictions obtained by classifiers using content-based, link-based, and transformed link-based features. In our experiments, we employed two real, public, and large datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007. The results indicate that the proposed MDLClass and the ensemble of predictions using different types of features are promising for the task of web spam filtering.
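The abstract does not specify how MDLClass computes description lengths or how the per-feature predictions are combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes discrete tokens as features, a Laplace-smoothed per-class frequency model, and a simple majority vote across feature views; the names `MDLSketchClassifier`, `description_length`, and `majority_vote` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class MDLSketchClassifier:
    """Toy MDL-style classifier over discrete tokens (not the paper's MDLClass)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # smoothing constant (assumed value)
        self.counts = defaultdict(Counter)  # token counts per class
        self.totals = Counter()             # total tokens seen per class
        self.vocab = set()                  # all tokens seen in training

    def fit(self, samples, labels):
        for tokens, label in zip(samples, labels):
            self.counts[label].update(tokens)
            self.totals[label] += len(tokens)
            self.vocab.update(tokens)
        return self

    def description_length(self, tokens, label):
        # Bits needed to encode the tokens under the class's smoothed model.
        # The "+ 1" reserves probability mass for unseen tokens.
        denom = self.totals[label] + self.alpha * (len(self.vocab) + 1)
        return sum(
            -math.log2((self.counts[label][t] + self.alpha) / denom)
            for t in tokens
        )

    def predict(self, tokens):
        # Pick the class whose model yields the shortest description.
        return min(self.counts, key=lambda c: self.description_length(tokens, c))


def majority_vote(predictions):
    """Combine predictions from several feature views by majority vote."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
train = [
    (["cheap", "pills", "casino", "win"], "spam"),
    (["casino", "bonus", "win", "cheap"], "spam"),
    (["university", "research", "library"], "normal"),
    (["library", "opening", "hours", "research"], "normal"),
]
clf = MDLSketchClassifier().fit([t for t, _ in train], [y for _, y in train])
print(clf.predict(["cheap", "casino", "bonus"]))
print(majority_vote(["spam", "normal", "spam"]))
```

In this sketch, a sample is assigned to the class whose model encodes it in the fewest bits, which is the intuition behind minimum description length classification. A real system on WEBSPAM-UK2006 or WEBSPAM-UK2007 would need to handle the numeric content and link features those datasets provide, which this toy version does not attempt.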

Bibliographic record

Source
IEEE International Conference on Machine Learning and Applications | 2016 | pp. 470-475 | 6 pages
Authors
Renato M. Silva; Tiago A. Almeida; Akebo Yamakami;
Original format: PDF
Keywords
Web pages; Feature extraction; Training; Data models; Search engines; Computational efficiency; Computational modeling;

Similar documents

1. Clustgrams: an extension to histogram densities based on the minimum description length principle [J]. Panu Luosto, Petri Kontkanen. Open Computer Science. 2011, No. 4
2. Clustering of a set of identified points on images of dynamic scenes, based on the principle of minimum description length [J]. Peterson M.V. Journal of optical technology. 2010, No. 11
3. Histograms based on the minimum description length principle [J]. Hai Wang, Kenneth C. Sevcik. The VLDB journal. 2008, No. 3
4. Towards Web Spam Filtering using a Classifier based on the Minimum Description Length Principle [C]. Renato M. Silva, Akebo Yamakami, Tiago A. Almeida. IEEE International Conference on Machine Learning and Applications. 2016
5. Context-free graph grammar induction using the minimum description length principle [D]. Jonyer, Istvan. 2003
6. Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers [O]. Mansour Alsaleh, Abdulrahman Alarifi
7. MVGL analyser for multi-classifier based spam filtering system [O]. Islam, Md Rafiqul, Zhou, Wanlei, Chowdhury, Morshed U. 2009
8. Low-Rank Data Modeling via the Minimum Description Length Principle [R]. Ramirez, I., Sapiro, G. 2011