International Journal of Intelligent Systems and Applications

A link and Content Hybrid Approach for Arabic Web Spam Detection

Abstract

Some Web site developers act as spammers, trying to mislead search engines with illegal Search Engine Optimization (SEO) techniques that raise the rank of their Web documents so that they appear among the top 10 results on the search engine results page (SERP). Their aim is to attract more visitors for marketing and commercial purposes. This study continues a series of Arabic Web spam studies conducted by the authors and is dedicated to building the first Arabic content/link Web spam detection system. This novel system can extract the content and link features of Web pages in order to build the largest Arabic Web spam dataset. The constructed dataset contains three groups with the following percentages of spam content: 2%, 30%, and 40%. These three groups were collected through the crawler embedded in the proposed system. Spam Web pages are classified automatically based on the features in the benchmark dataset. The proposed system uses the rules of a Decision Tree, which is considered the best classifier for detecting Arabic content/link Web spam. The proposed system helps to clean the SERP of all URLs referring to Arabic spam Web pages. Based on the collected dataset and the conducted analysis, it achieves an accuracy of 90.1099% for Arabic content-based spam, 93.1034% for Arabic link-based spam, and 89.011% for detecting combined Arabic content and link Web spam.
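
The pipeline outlined in the abstract (extract content and link features from a page, then apply decision rules) can be illustrated with a minimal Python sketch. The feature names, thresholds, and rules below are hypothetical and chosen only for illustration. The abstract does not list the system's actual features or the rules of its Decision Tree, and in practice such rules would be learned from the labeled dataset rather than written by hand.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PageFeatures:
    """Hypothetical features; not the feature set used in the paper."""
    word_count: int              # content feature: number of words on the page
    top_word_ratio: float        # content feature: share of the most frequent word
    link_count: int              # link feature: number of outgoing links
    external_link_ratio: float   # link feature: share of links to other hosts


def extract_features(url: str, text: str, links: list[str]) -> PageFeatures:
    """Compute simple content and link features for one page."""
    words = text.split()
    top = Counter(words).most_common(1)[0][1] if words else 0
    host = urlparse(url).netloc
    # A link counts as external when it names a host other than the page's own.
    external = sum(1 for link in links if urlparse(link).netloc not in ("", host))
    return PageFeatures(
        word_count=len(words),
        top_word_ratio=top / len(words) if words else 0.0,
        link_count=len(links),
        external_link_ratio=external / len(links) if links else 0.0,
    )


def classify(f: PageFeatures) -> str:
    """Apply hand-written, tree-style rules (illustrative thresholds only)."""
    # Content rule: one word dominating a reasonably long page suggests keyword stuffing.
    if f.top_word_ratio > 0.3 and f.word_count >= 20:
        return "spam"
    # Link rule: many links that almost all point off-site suggest a link farm.
    if f.link_count >= 10 and f.external_link_ratio > 0.8:
        return "spam"
    return "non-spam"


# Usage: a page that repeats a single Arabic keyword 30 times.
keyword = "عروض"
stuffed_text = " ".join([keyword] * 30)
print(classify(extract_features("http://example.com/a", stuffed_text, [])))
```

Keeping feature extraction separate from classification mirrors the two stages named in the abstract, so the hand-written rules could be replaced by a trained classifier without changing how features are computed.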