Identifying Spam Web Pages Based on Content Similarity

Abstract

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines face an annoying problem: misleading Web pages, i.e., spam Web pages, are ranked among legitimate ones. The mixed results degrade the performance of search engines and frustrate users, who must filter out useless information. To improve the quality of Web searches, the number of spam pages on the Web must be reduced, even if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since it accurately detects 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
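
The abstract does not give the scoring formula, so the following is only a minimal Python sketch of the general idea: compare a page's title with its body using a precomputed word-correlation lookup, and evaluate a detector with the F-measure. The table `CORRELATION`, its values, and the functions `tokenize`, `correlation`, and `title_body_similarity` are hypothetical names introduced for illustration; averaging the best match per title word is an assumption, not the authors' method. Only `f_measure` follows a standard definition (the harmonic mean of precision and recall).

```python
import re

# Hypothetical precomputed word-correlation factors in [0, 1].
# These values are illustrative only; they do not come from the paper.
CORRELATION = {
    frozenset(("cheap", "discount")): 0.8,
    frozenset(("flight", "airline")): 0.9,
    frozenset(("flight", "travel")): 0.7,
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def correlation(a, b):
    """Look up the correlation of two words; identical words score 1.0."""
    if a == b:
        return 1.0
    return CORRELATION.get(frozenset((a, b)), 0.0)


def title_body_similarity(title, body):
    """Average, over title words, of the best correlation with any body word.

    Assumed scoring rule: a low value suggests the title and body are
    mismatched, which the abstract names as one signal of spam.
    """
    title_words = tokenize(title)
    body_words = set(tokenize(body))
    if not title_words or not body_words:
        return 0.0
    best = [max(correlation(t, b) for b in body_words) for t in title_words]
    return sum(best) / len(best)


def f_measure(precision, recall):
    """Standard F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the lookup table is built ahead of time, scoring a page costs only dictionary lookups, which is consistent with the abstract's point that precomputed factors keep the analysis inexpensive. A threshold on the similarity score would have to be chosen empirically; the abstract does not state one.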
