Journal: 《计算机仿真》 (Computer Simulation)

Research on a Web Page Duplicate Detection Technique Based on Web Page Fingerprints

     

Abstract

This paper studies the problem of detecting duplicate web pages. The traditional SCAM duplicate-detection algorithm decides whether pages are duplicates by comparing how many times a few keywords occur in each page. When a website contains similar pages, their keywords are very close, so the algorithm misjudges them and detection accuracy is low. This paper proposes a web page fingerprint duplicate-detection algorithm: information retrieval techniques are used to extract the fingerprint of the page under test, and that fingerprint is then compared with the fingerprints stored in the web page library to decide whether the page is a duplicate. This avoids the low accuracy that results from relying on only a few keywords. Experiments show that the fingerprint-based method can accurately determine whether a page is a duplicate, improves the accuracy of web page information, and achieves satisfactory results.
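The abstract does not say how a fingerprint is computed or compared, so the following is only a minimal sketch of the general idea (extract a fingerprint, then compare it against a library of stored fingerprints). Everything specific here is an assumption for illustration: the word-shingle and hash construction, the function names `fingerprint`, `similarity`, and `is_duplicate`, and the parameters `k`, `size`, and `threshold`. It also assumes whitespace-delimited text; Chinese pages would need a word segmenter first.

```python
import hashlib
import re


def fingerprint(text, k=3, size=64):
    """Hypothetical fingerprint: the `size` smallest hashes of all k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return frozenset()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Hash every shingle to a 64-bit integer (first 8 bytes of its MD5 digest).
    hashes = {
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }
    # Keep only the smallest hashes so every page has a bounded-size fingerprint.
    return frozenset(sorted(hashes)[:size])


def similarity(a, b):
    """Overlap of two fingerprints: shared hashes divided by all hashes."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(page_text, library, threshold=0.8):
    """Compare the page's fingerprint with each stored fingerprint in the library.

    `library` maps a page identifier to a fingerprint; `threshold` is an
    assumed cut-off, not a value taken from the paper.
    """
    fp = fingerprint(page_text)
    return any(similarity(fp, stored) >= threshold for stored in library.values())


# Usage: build a small library, then test a repeated page and an unrelated one.
page_a = "the quick brown fox jumps over the lazy dog"
library = {"page_a": fingerprint(page_a)}
print(is_duplicate(page_a, library))
print(is_duplicate("an unrelated page about search engine crawlers", library))
```

Because the fingerprint is built from overlapping word sequences rather than a few keyword counts, two pages that merely share keywords produce different shingles and therefore low overlap, which is the kind of misjudgment the abstract attributes to the keyword-based approach.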
