Journal: 《科学技术与工程》 (Science Technology and Engineering)

An Improved DSC Web Page Deduplication Algorithm Based on Feature Vectors

Abstract

Removing duplicate web pages can improve the performance of search engines, and web page deduplication has been an active research topic in information retrieval in recent years. This paper analyzes the main existing methods for detecting duplicate web pages and improves the classic DSC (digital syntactic clustering) deduplication algorithm. A set of feature vectors is generated for each document and used to filter the document's shingles; the similarity of two documents is then compared on the filtered shingles. Experimental results show that the algorithm achieves good precision and recall in identifying duplicate web pages.
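The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea (shingling, filtering shingles by a per-document feature set, then comparing similarity), not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the use of the `k` most frequent tokens as a stand-in for the paper's feature vectors, the shingle width `w`, and the use of Jaccard resemblance as the similarity measure. Documents are assumed to be already tokenized (for Chinese text, already word-segmented).

```python
from collections import Counter


def shingles(tokens, w=3):
    """Return the set of contiguous w-token shingles of a token list."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def feature_terms(tokens, k=5):
    """Hypothetical feature set: the k most frequent tokens.

    Ties are broken alphabetically so the result is deterministic.
    The paper's actual feature vectors are not described in the abstract.
    """
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:k])


def filtered_shingles(tokens, w=3, k=5):
    """Keep only the shingles that contain at least one feature term."""
    feats = feature_terms(tokens, k)
    return {s for s in shingles(tokens, w) if feats.intersection(s)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def is_duplicate(tokens_a, tokens_b, threshold=0.8, w=3, k=5):
    """Flag two documents as near-duplicates above an assumed threshold."""
    sa = filtered_shingles(tokens_a, w, k)
    sb = filtered_shingles(tokens_b, w, k)
    return resemblance(sa, sb) >= threshold
```

Filtering before comparison reduces the number of shingles that must be stored and compared for each document; the threshold of 0.8 is an arbitrary placeholder and would need to be tuned on real data.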
