Removing duplicate web pages is of considerable practical value: it can improve the performance of search engines, and it has been an active research topic in information retrieval in recent years. This paper analyzes existing duplicate web page detection algorithms and improves the classic DSC (digital syntactic clustering) algorithm. A set of feature vectors is generated for each document and used to select shingles, after which the similarity of the documents is compared. Experiments show that the algorithm achieves good precision and recall in identifying duplicate web pages.
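The pipeline described in the abstract (build shingles, filter them with a per-document feature set, then compare documents) can be illustrated with a minimal sketch. The abstract does not specify how the feature vectors are built or how they filter shingles, so everything below is an assumption for illustration only: the hypothetical `feature_terms` helper simply takes the most frequent tokens, `filtered_shingles` keeps shingles containing at least one such token, and `resemblance` uses the Jaccard coefficient commonly associated with shingle-based comparison. The shingle width `w` and feature count `k` are arbitrary illustrative values, not the paper's settings.

```python
from collections import Counter


def shingles(tokens, w=3):
    """Return the set of contiguous w-token shingles of a token list."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def feature_terms(tokens, k=5):
    """Hypothetical stand-in for the paper's feature set:
    the k most frequent tokens, ties broken alphabetically."""
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:k])


def filtered_shingles(tokens, w=3, k=5):
    """Keep only shingles that contain at least one feature term
    (an assumed filtering rule, not taken from the paper)."""
    features = feature_terms(tokens, k)
    return {s for s in shingles(tokens, w) if features.intersection(s)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog".split()
    doc2 = "the quick brown fox leaps over the lazy dog".split()
    score = resemblance(filtered_shingles(doc1), filtered_shingles(doc2))
    print(f"resemblance = {score:.2f}")
```

In a sketch like this, two pages would be flagged as duplicates when the resemblance exceeds a chosen threshold; the threshold, like the other parameters, would need to be tuned on real data.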