International Conference on Parallel and Distributed Computing, Applications and Technologies

Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages



Abstract

The large number of duplicate and near-duplicate web pages on the Internet creates many problems for search engines. No single existing duplicate or near-duplicate document detection algorithm achieves both good performance and high accuracy, and most of them are designed for English documents and cannot be applied directly to Chinese ones. This paper presents an integrated algorithm, KMatch, for near-duplicate document detection over large-scale Chinese web pages. First, KMatch employs a Chinese word-segmentation algorithm to turn Chinese text into meaningful features and compress the documents. A keyword-matching technique is then used to improve detection accuracy. For further accuracy improvement, KMatch also combines the IMatch algorithm to filter out the noisy content of a web page and retain the body text. To improve detection performance, we integrate the Shingling algorithm to compress huge datasets into smaller ones. Finally, to further improve detection performance on large-scale Chinese web pages, we design and implement a parallel version of KMatch with MapReduce. The experimental results show that our approach achieves both high precision and high recall, and that the MapReduce-parallelized algorithm attains good performance and scalability when dealing with large-scale datasets.
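As a rough illustration of the shingling step mentioned in the abstract, the sketch below segments a toy Chinese document, hashes every window of w tokens into a compact fingerprint, and compares two documents by Jaccard resemblance. This is only a minimal sketch under assumed details: the segment helper is a hypothetical character-level stand-in for the paper's Chinese segmentation step, and the window size w and the MD5-based fingerprint are arbitrary choices; it does not reproduce the actual KMatch or IMatch implementation.

    import hashlib
    from typing import List, Set

    def segment(text: str) -> List[str]:
        # Hypothetical stand-in for the Chinese word-segmentation step:
        # split into individual characters. KMatch itself relies on a real
        # segmenter to produce meaningful word features.
        return [ch for ch in text if not ch.isspace()]

    def shingles(tokens: List[str], w: int = 4) -> Set[int]:
        # w-shingling: hash every window of w consecutive tokens into a
        # compact integer fingerprint, compressing the document into a set.
        grams = ("".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1))
        return {int(hashlib.md5(g.encode("utf-8")).hexdigest()[:16], 16) for g in grams}

    def resemblance(a: Set[int], b: Set[int]) -> float:
        # Jaccard resemblance of two shingle sets; document pairs above a
        # chosen threshold are reported as near-duplicates.
        return len(a & b) / len(a | b) if (a or b) else 1.0

    doc1 = "大规模中文网页的近重复文档检测示例文本"
    doc2 = "大规模中文网页近重复文档检测的一段示例文本"
    print(round(resemblance(shingles(segment(doc1)), shingles(segment(doc2))), 2))

In a MapReduce setting such as the one described, the mapper would typically emit (fingerprint, document-id) pairs and the reducer would group documents sharing fingerprints before computing resemblance, but the exact partitioning used by KMatch is given in the paper, not here.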

