International Conference on Parallel and Distributed Computing, Applications and Technologies

Parallelized Near-Duplicate Document Detection Algorithm for Large Scale Chinese Web Pages



Abstract

The large number of duplicate and near-duplicate web pages on the Internet creates many problems for search engines. No single existing duplicate or near-duplicate document detection algorithm achieves both good performance and high accuracy, and most of them are designed for English documents and cannot be applied directly to Chinese ones. This paper presents an integrated algorithm, KMatch, for near-duplicate document detection over large-scale Chinese web pages. First, KMatch employs a Chinese word-segmentation algorithm to turn Chinese text into meaningful features and compress the documents. A keyword-matching technique is then used to improve detection accuracy. For further accuracy improvement, KMatch also combines the IMatch algorithm to filter out the noisy content of a web page and retain the body text. To improve detection performance, we integrate the Shingling algorithm to compress huge datasets into smaller ones. Finally, to further improve detection performance on large-scale Chinese web pages, we design and implement a parallel version of KMatch with MapReduce. The experimental results show that our approach achieves both high precision and high recall, and that the MapReduce-parallelized algorithm attains good performance and scalability when dealing with large-scale datasets.
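As a rough illustration of the shingling step mentioned in the abstract, the sketch below segments a toy Chinese document, hashes every window of w tokens into a compact fingerprint, and compares two documents by Jaccard resemblance. This is only a minimal sketch under assumed details: the segment helper is a hypothetical character-level stand-in for the paper's Chinese segmentation step, and the window size w and the MD5-based fingerprint are arbitrary choices; it does not reproduce the actual KMatch or IMatch implementation.

    import hashlib
    from typing import List, Set

    def segment(text: str) -> List[str]:
        # Hypothetical stand-in for the Chinese word-segmentation step:
        # split into individual characters. KMatch itself relies on a real
        # segmenter to produce meaningful word features.
        return [ch for ch in text if not ch.isspace()]

    def shingles(tokens: List[str], w: int = 4) -> Set[int]:
        # w-shingling: hash every window of w consecutive tokens into a
        # compact integer fingerprint, compressing the document into a set.
        grams = ("".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1))
        return {int(hashlib.md5(g.encode("utf-8")).hexdigest()[:16], 16) for g in grams}

    def resemblance(a: Set[int], b: Set[int]) -> float:
        # Jaccard resemblance of two shingle sets; document pairs above a
        # chosen threshold are reported as near-duplicates.
        return len(a & b) / len(a | b) if (a or b) else 1.0

    doc1 = "大规模中文网页的近重复文档检测示例文本"
    doc2 = "大规模中文网页近重复文档检测的一段示例文本"
    print(round(resemblance(shingles(segment(doc1)), shingles(segment(doc2))), 2))

In a MapReduce setting such as the one described, the mapper would typically emit (fingerprint, document-id) pairs and the reducer would group documents sharing fingerprints before computing resemblance, but the exact partitioning used by KMatch is given in the paper, not here.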

