
Hadoop Based Parallel Deduplication Method for Web Documents



Abstract

This paper proposes a method for removing duplicate web pages based on tf-idf and a splay tree. Keywords extracted with TextRank are used to group pages that are likely to be duplicates of one another, and the pages within each group are then compared using the tf-idf and splay-tree procedure. Three MapReduce jobs implement the tf-idf computation and the removal of duplicate pages. Experimental results show that the algorithm removes duplicate web pages efficiently and accurately.
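The abstract does not give the layout of the three MapReduce jobs. As a rough illustration of what the first stage of such a tf-idf pipeline might look like, the sketch below counts occurrences of each (term, document) pair on Hadoop; later stages would normalize by document length and join with document frequencies to obtain tf-idf weights, and the class name `TermFrequencyJob` and the "docId \t text" input format are assumptions, not taken from the paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical first stage of a tf-idf pipeline: emit a count of 1 for every
// occurrence of a term in a document, then sum the counts per (term, docId) key.
public class TermFrequencyJob {

    public static class TermMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text termInDoc = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: one document per line as "docId \t text".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            for (String term : parts[1].toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                termInDoc.set(term + "@" + parts[0]);
                context.write(termInDoc, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term frequency");
        job.setJarByClass(TermFrequencyJob.class);
        job.setMapperClass(TermMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The paper's remaining stages (inverse document frequency, TextRank-based grouping, and splay-tree comparison of candidate duplicates) would follow as additional jobs; they are not reproduced here.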
