
Hadoop Based Parallel Deduplication Method for Web Documents



Abstract

This paper proposes a method for removing duplicate web pages based on tf-idf and a splay tree. Keywords extracted with TextRank are used to group pages that are likely to be duplicates of one another, and the pages within each group are then compared using the tf-idf and splay-tree procedure. Three MapReduce jobs implement the tf-idf computation and the removal of duplicate pages. Experimental results show that the algorithm removes duplicate web pages efficiently and accurately.
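The abstract does not give the layout of the three MapReduce jobs. As a rough illustration of what the first stage of such a tf-idf pipeline might look like, the sketch below counts occurrences of each (term, document) pair on Hadoop; later stages would normalize by document length and join with document frequencies to obtain tf-idf weights, and the class name `TermFrequencyJob` and the "docId \t text" input format are assumptions, not taken from the paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical first stage of a tf-idf pipeline: emit a count of 1 for every
// occurrence of a term in a document, then sum the counts per (term, docId) key.
public class TermFrequencyJob {

    public static class TermMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text termInDoc = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: one document per line as "docId \t text".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            for (String term : parts[1].toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                termInDoc.set(term + "@" + parts[0]);
                context.write(termInDoc, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term frequency");
        job.setJarByClass(TermFrequencyJob.class);
        job.setMapperClass(TermMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The paper's remaining stages (inverse document frequency, TextRank-based grouping, and splay-tree comparison of candidate duplicates) would follow as additional jobs; they are not reproduced here.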
