
Efficient and exact duplicate detection on cloud



Abstract

As the recent proliferation of social networks, mobile applications, and online services has increased the rate of data gathering, finding near-duplicate records efficiently has become a challenging problem. Related work on this problem mainly proposes efficient approaches for a single machine. However, when processing large-scale datasets, the performance of duplicate identification is still far from satisfactory. In this paper, we address duplicate detection using MapReduce. We argue that the performance of MapReduce-based duplicate detection depends mainly on the number of candidate record pairs and on the intermediate result size, which is related to the shuffle cost among the nodes in the cluster. We propose a new signature scheme with new pruning strategies to minimize the number of candidate pairs and the intermediate result size. The proposed solution is exact: it ensures that no duplicate record pair is lost. Experimental results on both real and synthetic datasets demonstrate that the proposed signature-based method is efficient and scalable.
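The abstract does not describe the signature scheme itself, so the following is only a minimal sketch of the general idea, under stated assumptions. It substitutes standard prefix filtering for Jaccard similarity in place of the paper's scheme, orders tokens lexicographically, and simulates the map, shuffle, and reduce steps in a single Python process rather than on a cluster. All function names are hypothetical. The sketch shows how signatures limit both the emitted (signature, record id) pairs, which stand in for the intermediate result, and the candidate pairs that need verification, while still returning every pair at or above the threshold.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty token sets."""
    return Fraction(len(a & b), len(a | b))


def prefix_signatures(tokens, threshold):
    """Prefix tokens of one record under an assumed lexicographic order."""
    ordered = sorted(tokens)
    # Standard prefix-filter length for a Jaccard threshold.
    prefix_len = len(ordered) - ceil(threshold * len(ordered)) + 1
    return ordered[:prefix_len]


def map_phase(records, threshold):
    """Emit one (signature, record_id) pair per prefix token."""
    for rid, tokens in records.items():
        for sig in prefix_signatures(tokens, threshold):
            yield sig, rid


def shuffle(pairs):
    """Group record ids by signature, as a MapReduce shuffle would."""
    groups = defaultdict(set)
    for sig, rid in pairs:
        groups[sig].add(rid)
    return groups


def reduce_phase(groups, records, threshold):
    """Collect candidate pairs per signature and verify each one."""
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))
    matches = {(a, b) for a, b in candidates
               if jaccard(records[a], records[b]) >= threshold}
    return candidates, matches


def detect_duplicates(records, threshold):
    """Run the simulated pipeline; returns (candidates, matches)."""
    groups = shuffle(map_phase(records, threshold))
    return reduce_phase(groups, records, threshold)


def brute_force(records, threshold):
    """Reference result: compare every pair of records."""
    return {(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= threshold}
```

Passing the threshold as a `Fraction` keeps the prefix length and the final comparison free of floating-point rounding, which matters when the result is meant to be exact. Calling `detect_duplicates(records, Fraction(3, 4))` on a dictionary that maps record ids to token sets returns the verified pairs together with the candidate set, so the pruning effect can be compared against `brute_force`.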

Bibliographic record

  • Source
    Concurrency and Computation | 2013, Issue 15 | pp. 2187-2206 | 20 pages
  • Author affiliations

    Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China,School of Information, Renmin University of China, Beijing, China;

    Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China,School of Information, Renmin University of China, Beijing, China;

    Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China,School of Information, Renmin University of China, Beijing, China;

    Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China,School of Information, Renmin University of China, Beijing, China;

  • Original format: PDF
  • Language: English
  • Keywords

    duplicate detection; MapReduce; cloud;

