
MapReduce-based entity matching with multiple blocking functions



Abstract

Entity matching, which aims to find records that refer to the same real-world object, has been studied for decades. To avoid verifying every pair of records in a massive data set, a common approach, known as the blocking-based method, selects only a small proportion of record pairs for verification, at a cost far lower than O(n^2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical, since many more matching records can be found this way, which significantly improves the quality of the query result. The MapReduce (MR) framework is widely used to improve the performance and scalability of complicated queries by running many map (or reduce) tasks in parallel. However, entity matching on the MapReduce framework is non-trivial because of two unavoidable challenges: load balancing and pair deduplication. Existing work can handle load balancing and pair deduplication separately, but not both at the same time. In this paper, we propose a novel solution, called MrEm, that handles both challenges while supporting multiple blocking functions. Theoretical analysis and experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed solution.
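To make the terms in the abstract concrete, the following is a minimal single-machine sketch of blocking with multiple blocking functions and of why pair deduplication becomes necessary. It is not the MrEm algorithm and does not use MapReduce; the toy records, the two blocking functions (`block_by_name_prefix`, `block_by_city`), and the helper `candidate_pairs` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, name, city). Not taken from the paper.
records = [
    (1, "john smith", "shanghai"),
    (2, "john smyth", "shanghai"),
    (3, "jon smith", "beijing"),
    (4, "mary jones", "beijing"),
]

# Two illustrative blocking functions; each maps a record to a block key.
def block_by_name_prefix(rec):
    return "name:" + rec[1][:3]

def block_by_city(rec):
    return "city:" + rec[2]

BLOCKING_FUNCTIONS = [block_by_name_prefix, block_by_city]

def candidate_pairs(records, blocking_functions):
    """Apply every blocking function independently and pair up records
    that share a block. Returns (raw pairs, deduplicated pairs)."""
    blocks = defaultdict(list)
    for rec in records:
        for fn in blocking_functions:
            blocks[fn(rec)].append(rec[0])

    raw = []
    for ids in blocks.values():
        # Only records inside the same block are compared.
        raw.extend(combinations(sorted(ids), 2))

    # The same pair can appear under several blocking functions,
    # so duplicates are removed before verification.
    return raw, set(raw)

raw_pairs, unique_pairs = candidate_pairs(records, BLOCKING_FUNCTIONS)
all_pairs = len(records) * (len(records) - 1) // 2

print("all pairs:", all_pairs)
print("raw candidate pairs:", len(raw_pairs))
print("unique candidate pairs:", sorted(unique_pairs))
```

In this sketch, records 1 and 2 share both a name-prefix block and a city block, so the pair (1, 2) is generated twice. A plain in-memory set removes the duplicate here; in a distributed setting the blocks live on different reduce tasks, which is why deduplication, together with balancing unevenly sized blocks across tasks, is described in the abstract as a challenge.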

Bibliographic details

  • Source
    Frontiers of computer science in China | 2017, Issue 5 | pp. 895-911 | 17 pages
  • Author affiliations

    Institute for Data Science and Engineering, School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China;

    Institute for Data Science and Engineering, School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China;

    Institute for Data Science and Engineering, School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China;

  • Indexing information
  • Original format: PDF
  • Language of text: English
  • CLC classification
  • Keywords

    entity matching; MapReduce; load balancing; pair deduplication;

