Journal: Big Data Research

Entity Resolution with Recursive Blocking

Abstract

Entity resolution is a well-known challenge in data management: records lack unique identifiers and the data contain various hidden errors, both of which undermine the identifiability of the entities the records refer to. To reveal matching records, every record potentially needs to be compared with all other records in the database, which is computationally prohibitive even for moderately sized databases. To circumvent this quadratic cost, blocking methods are typically employed to restrict comparisons to promising pairs within small subsets of records, called blocks. Existing effective methods typically rely on blocking keys created by experts to capture matches, which inevitably involves a large amount of human labor and does not guarantee high-quality results. To reduce manual labor and improve accuracy, machine learning approaches have been investigated, but with limited success, owing to their high training-data requirements and their inefficiency, especially on large databases. The exhaustive method produces exact results but suffers from efficiency problems. In this paper, we propose a divide-and-conquer paradigm for entity resolution, named recursive blocking, which derives comparatively good results while largely alleviating efficiency concerns. Specifically, recursive blocking refines blocks and traps matches in an iterative fashion to derive high-quality results. We study two types of recursive blocking, i.e., redundancy-based and partition-based approaches, and investigate their relative performance. Comprehensive experiments on both real-world and synthetic datasets verify the superiority of our approaches over existing ones. (C) 2020 Elsevier Inc. All rights reserved.
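
To make the idea of blocking concrete, the following is a minimal Python sketch and not the paper's algorithm. It contrasts exhaustive pairwise comparison with a simple partition-style refinement that keeps splitting oversized blocks using successive blocking keys. The toy records, the two key functions, the `max_size` threshold, and all function names are assumptions introduced only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def exhaustive_pairs(records):
    """All n*(n-1)/2 candidate pairs: the quadratic baseline."""
    return set(combinations(sorted(records), 2))


def block_by(records, ids, key):
    """Group record ids by the value of a blocking key function."""
    blocks = defaultdict(list)
    for rid in ids:
        blocks[key(records[rid])].append(rid)
    return list(blocks.values())


def recursive_blocks(records, ids, keys, max_size):
    """Split a block with successive keys until it is small enough
    or the keys run out (hypothetical partition-style refinement)."""
    if len(ids) <= max_size or not keys:
        return [ids]
    result = []
    for sub in block_by(records, ids, keys[0]):
        result.extend(recursive_blocks(records, sub, keys[1:], max_size))
    return result


def candidate_pairs(blocks):
    """Pairs are compared only within each block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


# Toy data, invented for this sketch.
records = {
    1: {"name": "john smith", "city": "boston"},
    2: {"name": "jon smith", "city": "boston"},
    3: {"name": "jane doe", "city": "boston"},
    4: {"name": "j. doe", "city": "austin"},
    5: {"name": "mary jones", "city": "austin"},
    6: {"name": "marie jones", "city": "austin"},
}

# Hypothetical blocking keys: city first, then surname.
keys = [
    lambda r: r["city"],
    lambda r: r["name"].split()[-1],
]

all_pairs = exhaustive_pairs(records)
blocks = recursive_blocks(records, list(records), keys, max_size=2)
pairs = candidate_pairs(blocks)

print(len(all_pairs), "pairs without blocking")
print(len(pairs), "pairs after recursive blocking:", sorted(pairs))
```

In this toy run the candidate set shrinks from 15 pairs to 2. Records 3 and 4 share a surname but never meet, because the first key places them in different blocks. A strict partition can lose candidate pairs in this way, which is one reason to also consider redundancy-based schemes that allow a record to appear in more than one block.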
