Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining

Determining the Impact of Missing Values on Blocking in Record Linkage

Abstract

Record linkage is the process of integrating information about the same underlying entity across disparate data sets. It is increasingly used to build accurate representations of individuals and organizations for applications ranging from creditworthiness assessment to continuity of medical care, but it can be computationally intensive because it requires comparing large numbers of records over a range of attributes. Blocking methods, which limit the number of record pair comparisons that need to be performed, are therefore critical for scaling record linkage to big data settings. These methods group potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely affect the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this gap, we perform a detailed, systematic empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on a single type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques.
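To make the effect described in the abstract concrete, the following is a minimal Python sketch of key-based blocking. It is not the authors' method or data: the records, the field names (`surname`, `zip`), the helper functions, and the choice of pairs completeness (the fraction of true matches that survive blocking) as the measure are all assumptions made for illustration. It shows how a missing value in the only blocking attribute can keep a matching pair out of every block, and how adding a second, different blocking attribute can recover it.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs):
    """Return record-id pairs that share at least one non-missing blocking key."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            key = key_func(rec)
            if key is None:
                # Missing value: the record cannot be placed in any block for this key.
                continue
            blocks[key].append(rec_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def by_surname(rec):
    """Hypothetical blocking key: lower-cased surname, or None if missing."""
    value = rec.get("surname")
    return value.lower() if value else None


def by_zip(rec):
    """Hypothetical blocking key: postal code, or None if missing."""
    value = rec.get("zip")
    return value if value else None


# Toy data (invented): b1 is the same person as a1, but its surname is missing.
records = {
    "a1": {"surname": "Smith", "zip": "37203"},
    "b1": {"surname": None, "zip": "37203"},
    "a2": {"surname": "Jones", "zip": "10001"},
    "b2": {"surname": "Jones", "zip": "10001"},
}
true_matches = {("a1", "b1"), ("a2", "b2")}


def pairs_completeness(candidates):
    """Fraction of true matching pairs retained by the blocking step."""
    return len(candidates & true_matches) / len(true_matches)


single = candidate_pairs(records, [by_surname])          # one blocking attribute
multi = candidate_pairs(records, [by_surname, by_zip])   # two different attributes

print(pairs_completeness(single))  # 0.5: the (a1, b1) match is lost
print(pairs_completeness(multi))   # 1.0: the postal code recovers it
```

In this toy setting the second key costs nothing extra, but on real data each additional blocking key also enlarges the candidate set, which is one reason blocking parameters need to be chosen with care.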
