Determining the Impact of Missing Values on Blocking in Record Linkage

Abstract

Record linkage is the process of integrating information about the same underlying entity across disparate data sets. This process, which is increasingly used to build accurate representations of individuals and organizations for applications ranging from creditworthiness assessment to continuity of medical care, can be computationally intensive because it requires comparing large numbers of records over a range of attributes. Blocking methods, which limit the number of record pair comparisons that need to be performed, are therefore critical for scaling record linkage to big data settings. These methods group potential matches into blocks, often using a subset of attributes, before a final comparator function predicts which record pairs within the blocks correspond to matches. Yet data corruption and missing values adversely affect the performance of blocking methods (e.g., they may cause some matching records not to be placed in the same block). While there has been some investigation into the impact of missing values on general record linkage techniques (e.g., the comparator function), no study has addressed the impact of missing values on blocking methods. To address this gap, we perform a detailed, systematic empirical analysis of the individual and joint impact of missing values and data corruption on different blocking methods using realistic data sets. Our results show that blocking approaches that do not depend on a single type of blocking attribute are more robust against missing values. In addition, our results indicate that blocking parameters must be chosen carefully for different blocking techniques.
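The effect described in the abstract can be illustrated with a minimal sketch. This is not the authors' experimental setup. The toy records, the field names, and the helper functions below are all hypothetical. The two measures, pairs completeness and reduction ratio, are common blocking metrics assumed here for illustration; the abstract does not name them. The sketch groups records by exact key value (standard blocking), leaves records with a missing key out of every block, and then compares single-key blocking with the union of two passes.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record_id, entity_id, surname, zip_code).
# None marks a missing value; entity_id is ground truth used only for evaluation.
RECORDS = [
    ("a1", 1, "smith", "37203"),
    ("a2", 1, "smith", None),
    ("b1", 2, "jones", "75080"),
    ("b2", 2, None, "75080"),
    ("c1", 3, "lee", "10001"),
    ("c2", 3, "lee", "10001"),
]


def pairs_within(groups):
    """Return every unordered pair of record ids that share a group."""
    pairs = set()
    for ids in groups.values():
        pairs.update(frozenset(p) for p in combinations(ids, 2))
    return pairs


def block_pairs(records, key_index):
    """Standard blocking: candidates share the same non-missing key value."""
    blocks = defaultdict(list)
    for rec in records:
        key = rec[key_index]
        if key is not None:  # a record with a missing key lands in no block
            blocks[key].append(rec[0])
    return pairs_within(blocks)


def true_pairs(records):
    """Ground-truth matches: pairs of records with the same entity_id."""
    by_entity = defaultdict(list)
    for rec in records:
        by_entity[rec[1]].append(rec[0])
    return pairs_within(by_entity)


def pairs_completeness(candidates, truth):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & truth) / len(truth)


def reduction_ratio(candidates, n_records):
    """Fraction of all possible pairs that blocking avoids comparing."""
    total = n_records * (n_records - 1) // 2
    return 1 - len(candidates) / total


truth = true_pairs(RECORDS)
by_surname = block_pairs(RECORDS, 2)
by_zip = block_pairs(RECORDS, 3)
multi_pass = by_surname | by_zip  # union of candidates from both passes

for name, cand in [("surname", by_surname), ("zip", by_zip), ("both", multi_pass)]:
    pc = pairs_completeness(cand, truth)
    rr = reduction_ratio(cand, len(RECORDS))
    print(f"{name}: pairs completeness={pc:.2f}, reduction ratio={rr:.2f}")
```

On this toy data, each single-key pass recovers two of the three true pairs, because one matching record has a missing key. The union of both passes recovers all three, at the cost of one extra candidate pair. This shows only the mechanism; it says nothing about the magnitude of the effects reported in the paper.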

Bibliographic record

Source
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2019 | pp. 262-274 | 13 pages
Authors
Imrul Chowdhury Anindya; Murat Kantarcioglu; Bradley Malin
Original format: PDF
Keywords
Record linkage; Deduplication; Missing values; Blocking methods; Data corruption

