A method for Linking Multiple De-identified Datasets

Andrew Waugh; David Rowley; Auren Clarke

Abstract

Background

National Statistics Institutes have been exploring the value of using administrative data. The Administrative Data Team within the Scotland’s Census 2021 Programme are exploring bringing administrative datasets together to support the census and produce alternative population estimates.

Objectives

We are developing methods to link de-identified administrative datasets, drawing on existing methods.

Methods

Our method uses hashed linking variables derived from name, address, date of birth and gender. One linking variable is a names correction, produced by comparing each name to every name in a reference set and scoring the difference. The scoring algorithm considers transpositions, deletions, insertions, substitutions and moves, and is sensitive to the particular letters involved. Linking variables are combined at run time to produce thousands of matchkeys, allowing more matches to be linked deterministically using hashed data. Overall link strength scores are calculated as a combination of (1) penalties associated with the matchkey, based on the linking variables used, and (2) similarity on dates of birth, measured at run time using weighted Bloom filters.

We concatenate all the datasets and link the resulting dataset to itself. This allows simultaneous linking across all datasets and resolution of duplicate records within each dataset, but it can produce complex patterns of links. Treating the records and links as a graph, we allocate records to unique individuals through a vertex colouring algorithm applied to the complement of each connected component. Link strength is used to prioritise allocation.

Findings

Clerical review of the links made found that those with stronger scores were more likely to be considered a match.

Conclusions

This linking method is being used and tested further in linking administrative datasets for population estimates. We also plan to use it for several linking tasks in the processing of Scotland’s Census 2021.
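The allocation step can be illustrated with a small sketch. The abstract does not say which colouring algorithm is used, so what follows is only a greedy heuristic under stated assumptions: the function names `components` and `allocate`, the record identifiers, the scores and the visiting order are all hypothetical. The underlying idea is that two records may share a colour in the complement graph only if they are directly linked in the original graph, so each colour class is a set of mutually linked records that can be treated as one individual.

```python
from collections import defaultdict


def components(records, links):
    """Connected components of the link graph (iterative depth-first search)."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in records:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def allocate(records, strength):
    """Greedy colouring of the complement of each component.

    strength maps frozenset({a, b}) to a link score (higher = stronger).
    A record may join a colour class only if it is linked to every member,
    which is exactly the condition for a proper colouring of the complement.
    The ordering and tie-breaking rules are assumptions, not the paper's.
    """
    def linked(a, b):
        return frozenset((a, b)) in strength

    def strongest(v):
        return max((s for pair, s in strength.items() if v in pair), default=0.0)

    individuals = []
    for comp in components(records, [tuple(pair) for pair in strength]):
        classes = []
        # Visit records with the strongest incident link first.
        for v in sorted(comp, key=lambda r: (-strongest(r), r)):
            candidates = [c for c in classes if all(linked(v, u) for u in c)]
            if candidates:
                # Prefer the class with the greatest total link strength to v.
                best = max(
                    candidates,
                    key=lambda c: sum(strength[frozenset((v, u))] for u in c),
                )
                best.append(v)
            else:
                classes.append([v])
        individuals.extend(classes)
    return individuals


# Hypothetical records from three datasets A, B and C; B1 and B2 are not linked.
records = ["A1", "B1", "B2", "C1"]
strength = {
    frozenset(("A1", "B1")): 0.9,
    frozenset(("A1", "B2")): 0.6,
    frozenset(("B1", "C1")): 0.8,
    frozenset(("A1", "C1")): 0.7,
}
result = allocate(records, strength)
print(result)  # [['A1', 'B1', 'C1'], ['B2']]
```

In this toy example all four records fall in one component, but B2 is linked only to A1, so it cannot share a colour with B1 or C1 and is allocated to a separate individual.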

Bibliographic details

Source: International Journal of Population Data Science, 2018, Issue 2
Authors: Andrew Waugh; David Rowley; Auren Clarke
Original format: PDF
Chinese Library Classification: Economics
