International Journal of Population Data Science

A method for Linking Multiple De-identified Datasets



Abstract

Background
National Statistics Institutes have been exploring the value of using administrative data. The Administrative Data Team within Scotland's Census 2021 Programme is exploring how to bring administrative datasets together to support the census and to produce alternative population estimates.

Objectives
We are developing methods to link de-identified administrative datasets, drawing on existing methods.

Methods
Our method uses hashed linking variables derived from name, address, date of birth and gender. One linking variable is a names correction, produced by comparing each name to every name in a reference set and scoring the difference. The scoring algorithm we developed considers transpositions, deletions, insertions, substitutions and moves, and is sensitive to the particular letters involved. Linking variables are combined at run time to produce thousands of matchkeys, allowing more matches to be linked deterministically using hashed data. Overall link strength scores are calculated as a combination of (1) penalties associated with the matchkey, based on the linking variables used, and (2) similarity of dates of birth, measured at run time using weighted Bloom filters.

We concatenate all the datasets and link the resulting dataset to itself. This allows simultaneous linking across all datasets and resolution of duplicate records within each dataset, but it can produce complex patterns of links. By treating the records and links as a graph, we allocate records to unique individuals through a vertex colouring algorithm applied to the complement of each connected component. Link strength is used to prioritise allocation.

Findings
Clerical review of the links made found that links with stronger scores were more likely to be judged a match.

Conclusions
This linking method is being used and tested further in linking administrative datasets for population estimates. We also plan to use it for several linking tasks in the processing of Scotland's Census 2021.
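The two deterministic steps described under Methods, matchkey linking on hashed variables and allocation of linked records to individuals, can be illustrated with a minimal sketch. It is not the authors' implementation. The matchkey definitions, penalty values, salt, record fields and function names below are all hypothetical, and the names correction and the Bloom filter similarity on dates of birth are left out. For allocation, the sketch swaps in a simple greedy clique partition: it merges groups along the strongest links first and only when every pair of records across the two groups is linked. Any partition of a component into fully linked groups is a proper colouring of that component's complement graph, so this is one simple way to obtain such a colouring, not necessarily the one used in the paper.

```python
import hashlib
from itertools import combinations

# Hypothetical matchkeys: (linking variables used, assumed penalty; lower = stronger).
MATCHKEYS = [
    (("name", "dob", "address", "gender"), 0),
    (("name", "dob", "gender"), 1),
    (("name", "address"), 2),
]

def hash_value(value, salt="demo-salt"):
    """Salted SHA-256 digest; the salt is a placeholder, not from the source."""
    return hashlib.sha256((salt + "|" + value).encode("utf-8")).hexdigest()

def deidentify(record):
    """Replace every clear-text linking variable with a hash."""
    return {field: hash_value(field + ":" + value) for field, value in record.items()}

def find_links(hashed):
    """Link records that agree on a matchkey, keeping the lowest penalty per pair."""
    links = {}
    for fields, penalty in MATCHKEYS:
        buckets = {}
        for rid, rec in hashed.items():
            # The matchkey is built at run time from already-hashed variables.
            key = hash_value("|".join(rec[f] for f in fields))
            buckets.setdefault(key, []).append(rid)
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                links[(a, b)] = min(penalty, links.get((a, b), penalty))
    return links

def allocate(record_ids, links):
    """Greedy clique partition: strongest links first, merge only fully linked groups."""
    group_of = {rid: {rid} for rid in record_ids}
    for (a, b), _ in sorted(links.items(), key=lambda kv: (kv[1], kv[0])):
        ga, gb = group_of[a], group_of[b]
        if ga is gb:
            continue
        # Two records with no direct link are adjacent in the complement graph,
        # so they may not share a colour (that is, an individual).
        if all((min(x, y), max(x, y)) in links for x in ga for y in gb):
            merged = ga | gb
            for rid in merged:
                group_of[rid] = merged
    unique = {frozenset(g) for g in group_of.values()}
    return sorted(sorted(g) for g in unique)

# Toy concatenated dataset (invented records for illustration only).
records = {
    "r1": {"name": "ann smith", "dob": "1980-01-02", "address": "1 high st", "gender": "f"},
    "r2": {"name": "ann smith", "dob": "1980-01-02", "address": "1 high st", "gender": "f"},
    "r3": {"name": "ann smith", "dob": "1980-01-02", "address": "9 low rd", "gender": "f"},
    "r4": {"name": "ann smith", "dob": "1975-05-05", "address": "9 low rd", "gender": "f"},
}
hashed = {rid: deidentify(rec) for rid, rec in records.items()}
links = find_links(hashed)
people = allocate(hashed.keys(), links)
print(links)
print(people)
```

In this toy data all four records fall in one connected component, but r4 is linked only to r3, and only by the weakest matchkey. Because r4 has no link to r1 or r2, it cannot share a group with them, so the component is split into two individuals, with the stronger links deciding which group r3 joins.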
