Informing, evaluating and automating the record linkage process for reliably combining disparate datasets.

Abstract

Linkage is the process of determining whether two records belong to the same person by measuring the similarity of the demographic information tied to each record. It is core to maintaining hospital patient lists, enabling information exchange, and combining datasets for research. Probabilistic linkage, a statistical approach that takes into account the reliability and discriminating power of demographic values, has become the dominant linkage method. Like many statistical methods, the performance of probabilistic linkage depends on data preparation, estimation of parameters, and properly fitting the problem to the method, all of which are steps that require human intervention. Because of the time and cost required, as well as the variability introduced by humans at each of these steps, linkage is still an active area of research.

A qualitative analysis of the linkage between an enterprise data warehouse and a population database was conducted to compare the frequency of demographic values in the two datasets with a set of records identified as potential duplicates. Doing so allowed patterns to be identified where some frequencies deviated from what was expected, and it facilitated the creation of a list of simple tools and recommendations that can be used to ensure reliability when undertaking a record linkage project.

An extension of probabilistic record linkage is introduced that preserves the statistical foundation of the method and uses additional information available from demographic values that partially match due to misspellings or typographical errors. Use of this extension results in a 25% reduction in misclassified records, and it outperforms the traditional method regardless of which demographic fields are used for comparison or which cutoff is used for partial matches.

The final study conducted was a quantitative analysis of the impact that data quality and completeness have on record linkage. Using a set of records known to exist in two datasets, this analysis enabled the cause of missed record matches to be determined.

This work investigates how the set of records that are determined to match is affected by characteristics of the datasets, proposes methods to simplify and further automate parameter estimation, and explores how linkage can be evaluated for completeness and accuracy.
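
To make the idea of probabilistic linkage with partial matches more concrete, the following is a minimal sketch of a generic Fellegi-Sunter style scoring scheme. It is not the extension developed in the dissertation, whose details are not given in the abstract. The field names, the m- and u-probabilities, the 0.8 cutoff, the use of Python's `difflib.SequenceMatcher` as the string similarity measure, and the linear interpolation of partial-match weights are all assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-field parameters (illustrative values only):
# m = P(field agrees | records belong to the same person)
# u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def agreement_weight(m, u):
    """Log-likelihood ratio contributed when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Log-likelihood ratio contributed when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_weight(a, b, m, u, partial_cutoff=0.8):
    """Weight for one field, giving partial credit to near matches."""
    if a == b:
        return agreement_weight(m, u)
    s = similarity(a, b)
    if s >= partial_cutoff:
        # Assumed scheme: interpolate linearly between the
        # disagreement and agreement weights by similarity.
        low = disagreement_weight(m, u)
        high = agreement_weight(m, u)
        return low + s * (high - low)
    return disagreement_weight(m, u)


def match_score(rec_a, rec_b, partial_cutoff=0.8):
    """Sum the field weights; higher scores suggest the same person."""
    return sum(
        field_weight(rec_a[f], rec_b[f], m, u, partial_cutoff)
        for f, (m, u) in FIELD_PARAMS.items()
    )


if __name__ == "__main__":
    a = {"last_name": "DuVall", "first_name": "Scott", "birth_year": "1980"}
    b = {"last_name": "DuVal", "first_name": "Scott", "birth_year": "1980"}
    c = {"last_name": "Smith", "first_name": "Scott", "birth_year": "1980"}
    for other in (a, b, c):
        print(other["last_name"], round(match_score(a, other), 2))
```

In this sketch, a misspelled last name scores between an exact match and an outright disagreement, so a typographical error lowers the total score without discarding the evidence in that field entirely. In practice the m- and u-probabilities would be estimated from the data rather than fixed by hand.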

Bibliographic details

  • Author: DuVall, Scott Leroy
  • Author affiliation: The University of Utah
  • Degree-granting institution: The University of Utah
  • Subject: Biology, Bioinformatics
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 125 p.
  • Total pages: 125
  • Original format: PDF
  • Language of text: English