
Bayesian Logistic Regression with Jaro-Winkler String Comparator Scores Provides Sizable Improvement in Probabilistic Record Matching


Abstract

Record matching is a fundamental and ubiquitous part of today's society. Anything from typing in a password to access your email to connecting existing health records in California with new health records in New York requires matching records together. In general, there are two types of record matching algorithms: deterministic, a rules-based approach, and probabilistic, a model-based approach. Both types have their advantages and disadvantages. If the amount of data is relatively small, deterministic algorithms yield very high success rates. However, the number of common mistakes, and of the rules needed to handle them, becomes astronomically large as the sizes of the datasets increase. This makes updating and maintaining the matching algorithm a highly labor-intensive process. On the other hand, probabilistic record matching implements a mathematical model that can take into account keying mistakes, does not require as much maintenance and overhead, and provides a probability that two particular entities should be linked. At the same time, because it is a model, its assumptions need to be met, its fit has to be assessed, and its predictions can be incorrect. Regardless of the type of algorithm, nearly all utilize a 0/1 field-matching structure, including the Fellegi-Sunter algorithm from 1969. That is to say, either the fields match entirely or they do not match at all. As a result, fields that differ only by a typographical error are scored as non-matches, and false negatives can result. My research shows that using Jaro-Winkler string comparator scores as predictors in a Bayesian logistic regression model, in lieu of a restrictive binary structure, yields marginal improvement over current methodologies.
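
To make the contrast between 0/1 field matching and a graded string comparator concrete, the following is a minimal sketch of the standard Jaro-Winkler similarity in plain Python. It is not taken from the thesis: the function names (`exact_match`, `jaro`, `jaro_winkler`), the prefix scale of 0.1, and the prefix cap of 4 characters are assumptions based on the commonly published definition, and some variants apply the prefix bonus only above a Jaro threshold, which this sketch omits.

```python
def exact_match(a: str, b: str) -> int:
    """Binary 0/1 field comparison: 1 only if the fields are identical."""
    return int(a == b)


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    # A transposed pair of letters: the binary comparison reports a
    # non-match, while the graded score stays close to 1.
    print(exact_match("MARTHA", "MARHTA"))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Under the approach the abstract describes, a score like this for each field would enter the logistic regression as a continuous predictor in place of the 0/1 indicator. Fitting the Bayesian model itself is outside the scope of this sketch.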

Bibliographic details

  • Author

    Jann Dominic 1983-

  • Year 2013
  • Original format PDF
