Human-Centric Debugging of Entity Matching.

Abstract

Entity matching (EM) is the problem of finding data records that refer to the same real-world entity. For example, the two records (Matthew Richardson, 206-453-1978) and (Matt W. Richardson, 453 1978) may refer to the same person. It is an important data integration problem with many applications such as in e-commerce, healthcare, and national security. Recent work on entity matching has focused on using machine learning and/or crowdsourcing in order to improve accuracy and/or scale the current matching solutions despite the fact that this task is typically done with a human analyst in the loop. Therefore, in this thesis we propose to work on solutions that acknowledge that humans are in the loop for completing an entity matching task. We focus on debugging of entity matching, which is an iterative process by which an analyst improves matching quality. Hence the title, "Human-Centric Debugging of Entity Matching".

We build an end-to-end matching system and experiment with it in an e-commerce setting as well as with students in a graduate data modeling course at UW-Madison. We also develop an abstract model of the entity matching problem for an analyst to understand what makes an entity matching problem hard for an analyst. The insights learned in the above work lead to the following works in the rest of the thesis: First, we focus on debugging rule-based matchers and we attempt to make it an interactive process by which an analyst can quickly iterate and find a high quality matcher. We show that by optimally ordering the rules as well as incrementally running the matcher on top of previous matching output we can decrease runtime significantly. And second, we focus on debugging of entity matching data sets. We develop a framework to help an analyst quickly find and resolve inconsistencies in a data set. We experiment with seven real-world data sets and demonstrate the effectiveness of our framework in finding inconsistencies.
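To make the abstract's example concrete, here is a minimal sketch of a rule-based matcher applied to the two records it mentions. It is not the thesis's system: the record layout, the rule names (`last_name_match`, `phone_suffix_match`), the seven-digit threshold, and the third record are all assumptions made up for illustration.

```python
import re


def digits(phone: str) -> str:
    """Keep only the digit characters of a phone string."""
    return re.sub(r"\D", "", phone)


def last_name_match(a: dict, b: dict) -> bool:
    """Hypothetical rule: last name tokens are equal, ignoring case."""
    return a["name"].split()[-1].lower() == b["name"].split()[-1].lower()


def phone_suffix_match(a: dict, b: dict) -> bool:
    """Hypothetical rule: the shorter number (7+ digits) ends the longer one."""
    short, long_ = sorted((digits(a["phone"]), digits(b["phone"])), key=len)
    return len(short) >= 7 and long_.endswith(short)


def is_match(a: dict, b: dict, rules) -> bool:
    """Conjunctive matcher: all rules must hold; all() stops at the first failure."""
    return all(rule(a, b) for rule in rules)


# The two records from the abstract, plus an invented non-matching record.
r1 = {"name": "Matthew Richardson", "phone": "206-453-1978"}
r2 = {"name": "Matt W. Richardson", "phone": "453 1978"}
r3 = {"name": "Mary Richards", "phone": "608-555-0100"}

rules = [last_name_match, phone_suffix_match]
result_same = is_match(r1, r2, rules)
result_diff = is_match(r1, r3, rules)
```

Because `all()` short-circuits, the order of `rules` changes how much work is done per pair: placing a cheap rule that rejects most pairs first avoids evaluating later rules. This only illustrates why rule order can affect runtime; it makes no claim about how the thesis orders rules or runs the matcher incrementally.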

Bibliographic record

  • Author: Panahi, Fatemah
  • Author affiliation: The University of Wisconsin - Madison
  • Degree-granting institution: The University of Wisconsin - Madison
  • Subject: Computer science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 151
  • Original format: PDF
  • Language: English
