首页> 外文学位 >Learnable similarity functions and their application to record linkage and clustering
【24h】

Learnable similarity functions and their application to record linkage and clustering

机译:可学习的相似性函数及其在记录链接和聚类中的应用

获取原文
获取原文并翻译 | 示例

摘要

Many machine learning and data mining tasks depend on functions that estimate similarity between instances. Similarity computations are particularly important in clustering and information integration applications, where pairwise distances play a central role in many algorithms. Typically, algorithms for these tasks rely on pre-defined similarity measures, such as edit distance or cosine similarity for strings, or Euclidean distance for vector-space data. However, standard distance functions are frequently suboptimal as they do not capture the appropriate notion of similarity for a particular domain, dataset, or application.;In this thesis, we present several approaches for addressing this problem by employing learnable similarity functions. Given supervision in the form of similar or dis-similar pairs of instances, learnable similarity functions can be trained to provide accurate estimates for the domain and task at hand. We study the problem of adapting similarity functions in the context of several tasks: record linkage, clustering, and blocking. For each of these tasks, we present learnable similarity functions and training algorithms that lead to improved performance.;In record linkage, also known as duplicate detection and entity matching, the goal is to identify database records referring to the same underlying entity. This requires estimating similarity between corresponding field values of records, as well as overall similarity between records. For computing field-level similarity between strings, we describe two learnable variants of edit distance that lead to improvements in linkage accuracy. For learning record-level similarity functions, we employ Support Vector Machines to combine similarities of individual record fields in proportion to their relative importance, yielding a high-accuracy linkage system. We also investigate strategies for efficient collection of training data which can be scarce due to the pairwise nature of the record linkage task.;In clustering, similarity functions are essential as they determine the grouping of instances that is the goal of clustering. We describe a framework for integrating learnable similarity functions within a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs). The framework accommodates learning various distance measures, including those based on Bregman divergences (e.g., parameterized Mahalanobis distance and parameterized KL-divergence), as well as directional measures (e.g., cosine similarity). Thus, it is applicable to a wide range of domains and data representations. Similarity functions are learned within the HMRF-KMEANS algorithm derived from the framework, leading to significant improvements in clustering accuracy.;The third application we consider, blocking, is critical in making record linkage and clustering algorithms scalable to large datasets, as it facilitates efficient selection of approximately similar instance pairs without explicitly considering all possible pairs. Previously proposed blocking methods require manually constructing a similarity function or a set of similarity predicates, followed by hand-tuning of parameters. We propose learning blocking functions automatically from linkage and semi-supervised clustering supervision, which allows automatic construction of blocking methods that are efficient and accurate. This approach yields computationally cheap learnable similarity functions that can be used for scaling up in a variety of tasks that rely on pairwise distance computations, including record linkage and clustering.
机译:许多机器学习和数据挖掘任务取决于估计实例之间相似性的功能。在群集和信息集成应用中,成对距离在许多算法中起着核心作用,相似度计算尤其重要。通常,用于这些任务的算法依赖于预定义的相似性度量,例如字符串的编辑距离或余弦相似性,或矢量空间数据的欧几里得距离。但是,标准距离函数通常不是最佳的,因为它们没有捕获特定域,数据集或应用程序的适当相似性概念。在本文中,我们提出了几种通过采用可学习的相似性函数来解决此问题的方法。在以相似或不相似的实例对的形式进行监督的情况下,可以训练可学习的相似性函数,以提供针对当前域和手头任务的准确估计。我们研究在几个任务的上下文中适应相似性功能的问题:记录链接,聚类和阻止。对于这些任务中的每一个,我们提出可学习的相似性函数和训练算法,以提高性能。在记录链接(也称为重复检测和实体匹配)中,目标是识别引用相同基础实体的数据库记录。这需要估计记录的相应字段值之间的相似性,以及记录之间的整体相似性。为了计算字符串之间的字段级相似度,我们描述了两个可学习的编辑距离变体,它们可以提高链接精度。为了学习记录级别的相似性函数,我们使用支持向量机将各个记录字段的相似性按其相对重要性进行组合,从而产生一个高精度的链接系统。我们还研究了有效收集训练数据的策略,由于记录链接任务的成对性质,这种策略可能是稀缺的。在聚类中,相似性函数是必不可少的,因为它们确定聚类的目标是对实例进行分组。我们描述了一个框架,用于在基于隐马尔可夫随机场(HMRF)的半监督聚类概率模型内集成可学习的相似性函数。该框架可容纳学习各种距离量度,包括基于Bregman散度的量度(例如,参数化的Mahalanobis距离和参数化的KL-散度)以及方向性量度(例如,余弦相似度)。因此,它适用于广泛的域和数据表示。在从框架衍生的HMRF-KMEANS算法中学习了相似性函数,从而大大提高了聚类准确性。我们考虑的第三个应用程序阻塞对于使记录链接和聚类算法可扩展到大型数据集至关重要,因为它有助于提高效率。选择近似相似的实例对,而无需明确考虑所有可能的对。先前提出的阻塞方法需要手动构造相似性函数或一组相似性谓词,然后手动调整参数。我们建议从链接和半监督聚类监督中自动学习阻塞功能,从而可以自动构造有效且准确的阻塞方法。这种方法产生了廉价的可学习的相似性函数,可用于按比例增加成对依赖于成对距离计算的各种任务,包括记录链接和聚类。

著录项

  • 作者

    Bilenko, Mikhail Yuryevich.;

  • 作者单位

    The University of Texas at Austin.;

  • 授予单位 The University of Texas at Austin.;
  • 学科 Computer science.
  • 学位 Ph.D.
  • 年度 2006
  • 页码 152 p.
  • 总页数 152
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号