首页> 外文学位 >Learnable similarity functions and their application to record linkage and clustering

【24h】

Learnable similarity functions and their application to record linkage and clustering

机译：可学习的相似性函数及其在记录链接和聚类中的应用

获取原文

获取原文并翻译 | 示例

页面导航

摘要
著录项
相似文献
相关主题

摘要

Many machine learning and data mining tasks depend on functions that estimate similarity between instances. Similarity computations are particularly important in clustering and information integration applications, where pairwise distances play a central role in many algorithms. Typically, algorithms for these tasks rely on pre-defined similarity measures, such as edit distance or cosine similarity for strings, or Euclidean distance for vector-space data. However, standard distance functions are frequently suboptimal as they do not capture the appropriate notion of similarity for a particular domain, dataset, or application.;In this thesis, we present several approaches for addressing this problem by employing learnable similarity functions. Given supervision in the form of similar or dis-similar pairs of instances, learnable similarity functions can be trained to provide accurate estimates for the domain and task at hand. We study the problem of adapting similarity functions in the context of several tasks: record linkage, clustering, and blocking. For each of these tasks, we present learnable similarity functions and training algorithms that lead to improved performance.;In record linkage, also known as duplicate detection and entity matching, the goal is to identify database records referring to the same underlying entity. This requires estimating similarity between corresponding field values of records, as well as overall similarity between records. For computing field-level similarity between strings, we describe two learnable variants of edit distance that lead to improvements in linkage accuracy. For learning record-level similarity functions, we employ Support Vector Machines to combine similarities of individual record fields in proportion to their relative importance, yielding a high-accuracy linkage system. We also investigate strategies for efficient collection of training data which can be scarce due to the pairwise nature of the record linkage task.;In clustering, similarity functions are essential as they determine the grouping of instances that is the goal of clustering. We describe a framework for integrating learnable similarity functions within a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs). The framework accommodates learning various distance measures, including those based on Bregman divergences (e.g., parameterized Mahalanobis distance and parameterized KL-divergence), as well as directional measures (e.g., cosine similarity). Thus, it is applicable to a wide range of domains and data representations. Similarity functions are learned within the HMRF-KMEANS algorithm derived from the framework, leading to significant improvements in clustering accuracy.;The third application we consider, blocking, is critical in making record linkage and clustering algorithms scalable to large datasets, as it facilitates efficient selection of approximately similar instance pairs without explicitly considering all possible pairs. Previously proposed blocking methods require manually constructing a similarity function or a set of similarity predicates, followed by hand-tuning of parameters. We propose learning blocking functions automatically from linkage and semi-supervised clustering supervision, which allows automatic construction of blocking methods that are efficient and accurate. This approach yields computationally cheap learnable similarity functions that can be used for scaling up in a variety of tasks that rely on pairwise distance computations, including record linkage and clustering.

机译：许多机器学习和数据挖掘任务取决于估计实例之间相似性的功能。在群集和信息集成应用中，成对距离在许多算法中起着核心作用，相似度计算尤其重要。通常，用于这些任务的算法依赖于预定义的相似性度量，例如字符串的编辑距离或余弦相似性，或矢量空间数据的欧几里得距离。但是，标准距离函数通常不是最佳的，因为它们没有捕获特定域，数据集或应用程序的适当相似性概念。在本文中，我们提出了几种通过采用可学习的相似性函数来解决此问题的方法。在以相似或不相似的实例对的形式进行监督的情况下，可以训练可学习的相似性函数，以提供针对当前域和手头任务的准确估计。我们研究在几个任务的上下文中适应相似性功能的问题：记录链接，聚类和阻止。对于这些任务中的每一个，我们提出可学习的相似性函数和训练算法，以提高性能。在记录链接（也称为重复检测和实体匹配）中，目标是识别引用相同基础实体的数据库记录。这需要估计记录的相应字段值之间的相似性，以及记录之间的整体相似性。为了计算字符串之间的字段级相似度，我们描述了两个可学习的编辑距离变体，它们可以提高链接精度。为了学习记录级别的相似性函数，我们使用支持向量机将各个记录字段的相似性按其相对重要性进行组合，从而产生一个高精度的链接系统。我们还研究了有效收集训练数据的策略，由于记录链接任务的成对性质，这种策略可能是稀缺的。在聚类中，相似性函数是必不可少的，因为它们确定聚类的目标是对实例进行分组。我们描述了一个框架，用于在基于隐马尔可夫随机场（HMRF）的半监督聚类概率模型内集成可学习的相似性函数。该框架可容纳学习各种距离量度，包括基于Bregman散度的量度（例如，参数化的Mahalanobis距离和参数化的KL-散度）以及方向性量度（例如，余弦相似度）。因此，它适用于广泛的域和数据表示。在从框架衍生的HMRF-KMEANS算法中学习了相似性函数，从而大大提高了聚类准确性。我们考虑的第三个应用程序阻塞对于使记录链接和聚类算法可扩展到大型数据集至关重要，因为它有助于提高效率。选择近似相似的实例对，而无需明确考虑所有可能的对。先前提出的阻塞方法需要手动构造相似性函数或一组相似性谓词，然后手动调整参数。我们建议从链接和半监督聚类监督中自动学习阻塞功能，从而可以自动构造有效且准确的阻塞方法。这种方法产生了廉价的可学习的相似性函数，可用于按比例增加成对依赖于成对距离计算的各种任务，包括记录链接和聚类。

著录项

作者
Bilenko, Mikhail Yuryevich.;
展开▼
作者单位

The University of Texas at Austin.;

展开▼
授予单位 The University of Texas at Austin.;
学科 Computer science.
学位 Ph.D.
年度 2006
页码 152 p.
总页数 152
原文格式 PDF
正文语种 eng
中图分类
关键词

相似文献

外文文献
中文文献
专利

1. Using clustering to learn distance functions for supervised similarity assessment [J] . Christoph F. Eick, Alain Rouhana, A. Bagherjeiran, Engineering Applications of Artificial Intelligence . 2006,第4期

机译：使用聚类学习距离函数进行监督相似性评估
2. A similarity measure for second order properties of non-stationary functional time series with applications to clustering and testing [J] . van Delft Anne, Dette Holger Bernoulli: official journal of the Bernoulli Society for Mathematical Statistics and Probability . 2021,第1期

机译：非静止功能时间序列的二阶属性的相似性度量与群集和测试的应用
3. GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness [J] . Meng Liu, Paul D. Thomas BMC Bioinformatics . 2019,第1期

机译：Go功能相似性群集取决于相似度测量，聚类方法和注释完整性
4. Deterministic Record Linkage versus Similarity Functions: a Study in Health Databases from Brazil [C] . Katia Mitiko Firmino Suzuki, Carlos Humberto Porto Filho, Luis Ferno Cozin, MEDINFO . 2013

机译：确定性记录链接与相似性功能：巴西健康数据库的研究
5. A comparison of clustering procedures and similarity measures in creating clusters using warp functions. [D] . Elguindi, Anne Charlotte. 2010

机译：使用warp函数创建聚类时聚类过程和相似性度量的比较。
6. Efficient Record Linkage Algorithms Using Complete Linkage Clustering [O] . Abdullah-Al Mamun, Robert Aseltine, Sanguthevar Rajasekaran -1

机译：使用完整链接聚类的高效记录链接算法
7. Reclink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage Reclink: an application for database linkage implementing the probabilistic record linkage method [O] . Kenneth R. de Camargo Jr., Cláudia M. Coeli 2000

机译：Reclink：用于数据库关系的应用程序，实现概率记录链接方法Reclink：用于数据库链接的应用程序，实现概率记录链接方法
8. Utilisation du Clustering Conceptuel pour l'Apprentissage Automatique des Notionsde Fonction, Structure et Comportement en Conception (Using Conceptual Clustering to Learn About Function, Structure, and Behavior in Design) [R] . Alem, L., Maher, M. L. 1991

机译：利用集群概念设计和实现自我意识的功能，结构和竞争概念（使用概念聚类来了解设计中的功能，结构和行为）

Learnable similarity functions and their application to record linkage and clustering

摘要

著录项

相似文献

相关主题

期刊订阅