Probabilistic Iterative Duplicate Detection

Abstract

The problem of identifying approximately duplicate records between databases is known, among other names, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires a deep understanding of the application domain or, for supervised learning approaches, a good, representative training set. In this paper we present an unsupervised, domain-independent approach that starts with a broad alignment of potential duplicates and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.
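The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea (broad initial alignment, then iterative refinement from the observed distance distributions), not the authors' method. It assumes that each candidate pair is described by a vector of per-attribute distances in [0, 1], that the distances in each class can be modelled by independent Gaussians, and that the initial alignment is a simple threshold on the mean distance. The names `refine_alignment`, `fit_gaussians`, and `initial_threshold` are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of a 1-D Gaussian (assumed distance model, not from the paper)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gaussians(vectors, min_var=1e-3):
    """Estimate a (mean, variance) pair per attribute; variance is floored at min_var."""
    n = len(vectors)
    params = []
    for j in range(len(vectors[0])):
        column = [v[j] for v in vectors]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        params.append((mean, max(var, min_var)))
    return params


def log_score(vector, params, prior):
    """Log prior plus the sum of per-attribute log-likelihoods."""
    return math.log(prior) + sum(
        gaussian_logpdf(x, m, v) for x, (m, v) in zip(vector, params)
    )


def refine_alignment(distances, initial_threshold=0.5, max_iter=20):
    """Return one boolean per candidate pair: True if labelled a duplicate."""
    # Broad initial alignment: small mean distance means "potential duplicate".
    labels = [sum(d) / len(d) < initial_threshold for d in distances]
    for _ in range(max_iter):
        dups = [d for d, is_dup in zip(distances, labels) if is_dup]
        non_dups = [d for d, is_dup in zip(distances, labels) if not is_dup]
        if not dups or not non_dups:
            break  # one class is empty, so nothing can be estimated
        dup_params = fit_gaussians(dups)
        non_params = fit_gaussians(non_dups)
        prior = len(dups) / len(distances)
        # Relabel every pair by whichever class explains its distances better.
        new_labels = [
            log_score(d, dup_params, prior) > log_score(d, non_params, 1 - prior)
            for d in distances
        ]
        if new_labels == labels:
            break  # alignment is stable
        labels = new_labels
    return labels


if __name__ == "__main__":
    pairs = [
        [0.05, 0.10], [0.10, 0.05], [0.00, 0.15], [0.10, 0.10],
        [0.90, 0.80], [0.85, 0.95], [0.70, 0.90], [0.95, 0.90],
    ]
    print(refine_alignment(pairs))
```

In this sketch no domain-specific weights or thresholds are tuned by hand beyond the deliberately broad starting threshold; the class distributions are re-estimated from the current labelling in each round until the labelling stops changing.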
