首页> 外文期刊>Knowledge and Data Engineering, IEEE Transactions on >Principled Graph Matching Algorithms for Integrating Multiple Data Sources
【24h】

Principled Graph Matching Algorithms for Integrating Multiple Data Sources

机译:集成多个数据源的原理图匹配算法

获取原文
获取原文并翻译 | 示例

摘要

This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. In the most common two-source case, it is often desirable for the final matching to be one-to-one; the database and statistical record linkage communities accomplish this by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores—that of being nearly deduped—and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our two-fold algorithmic contributions approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world entity resolution problem from Bing significantly larger than typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one being much more robust under noise, and the other being simple to implement and very fast to run.
机译:本文探讨了针对多部分图的最大权重图匹配问题的组合优化,该问题是在集成多个数据源时出现的。在最常见的两源情况下,通常希望最终匹配是一对一的。数据库和统计记录链接社区通过对相似性分数进行加权的二部图匹配来实现此目的。这种匹配在直观上很吸引人:它们利用了许多现实世界实体商店的自然全局属性(几乎已被重复数据删除),并且已知可以大大提高准确性和召回率。不幸的是,与二分情况不同,已知多分图上的精确最大权重匹配是NP难的。我们的双重算法贡献近似于多部分最大权重匹配:我们的第一个算法借用了贝叶斯概率推断所通用的优化技术;我们的第二个是贪婪近似算法。除了对后者的理论保证外,我们还对来自Bing的现实世界中的实体解决问题进行了比较,该问题明显大于文献中通常所发现的,关于出版物数据以及一系列综合问题的比较。我们的结果量化了由于开发多个源而产生的重大改进,这是通过全局一对一约束将原本独立的匹配子问题链接起来而实现的。我们还发现我们的算法是互补的:一种算法在噪声下更为健壮,另一种则易于实现且运行速度非常快。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号