Principled Graph Matching Algorithms for Integrating Multiple Data Sources

Zhang Duo; Rubinstein Benjamin I. P.; Gemmell Jim

Abstract

This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. In the most common two-source case, it is often desirable for the final matching to be one-to-one; the database and statistical record linkage communities accomplish this by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores—that of being nearly deduped—and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our two-fold algorithmic contributions approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world entity resolution problem from Bing significantly larger than typically found in the literature, on publication data, and on a series of synthetic problems. Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one being much more robust under noise, and the other being simple to implement and very fast to run.
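
To make the one-to-one constraint across several sources concrete, here is a minimal sketch of a generic greedy heuristic. It is not the authors' algorithm, whose details the abstract does not give. The function name `greedy_multipartite_match`, the `(source, id)` record encoding, the `threshold` parameter, and the toy scores are all assumptions made for illustration. The sketch visits candidate pairs in order of decreasing similarity and merges two clusters only if the result would still hold at most one record per source.

```python
def greedy_multipartite_match(scores, threshold=0.0):
    """Greedy clustering under a one-record-per-source constraint.

    `scores` maps a pair of records to a similarity weight, where each
    record is a hypothetical (source, id) tuple. Illustrative only.
    """
    cluster_of = {}  # record -> id of the cluster it currently belongs to
    members = {}     # cluster id -> set of records in that cluster

    # Start with every record in its own singleton cluster.
    for pair in scores:
        for record in pair:
            cluster_of.setdefault(record, record)
            members.setdefault(record, {record})

    # Visit candidate pairs from the highest weight to the lowest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), weight in ranked:
        if weight <= threshold:
            break  # all remaining pairs are too weak to match
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # already in the same cluster
        sources_a = {source for source, _ in members[ca]}
        sources_b = {source for source, _ in members[cb]}
        if sources_a & sources_b:
            continue  # merging would pair two records from one source
        for record in members[cb]:
            cluster_of[record] = ca
        members[ca] |= members.pop(cb)

    return sorted(sorted(group) for group in members.values())


# Toy similarity scores over three hypothetical sources A, B and C.
toy_scores = {
    (("A", 1), ("B", 1)): 0.9,
    (("B", 1), ("C", 1)): 0.8,
    (("A", 1), ("C", 2)): 0.7,
    (("A", 2), ("B", 1)): 0.6,
    (("A", 2), ("C", 2)): 0.5,
}

for group in greedy_multipartite_match(toy_scores):
    print(group)
```

In this toy input the pairs scored 0.7 and 0.6 are rejected: accepting either would place two records from the same source in one cluster. This is how the global constraint couples the otherwise independent pairwise matching problems.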

Bibliographic details

Source
Knowledge and Data Engineering, IEEE Transactions on | 2015, No. 10 | pp. 2784-2796 | 13 pages
Authors
Zhang Duo; Rubinstein Benjamin I. P.; Gemmell Jim
Author affiliation

Twitter, 1355 Market Street, Suite 900, San Francisco, CA, USA;

Indexing information
Original format: PDF
Language: English
Keywords
Data integration; message-passing algorithms; weighted graph matching;

Similar literature

1. Efficient algorithms for fast integration on large data sets from multiple sources [J]. Tian Mi, Sanguthevar Rajasekaran, Robert Aseltine. BMC Medical Informatics and Decision Making. 2012, No. 1
2. Hierarchical travel demand estimation using multiple data sources: A forward and backward propagation algorithmic framework on a layered computational graph [J]. Wu Xin, Guo Jifu, Xian Kai. Transportation research. 2018, No. NOVa
3. A Comparison of Stochastic Data-Integration Algorithms for the Joint History Matching of Production and Time-Lapse-Seismic Data [J]. L. Jin, F.O. Alpak, P. van den Hoek. SPE Reservoir Evaluation & Engineering. 2012, No. 4
4. Integrating multiple knowledge sources using genetic algorithm applied to hierarchically structured sensor data [C]. Sawaragi T., Umemura J., Katai O. Telecommunications in Modern Satellite, Cable and Broadcasting Services, 1999. 1999
5. Data integration for biological network databases: MetNetDB labeled graph model and graph matching algorithm [D]. Li, Jie. 2008
6. Efficient algorithms for fast integration on large data sets from multiple sources [O]. Tian Mi, Sanguthevar Rajasekaran, Robert Aseltine. 2012
7. Principled Graph Matching Algorithms for Integrating Multiple Data Sources [O]. Zhang, Duo, Rubinstein, Benjamin I. P., Gemmell, Jim. 2014
8. Integrating Multiple Sources of Knowledge into Designer-Soar, an Automatic Algorithm Designer [R]. Steier, D., Newell, A. 1988
