Venue: International Conference on Very Large Data Bases
Less is More: Selecting Sources Wisely for Integration

Abstract

We are often thrilled by the abundance of information surrounding us and wish to integrate data from as many sources as possible. However, understanding, analyzing, and using these data are often hard. Too much data can introduce a huge integration cost, such as expenses for purchasing data and resources for integration and cleaning. Furthermore, including low-quality data can even deteriorate the quality of integration results instead of bringing the desired quality gain. Thus, "the more the better" does not always hold for data integration and often "less is more". In this paper, we study how to select a subset of sources before integration such that we can balance the quality of integrated data and integration cost. Inspired by the Marginalism principle in economic theory, we wish to integrate a new source only if its marginal gain, often a function of improved integration quality, is higher than the marginal cost, associated with data-purchase expense and integration resources. As a first step towards this goal, we focus on data fusion tasks, where the goal is to resolve conflicts from different sources. We propose a randomized solution for selecting sources for fusion and show empirically its effectiveness and scalability on both real-world data and synthetic data.
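The marginalism rule in the abstract (add a source only while its marginal gain exceeds its marginal cost) can be illustrated with a small sketch. This is not the randomized solution the paper proposes; it is a plain greedy loop written only to make the stopping rule concrete. The names `select_sources` and `coverage_gain`, the toy sources, and the use of item coverage as a stand-in for integration quality are all assumptions made for illustration. In particular, a coverage proxy cannot model the case where a low-quality source lowers fusion quality.

```python
from typing import Callable, Dict, FrozenSet, Set


def select_sources(
    costs: Dict[str, float],
    gain: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Greedily add the source with the largest (marginal gain - marginal cost),
    stopping once no remaining source has marginal gain above its cost.

    `gain` is a hypothetical quality function over a set of sources and is
    assumed to be expressed in the same units as the costs.
    """
    selected: Set[str] = set()
    remaining = set(costs)
    while remaining:
        current = gain(frozenset(selected))
        best, best_profit = None, 0.0
        for source in sorted(remaining):  # sorted for deterministic tie-breaking
            marginal_gain = gain(frozenset(selected | {source})) - current
            profit = marginal_gain - costs[source]
            if profit > best_profit:
                best, best_profit = source, profit
        if best is None:  # every remaining source costs at least what it adds
            break
        selected.add(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): which data items each source provides, and its cost.
coverage = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {4}, "D": {6}}
costs = {"A": 1.0, "B": 0.5, "C": 0.5, "D": 2.0}


def coverage_gain(sources: FrozenSet[str]) -> float:
    """Assumed quality proxy: number of distinct items covered."""
    covered: Set[int] = set()
    for source in sources:
        covered |= coverage[source]
    return float(len(covered))


chosen = select_sources(costs, coverage_gain)
print(sorted(chosen))
```

With these toy numbers the loop picks A first (gain 4, cost 1), then B (one new item for cost 0.5), and stops: C adds nothing new, and D adds one item at a cost of 2. The result uses fewer sources than are available, which is the "less is more" effect in miniature.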