Conference: ACM SIGMOD International Conference on Management of Data

Schema Clustering and Retrieval for Multi-domain Pay-As-You-Go Data Integration Systems

Abstract

A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve integrating large numbers of structured data sources. At web scale, manual or semi-automatic data integration methods are impractical, so a pay-as-you-go approach is more appropriate. Such an approach uses a fully automatic, approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema and initial mappings from source schemas to the mediated schema), and then refines the system as it is used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention, based only on the names of attributes in the schemas. Our clustering approach uses a probabilistic model to handle uncertainty in assigning schemas to domains. We also propose a query classifier that determines, for a given keyword query, the domains most relevant to that query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.
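
The abstract does not spell out the clustering algorithm, the probabilistic model, or the query classifier, so the following Python sketch is only an illustration of the general idea and not the authors' method. It assumes Jaccard similarity over attribute-name tokens, a greedy single-link clustering pass, a soft membership score obtained by normalizing mean similarities, and a keyword-overlap ranking for queries. All function names, the `threshold` value, and the sample schemas are hypothetical.

```python
import re

def tokens(attribute_names):
    """Lower-cased word tokens drawn from a list of attribute names."""
    toks = set()
    for name in attribute_names:
        toks.update(t for t in re.split(r"[^a-z0-9]+", name.lower()) if t)
    return toks

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_schemas(schemas, threshold=0.2):
    """Greedy single-link clustering (an assumed stand-in technique).

    Each schema joins the existing cluster holding its most similar member,
    provided that similarity reaches `threshold`; otherwise it starts a new
    cluster. Returns (clusters, token_sets); clusters hold schema indices.
    """
    token_sets = [tokens(s) for s in schemas]
    clusters = []
    for i, ts in enumerate(token_sets):
        best, best_sim = None, threshold
        for c in clusters:
            sim = max(jaccard(ts, token_sets[j]) for j in c)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters, token_sets

def membership(i, clusters, token_sets):
    """Soft assignment of schema i: normalized mean similarity per cluster."""
    scores = [
        sum(jaccard(token_sets[i], token_sets[j]) for j in c) / len(c)
        for c in clusters
    ]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(clusters)] * len(clusters)
    return [s / total for s in scores]

def classify_query(query, clusters, token_sets):
    """Rank cluster indices by the fraction of query keywords they cover."""
    q = tokens([query])
    scored = []
    for k, c in enumerate(clusters):
        vocab = set().union(*(token_sets[j] for j in c))
        score = len(q & vocab) / len(q) if q else 0.0
        if score > 0:
            scored.append((score, k))
    return [k for score, k in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical sample schemas: two about books, two about cars.
schemas = [
    ["title", "author", "publisher", "isbn"],
    ["book_title", "author_name", "isbn"],
    ["make", "model", "year", "price"],
    ["car_make", "model", "mileage", "price"],
]
clusters, token_sets = cluster_schemas(schemas)
```

On these sample schemas the book schemas and the car schemas end up in separate clusters, and `classify_query("author isbn", clusters, token_sets)` ranks the book cluster first. A real system would likely need a more robust similarity measure and a principled probabilistic model, since attribute names on the web are noisy and a schema may plausibly belong to more than one domain.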
