Leveraging external user-generated information for large-scale data integration.

Abstract

The proliferation of data sources in both the private and public domains (e.g., in enterprise environments and on the World Wide Web) underscores the need for data integration systems. The purpose of a data integration system is to enable users to access data residing in multiple heterogeneous sources through a uniform interface. Manual solutions for building such systems are not a viable option, especially when dealing with large-scale and complex applications.

This dissertation studies the automation of building data integration systems. In particular, it addresses three key challenges that lie at the heart of any such system.

The first challenge relates to the construction of wrappers for unstructured sources. A source wrapper ensures that the data in the underlying source is perceived as structured data by the other parts of the system. We focus in particular on sources containing data formatted as lists, and propose a new solution for extracting relational tables from them. The proposed solution is completely unsupervised and domain-independent. It leverages various sources of information, including a corpus of tens of millions of relational tables published by users on the Web.

The second and third challenges are concerned with establishing semantic mappings across data sources. We first propose a new solution for discovering the correspondences between the elements of two schemas. Then, based on these simple correspondences, we propose another solution for discovering more complex declarative mapping rules that can actually be used to transform data and queries across the two schemas. The key underpinning of these two solutions is that, unlike previous approaches, they both exploit the usage information extracted from database query logs. This work is the first to introduce the usage-based approach for establishing mappings across data sources.

To evaluate our approaches, we conducted experiments using realistic data sets: real web lists for the wrapper construction work, and schemas and query logs from the retail and life sciences domains for the work on semantic mappings. The experimental results verify the effectiveness and applicability of our proposed approaches.

Bibliographic record

  • Author: Elmeleegy, Hazem
  • Author affiliation: Purdue University
  • Degree-granting institution: Purdue University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 154
  • Original format: PDF
  • Language of text: English