Journal: Distributed and Parallel Databases

On-demand big data integration: A hybrid ETL approach for reproducible scientific research

Abstract

Scientific research requires access to, analysis of, and sharing of data that is distributed across various heterogeneous data sources at Internet scale. An eager extract, transform, and load (ETL) process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. Bootstrapping this process is inefficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process eagerly loads only the metadata, which makes it faster to bootstrap. However, queries on the integrated data repository of eager ETL run faster, because the entire data set is available beforehand. In this paper, we propose a novel ETL approach for scientific data integration: a hybrid of the eager and lazy ETL approaches, applied to both data and metadata. Hybrid ETL thus supports incremental integration and loading of metadata and data from the data sources. We enhance hybrid ETL with a human-in-the-loop approach, in which user queries drive selective data integration and users share integrated data with one another. We implement our hybrid ETL approach in a prototype platform, Óbidos, and evaluate it in the context of data sharing for medical research. Óbidos outperforms both the eager and lazy ETL approaches for scientific research data integration and sharing through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.
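The selective, query-driven loading described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' prototype: the `Source` and `HybridRepository` classes, the notion of a "collection" as the unit a user selects, and all field names are assumptions made for illustration. The sketch loads metadata only for the collection a query names, loads payloads only for records that match, and keeps both in a shared repository so later queries (from any user) reuse them.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical remote source: collection -> record id -> (metadata, payload)."""
    records: dict
    payload_reads: int = 0  # counts expensive payload fetches

    def list_ids(self, collection):
        return list(self.records[collection])

    def fetch_metadata(self, collection, rid):
        return self.records[collection][rid][0]

    def fetch_payload(self, collection, rid):
        self.payload_reads += 1
        return self.records[collection][rid][1]


@dataclass
class HybridRepository:
    """Illustrative integrated repository shared by all users."""
    source: Source
    metadata: dict = field(default_factory=dict)  # record id -> metadata
    data: dict = field(default_factory=dict)      # record id -> payload

    def query(self, collection, predicate):
        """Return payloads of records in `collection` whose metadata matches."""
        results = {}
        for rid in self.source.list_ids(collection):
            # Metadata is loaded incrementally, only for the queried collection.
            if rid not in self.metadata:
                self.metadata[rid] = self.source.fetch_metadata(collection, rid)
            if not predicate(self.metadata[rid]):
                continue
            # Payloads are loaded on demand and kept for later queries.
            if rid not in self.data:
                self.data[rid] = self.source.fetch_payload(collection, rid)
            results[rid] = self.data[rid]
        return results


# Made-up example data: two collections of imaging records.
source = Source({
    "brain": {"b1": ({"modality": "MRI"}, "b1-pixels"),
              "b2": ({"modality": "CT"}, "b2-pixels")},
    "lung": {"l1": ({"modality": "CT"}, "l1-pixels")},
})
repo = HybridRepository(source)
first = repo.query("brain", lambda m: m["modality"] == "MRI")
second = repo.query("brain", lambda m: m["modality"] == "MRI")  # e.g. another user
```

In this sketch, the second query is answered from the repository without touching the source again, and the `lung` collection is never loaded because no query asked for it. An eager process would instead have copied all three payloads up front, and a lazy one would have loaded the metadata of all three records before the first query.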
