Journal: Complex Systems Informatics and Modeling Quarterly

Metadata Extraction and Management in Data Lakes With GEMMS


Abstract

In addition to volume and velocity, big data is characterized by its variety. Variety in structure and semantics calls for new integration approaches that can also cope with large volumes of data. Data lakes are intended to reduce upfront integration costs and to provide a more flexible way of integrating and analyzing data, because source data is loaded into the data lake repository in its original structure. Some syntactic transformation may be applied to make the data accessible in one common repository; deep semantic integration, however, is performed only after the data has been loaded into the data lake. The data is thus readily available and can be restructured, aggregated, and transformed as later applications require. Metadata management is a crucial component of a data lake, because the source data must be described by metadata that captures its semantics. We developed a Generic and Extensible Metadata Management System for data lakes (GEMMS) that aims to extract metadata automatically from a wide variety of data sources. The metadata is managed in an extensible metamodel that distinguishes structural from semantic metadata. The use case applied for evaluation comes from the life science domain, where data is often stored only in files, which hinders data access and efficient querying. The GEMMS framework proved useful in this domain. The extensibility and flexibility of the framework are especially important, because data and metadata structures in scientific experiments cannot be defined a priori.
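
The two ideas in the abstract, pluggable extraction per source format and a metadata record that keeps structural and semantic metadata apart, can be illustrated with a minimal sketch. This is not the GEMMS implementation or its API: the names MetadataRecord, EXTRACTORS, register, extract, and annotate, the choice of CSV as the example format, and the example.org term are all assumptions made for illustration.

```python
import csv
import io
import os
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class MetadataRecord:
    """Hypothetical metadata record for one source file."""
    source: str
    # Structural metadata: how the data is laid out (columns, row count, ...).
    structural: Dict[str, object] = field(default_factory=dict)
    # Semantic metadata: what an element means (element name -> term).
    semantic: Dict[str, str] = field(default_factory=dict)


# Hypothetical registry of extractors keyed by file extension.
# Supporting a new format means registering one more function here.
EXTRACTORS: Dict[str, Callable[[str, str], MetadataRecord]] = {}


def register(extension: str):
    """Decorator that registers an extractor for a file extension."""
    def wrap(func):
        EXTRACTORS[extension] = func
        return func
    return wrap


@register(".csv")
def extract_csv(source: str, text: str) -> MetadataRecord:
    """Derive structural metadata from the header and rows of CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0] if rows else []
    return MetadataRecord(
        source=source,
        structural={"columns": header, "row_count": max(len(rows) - 1, 0)},
    )


def extract(source: str, text: str) -> MetadataRecord:
    """Pick an extractor by file extension and run it on the file content."""
    extension = os.path.splitext(source)[1].lower()
    if extension not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {extension!r}")
    return EXTRACTORS[extension](source, text)


def annotate(record: MetadataRecord, element: str, term: str) -> None:
    """Attach a semantic term to a structural element after loading."""
    record.semantic[element] = term


if __name__ == "__main__":
    record = extract("assay.csv", "gene,expression\nBRCA1,2.4\nTP53,1.1\n")
    # The term URL is a placeholder, not a real ontology identifier.
    annotate(record, "gene", "http://example.org/ontology/Gene")
    print(record)
```

In this sketch the structural part is filled at load time, while semantic annotations are added later, mirroring the abstract's point that deep semantic integration happens only after the data is in the lake.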
