Conference: IEEE International Conference on Big Data

Uncovering the evolution history of data lakes


Abstract

Data accumulating in data lakes can become inaccessible in the long run when its semantics are not available. The heterogeneity of data formats and the sheer volume of the data collections make it infeasible to clean and unify the data manually. Thus, tools for automated data lake analysis are of great interest. In this paper, we target the particular problem of reconstructing the schema evolution history from data lakes. Knowing how the data is structured, and how this structure has evolved over time, enables programmatic access to the lake. By deriving a sequence of schema versions, rather than a single schema, we take structural changes over time into account. Moreover, we address the challenge of detecting inclusion dependencies. This is a prerequisite for mapping between succeeding schema versions and, in particular, for detecting nontrivial changes such as a property having been moved or copied. We evaluate our approach for detecting inclusion dependencies using the MovieLens dataset, as well as an adaptation of a dataset containing botanical descriptions, to cover specific edge cases.
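
To illustrate what an inclusion dependency between two schema versions looks like, the following is a minimal sketch that assumes flat records with hashable values. It is a naive set-containment check and not the approach evaluated in the paper, which the abstract does not describe. The function names `property_values` and `unary_inclusion_dependencies` and the sample records are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

# Assumption: each entity is a flat mapping from property name to a hashable value.
Record = Dict[str, Hashable]


def property_values(records: Iterable[Record]) -> Dict[str, Set[Hashable]]:
    """Collect the set of distinct values observed for each property name."""
    values: Dict[str, Set[Hashable]] = {}
    for record in records:
        for name, value in record.items():
            values.setdefault(name, set()).add(value)
    return values


def unary_inclusion_dependencies(
    old_records: Iterable[Record], new_records: Iterable[Record]
) -> List[Tuple[str, str]]:
    """Return pairs (new_property, old_property) where every value of the new
    property also occurs in the old property (a unary inclusion dependency)."""
    old_values = property_values(old_records)
    new_values = property_values(new_records)
    found: List[Tuple[str, str]] = []
    for new_name, new_set in new_values.items():
        for old_name, old_set in old_values.items():
            # Naive check: subset test on the full value sets.
            if new_set and new_set <= old_set:
                found.append((new_name, old_name))
    return sorted(found)


# Hypothetical sample data: "title" appears to have been renamed to "name".
old_version = [{"id": 1, "title": "Heat"}, {"id": 2, "title": "Alien"}]
new_version = [{"id": 1, "name": "Heat"}, {"id": 2, "name": "Alien"}]
print(unary_inclusion_dependencies(old_version, new_version))
```

In this toy case the pair `("name", "title")` hints that a property was renamed or copied between versions. On real data, value sets that coincide by chance (for example, small integer ranges) would produce spurious candidates, so a check like this could only serve as a first filter.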
