Published in: Proceedings of 13th Workshop on Workflows in Support of Large-Scale Science

A Practical Roadmap for Provenance Capture and Data Analysis in Spark-Based Scientific Workflows

Abstract

Whenever high-performance computing applications meet data-intensive scalable systems, an attractive approach is the use of Apache Spark for the management of scientific workflows. Spark provides several advantages such as being widely supported and granting efficient in-memory data management for large-scale applications. However, Spark still lacks support for data tracking and workflow provenance. Additionally, Spark's memory management requires accessing all data movements between the workflow activities. Therefore, the running of legacy programs on Spark is interpreted as a "black-box" activity, which prevents the capture and analysis of implicit data movements. Here, we present SAMbA, an Apache Spark extension for the gathering of prospective and retrospective provenance and domain data within distributed scientific workflows. Our approach relies on enveloping both RDD structure and data contents at runtime so that (i) RDD-enclosure consumed and produced data are captured and registered by SAMbA in a structured way, and (ii) provenance data can be queried during and after the execution of scientific workflows. By following the W3C PROV representation, we model the roles of RDD regarding prospective and retrospective provenance data. Our solution provides mechanisms for the capture and storage of provenance data without jeopardizing Spark's performance. The provenance retrieval capabilities of our proposal are evaluated in a practical case study, in which data analytics are provided by several SAMbA parameterizations.
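
The idea of enveloping a dataset so that consumed and produced data are registered at runtime can be illustrated with a small sketch. This is not SAMbA's API and does not use Spark: `ProvenanceLog`, `TrackedDataset`, and the activity names are hypothetical, and a plain Python list stands in for an RDD. Each element is treated as a PROV-style entity, and each transformation records which input entity a produced entity was derived from.

```python
import itertools


class ProvenanceLog:
    """Hypothetical in-memory store of retrospective provenance."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.entities = {}      # entity id -> value
        self.derivations = []   # (generated id, activity name, used id)

    def new_entity(self, value):
        eid = next(self._ids)
        self.entities[eid] = value
        return eid

    def register(self, activity, used_id, generated_id):
        # PROV-style reading: generated entity wasGeneratedBy activity,
        # and that activity used the input entity.
        self.derivations.append((generated_id, activity, used_id))

    def lineage(self, eid):
        """Return (activity, input value) pairs from newest to oldest."""
        parents = {g: (a, u) for g, a, u in self.derivations}
        trail = []
        while eid in parents:
            activity, used_id = parents[eid]
            trail.append((activity, self.entities[used_id]))
            eid = used_id
        return trail


class TrackedDataset:
    """Envelope around a list (a stand-in for an RDD, not a real one)."""

    def __init__(self, log, values=None, _items=None):
        self.log = log
        if _items is not None:
            self.items = _items
        else:
            self.items = [(log.new_entity(v), v) for v in values]

    def map(self, activity, fn):
        out = []
        for used_id, value in self.items:
            result = fn(value)
            generated_id = self.log.new_entity(result)
            self.log.register(activity, used_id, generated_id)
            out.append((generated_id, result))
        return TrackedDataset(self.log, _items=out)

    def collect(self):
        return [value for _, value in self.items]


log = ProvenanceLog()
raw = TrackedDataset(log, values=[1, 2, 3])
scaled = raw.map("scale", lambda x: x * 10)
shifted = scaled.map("shift", lambda x: x + 1)

# Trace the last produced element back through the activities that made it.
last_id = shifted.items[-1][0]
trace = log.lineage(last_id)
```

Because the log is filled as each transformation runs, `log.lineage` can be queried while later steps are still pending as well as after the whole chain has finished, which mirrors the "during and after execution" querying the abstract describes.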