A Practical Roadmap for Provenance Capture and Data Analysis in Spark-Based Scientific Workflows

Abstract

Whenever high-performance computing applications meet data-intensive scalable systems, an attractive approach is the use of Apache Spark for the management of scientific workflows. Spark provides several advantages such as being widely supported and granting efficient in-memory data management for large-scale applications. However, Spark still lacks support for data tracking and workflow provenance. Additionally, Spark's memory management requires accessing all data movements between the workflow activities. Therefore, the running of legacy programs on Spark is interpreted as a "black-box" activity, which prevents the capture and analysis of implicit data movements. Here, we present SAMbA, an Apache Spark extension for the gathering of prospective and retrospective provenance and domain data within distributed scientific workflows. Our approach relies on enveloping both RDD structure and data contents at runtime so that (i) RDD-enclosure consumed and produced data are captured and registered by SAMbA in a structured way, and (ii) provenance data can be queried during and after the execution of scientific workflows. By following the W3C PROV representation, we model the roles of RDD regarding prospective and retrospective provenance data. Our solution provides mechanisms for the capture and storage of provenance data without jeopardizing Spark's performance. The provenance retrieval capabilities of our proposal are evaluated in a practical case study, in which data analytics are provided by several SAMbA parameterizations.
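
The enveloping idea described in the abstract (wrapping a dataset so that every transformation registers what it consumed and produced) can be illustrated with a toy sketch. This is not SAMbA's API and does not use Spark: `TrackedCollection`, `ProvenanceLog`, and `trace_back` are hypothetical names introduced only for illustration, and the record fields `used` and `generated` only loosely echo W3C PROV vocabulary. Lineage is matched by element value, a simplification that a real system would replace with stable element identifiers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory provenance store (not SAMbA's storage layer)."""
    activities: List[Dict[str, Any]] = field(default_factory=list)   # prospective: declared steps
    derivations: List[Dict[str, Any]] = field(default_factory=list)  # retrospective: element lineage


class TrackedCollection:
    """Toy stand-in for an RDD whose transformations are enveloped with logging."""

    def __init__(self, items: Iterable[Any], log: ProvenanceLog, source: str = "input"):
        self.items = list(items)
        self.log = log
        self.source = source  # name of the activity that produced this collection

    def _declare(self, name: str, kind: str) -> None:
        # Prospective provenance: which activity runs and on which input.
        self.log.activities.append({"activity": name, "kind": kind, "input": self.source})

    def map(self, fn: Callable[[Any], Any], name: str) -> "TrackedCollection":
        self._declare(name, "map")
        out = []
        for x in self.items:
            y = fn(x)
            # Retrospective provenance: one record per consumed/produced pair.
            self.log.derivations.append({"activity": name, "used": x, "generated": y})
            out.append(y)
        return TrackedCollection(out, self.log, source=name)

    def filter(self, pred: Callable[[Any], bool], name: str) -> "TrackedCollection":
        self._declare(name, "filter")
        out = []
        for x in self.items:
            if pred(x):
                self.log.derivations.append({"activity": name, "used": x, "generated": x})
                out.append(x)
        return TrackedCollection(out, self.log, source=name)


def trace_back(log: ProvenanceLog, value: Any) -> List[Tuple[str, Any, Any]]:
    """Return the (activity, used, generated) steps that led to `value`, oldest first."""
    steps = []
    current = value
    for d in reversed(log.derivations):
        if d["generated"] == current:
            steps.append((d["activity"], d["used"], d["generated"]))
            current = d["used"]
    return list(reversed(steps))


log = ProvenanceLog()
result = (
    TrackedCollection([1, 2, 3, 4], log)
    .map(lambda x: x * 10, name="scale")
    .filter(lambda x: x >= 30, name="threshold")
)
print(result.items)
print(trace_back(log, 40))
```

Because the log is filled while the transformations run, it can be inspected both during and after execution, which mirrors the querying requirement stated in the abstract.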

Bibliographic record

Source: Proceedings of the 13th Workshop on Workflows in Support of Large-Scale Science, 2018, pp. 31-41 (11 pages)
Conference location: Dallas (US)
Authors: Thaylon Guedes; Vítor Silva; Marta Mattoso; Marcos V. N. Bedo; Daniel de Oliveira
Original format: PDF
Language: English
Keywords: Sparks; Cluster computing; Data models; Runtime; W3C; Unified modeling language; Data analysis
