Towards optimizing the execution of Spark scientific workflows using machine learning-based parameter tuning

de Oliveira Douglas; Porto Fabio; Boeres Cristina; de Oliveira Daniel

Abstract

In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute- and data-intensive workflows in distinct areas like biology and astronomy. Although Spark is an easy-to-install framework, it has more than one hundred parameters to be set, in addition to the domain-specific parameters of each workflow. Thus, to execute Spark-based workflows efficiently, the user has to fine-tune a myriad of Spark and workflow parameters (e.g., partitioning strategy, the average size of a DNA sequence). This configuration task cannot be performed manually in a trial-and-error manner, since it is tedious and error-prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (i.e., decision trees) and then extracting useful rules (i.e., patterns) from these models, which non-expert users can apply to configure the parameters of future executions of the workflow and of Spark. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach reduced the number of parameters to be configured by identifying, in the predictive model, the domain-specific parameters most relevant to workflow performance.
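The idea in the abstract (fit a decision tree on past executions, then read rules off its branches) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the run records, the parameter names (`num_partitions`, `executor_memory_gb`, `avg_seq_len`), and the `fast`/`slow` labels are all hypothetical, and a simple Gini-based tree stands in for whatever learner the article actually uses.

```python
from collections import Counter

# Hypothetical past executions: (parameter values, observed performance label).
# All names and numbers here are made up for illustration.
RUNS = [
    ({"num_partitions": 8,   "executor_memory_gb": 2, "avg_seq_len": 500},  "slow"),
    ({"num_partitions": 8,   "executor_memory_gb": 8, "avg_seq_len": 500},  "slow"),
    ({"num_partitions": 16,  "executor_memory_gb": 4, "avg_seq_len": 1500}, "slow"),
    ({"num_partitions": 64,  "executor_memory_gb": 2, "avg_seq_len": 500},  "slow"),
    ({"num_partitions": 64,  "executor_memory_gb": 8, "avg_seq_len": 500},  "fast"),
    ({"num_partitions": 128, "executor_memory_gb": 8, "avg_seq_len": 1500}, "fast"),
    ({"num_partitions": 128, "executor_memory_gb": 4, "avg_seq_len": 500},  "fast"),
    ({"num_partitions": 64,  "executor_memory_gb": 4, "avg_seq_len": 1500}, "fast"),
]

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for feat in rows[0][0]:
        values = sorted({x[feat] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [y for x, y in rows if x[feat] <= thr]
            right = [y for x, y in rows if x[feat] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (feat, thr), score
    return best

def build_tree(rows, depth=0, max_depth=3):
    """Build a tree of nested (feature, threshold, left, right) tuples; leaves are labels."""
    split = best_split(rows) if depth < max_depth else None
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]  # majority label
    feat, thr = split
    left = [r for r in rows if r[0][feat] <= thr]
    right = [r for r in rows if r[0][feat] > thr]
    return (feat, thr,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def extract_rules(tree, path=()):
    """Turn every root-to-leaf path into a (conditions, predicted label) rule."""
    if isinstance(tree, str):
        return [(list(path), tree)]
    feat, thr, left, right = tree
    return (extract_rules(left, path + (f"{feat} <= {thr}",))
            + extract_rules(right, path + (f"{feat} > {thr}",)))

def used_features(tree):
    """Collect the parameters the tree actually splits on."""
    if isinstance(tree, str):
        return set()
    feat, _, left, right = tree
    return {feat} | used_features(left) | used_features(right)

def predict(tree, params):
    """Follow the tree's branches for one parameter setting."""
    while not isinstance(tree, str):
        feat, thr, left, right = tree
        tree = left if params[feat] <= thr else right
    return tree

tree = build_tree(RUNS)
rules = extract_rules(tree)
for conditions, label in rules:
    print("IF " + " AND ".join(conditions) + " THEN " + label)
print("parameters worth tuning:", sorted(used_features(tree)))
```

Each printed rule is a human-readable condition on parameter values, which is what makes a decision tree attractive for non-expert users. Parameters that never appear in a split (as reported by the hypothetical helper `used_features`) are candidates to leave at their defaults, mirroring the abstract's point about reducing the number of parameters to configure.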

Bibliographic details

Source
Concurrency and Computation: Practice and Experience | 2021, Issue 5 | e5972.1-e5972.35 | 35 pages
Authors
de Oliveira Douglas; Porto Fabio; Boeres Cristina; de Oliveira Daniel
Author affiliations

Univ Fed Fluminense Inst Comp Niteroi RJ Brazil|DexlLab Lab Nacl Comp Cient Data Extreme Lab Petropolis RJ Brazil;

DexlLab Lab Nacl Comp Cient Data Extreme Lab Petropolis RJ Brazil;

Univ Fed Fluminense Inst Comp Niteroi RJ Brazil;

Univ Fed Fluminense Inst Comp Niteroi RJ Brazil|Univ Fed Fluminense UFFeSci Virtual Lab eSci Niteroi RJ Brazil;

Original format: PDF
Language: English
Keywords
Apache Spark; machine learning; scientific workflows; Spark parameter tuning

Similar literature

1. Provenance- and machine learning-based recommendation of parameter values in scientific workflows [J]. Daniel Silva Junior, Esther Pacitti, Aline Paes. PeerJ Computer Science, 2021, Issue a
2. Using imbalance metrics to optimize task clustering in scientific workflow executions [J]. Weiwei Chen, Rafael Ferreira da Silva, Ewa Deelman. Future Generation Computer Systems, 2015, Issue May
3. Optimizing execution time predictions of scientific workflow applications in the Grid through evolutionary programming [J]. Farrukh Nadeem, Thomas Fahringer. Future Generation Computer Systems, 2013, Issue 4
4. TARDIS: Optimal Execution of Scientific Workflows in Apache Spark [C]. Daniel Gaspar, Fabio Porto, Reza Akbarinia. International Conference on Big Data Analytics and Knowledge Discovery, 2017
5. Efficient Execution of Scientific Workflows on Batch-Scheduled Clusters [D]. Hataishi, Evan. 2020
6. Provenance- and machine learning-based recommendation of parameter values in scientific workflows [O]. Daniel Silva Junior, Esther Pacitti, Aline Paes. 2021
7. TARDIS: Optimal Execution of Scientific Workflows in Apache Spark [O]. Gaspar, Daniel; Porto, Fabio; Akbarinia, Reza. 2017
8. ScyFlow: An Environment for the Visual Specification and Execution of Scientific Workflows [R]. McCann, Karen M.; Yarrow, Maurice; DeVivo, Adrian. 2004
