Concurrency and Computation: Practice and Experience
Towards optimizing the execution of Spark scientific workflows using machine learning-based parameter tuning

Abstract

In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute- and data-intensive workflows in distinct areas such as biology and astronomy. Although Spark is easy to install, it has more than one hundred parameters to be set, in addition to the domain-specific parameters of each workflow. To execute Spark-based workflows efficiently, the user therefore has to fine-tune a myriad of Spark and workflow parameters (e.g., the partitioning strategy or the average size of a DNA sequence). This configuration task cannot practically be performed manually by trial and error, since doing so is tedious and error-prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (i.e., decision trees) and then extracting useful rules (i.e., patterns) from these models. Those rules can be applied to configure the parameters of future executions of the workflow and of Spark on behalf of non-expert users. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach reduced the number of parameters to be configured by identifying, in the predictive model, the domain-specific parameters most relevant to workflow performance.
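
The idea of fitting a decision tree to past executions and reading configuration rules off its branches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which tree algorithm, features, or labels the authors use, so the feature names (`num_partitions`, `executor_memory_gb`, `avg_sequence_length`), the "fast"/"slow" labels, and the run data are hypothetical, and the tree is a small hand-written Gini-based learner rather than the authors' implementation.

```python
from collections import Counter

# Hypothetical parameters and invented run outcomes, for illustration only.
# These are NOT measurements or parameter names taken from the article.
FEATURES = ["num_partitions", "executor_memory_gb", "avg_sequence_length"]
RUNS = [
    ((8, 2, 500), "slow"),
    ((8, 4, 500), "slow"),
    ((16, 4, 500), "slow"),
    ((16, 8, 1500), "slow"),
    ((64, 2, 500), "fast"),
    ((64, 4, 1500), "fast"),
    ((128, 4, 500), "fast"),
    ((128, 8, 1500), "fast"),
]


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return the (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for f in range(len(FEATURES)):
        values = sorted({x[f] for x, _ in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, depth=0, max_depth=3):
    """Internal nodes are (feature, threshold, left, right); leaves are label strings."""
    split = best_split(rows) if depth < max_depth else None
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))


def extract_rules(node, conditions=()):
    """Turn every root-to-leaf path into a (conditions, predicted_label) rule."""
    if isinstance(node, str):
        return [(list(conditions), node)]
    f, t, left, right = node
    name = FEATURES[f]
    return (extract_rules(left, conditions + (f"{name} <= {t}",))
            + extract_rules(right, conditions + (f"{name} > {t}",)))


def predict(node, x):
    """Follow the tree from the root to a leaf for one configuration."""
    while not isinstance(node, str):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


tree = build_tree(RUNS)
rules = extract_rules(tree)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + " THEN " + label)
```

On this invented data the learned tree tests only `num_partitions`, so the extracted rules mention a single parameter. That mirrors, in toy form, how an interpretable model can narrow down which parameters a non-expert user needs to set; it says nothing about which parameters mattered in the article's experiments.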
