Journal: Concurrency and Computation: Practice and Experience

A data dependency based strategy for intermediate data storage in scientific cloud workflow systems*


Abstract

Many scientific workflows are data intensive: large volumes of intermediate data are generated during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, they are stored selectively, with the selection made manually according to the system's storage capacity. Now that doing science in the cloud has become popular, more intermediate data can be stored in scientific cloud workflows under a pay-for-use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and on that basis we develop a novel intermediate data storage strategy that can reduce the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider. The strategy achieves a cost-effective trade-off between computation cost and storage cost and is not strongly affected by inaccuracy in forecasting data sets' usage. It also takes users' tolerance of data access delay into consideration. We use Amazon's cost model and apply the strategy to general random workflows as well as specific astrophysics pulsar-searching scientific workflows for evaluation. The results show that our strategy can significantly reduce the overall cost of scientific cloud workflow execution.
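
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `DataSet` fields, the greedy `decide_storage` pass, and both price constants are assumptions made for illustration only. The idea shown is that a deleted data set can be regenerated by walking back through the dependency graph to the nearest stored ancestors, so a data set is kept only when storing it costs less per day than regenerating it at its forecast usage rate.

```python
from dataclasses import dataclass, field

# Hypothetical prices for illustration only; not taken from the paper.
STORAGE_PRICE_PER_GB_DAY = 0.005  # dollars per GB per day
COMPUTE_PRICE_PER_HOUR = 0.10     # dollars per compute hour


@dataclass
class DataSet:
    name: str
    size_gb: float
    gen_hours: float      # hours to generate from its direct predecessors
    uses_per_day: float   # forecast usage frequency
    parents: list = field(default_factory=list)
    stored: bool = False


def regeneration_hours(ds):
    """Hours needed to rebuild ds, including every deleted ancestor
    back to the nearest stored data sets in the dependency graph."""
    total = ds.gen_hours
    seen = set()
    stack = list(ds.parents)
    while stack:
        parent = stack.pop()
        if parent.stored or parent.name in seen:
            continue
        seen.add(parent.name)
        total += parent.gen_hours
        stack.extend(parent.parents)
    return total


def cost_rate_if_stored(ds):
    """Dollars per day to keep ds in storage."""
    return ds.size_gb * STORAGE_PRICE_PER_GB_DAY


def cost_rate_if_deleted(ds):
    """Expected dollars per day to regenerate ds on demand."""
    return regeneration_hours(ds) * COMPUTE_PRICE_PER_HOUR * ds.uses_per_day


def decide_storage(datasets):
    """Greedy pass over data sets given in generation order:
    store a data set only if storing is cheaper than regenerating."""
    for ds in datasets:
        ds.stored = cost_rate_if_stored(ds) < cost_rate_if_deleted(ds)
    return [ds.name for ds in datasets if ds.stored]


# A linear chain d1 -> d2 -> d3 with made-up sizes and usage rates.
d1 = DataSet("d1", size_gb=100, gen_hours=1.0, uses_per_day=0.01)
d2 = DataSet("d2", size_gb=1, gen_hours=10.0, uses_per_day=0.1, parents=[d1])
d3 = DataSet("d3", size_gb=50, gen_hours=0.5, uses_per_day=0.02, parents=[d2])
print(decide_storage([d1, d2, d3]))
```

In this made-up chain, the large but cheap-to-rebuild data sets are deleted, while the small, expensive, frequently used one is stored. The sketch ignores the users' tolerance of access delay, which the abstract says the actual strategy also considers.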

Bibliographic details

  • Author affiliations

    Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne, Vic. 3122, Australia


  • Original format: PDF
  • Language: English
  • Keywords

    data sets storage; cloud computing; scientific workflow;

