
On-demand minimum cost benchmarking for intermediate dataset storage in scientific cloud workflow systems

Abstract

Many scientific workflows are data intensive: large volumes of intermediate datasets are generated during their execution. Some valuable intermediate datasets need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, determined manually. As doing science on clouds has become popular nowadays, more intermediate datasets in scientific cloud workflows can be stored by different storage strategies based on a pay-as-you-go model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenances in scientific workflows. With the IDG, deleted intermediate datasets can be regenerated, and as such we develop a novel algorithm that can find a minimum cost storage strategy for the intermediate datasets in scientific cloud workflow systems. The strategy achieves the best trade-off of computation cost and storage cost by automatically storing the most appropriate intermediate datasets in the cloud storage. This strategy can be utilised on demand as a minimum cost benchmark for all other intermediate dataset storage strategies in the cloud. We utilise Amazon clouds' cost model and apply the algorithm to general random as well as specific astrophysics pulsar searching scientific workflows for evaluation. The results show that benchmarking effectively demonstrates the cost effectiveness over other representative storage strategies.
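The trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a brute-force search over a hypothetical linear dependency chain, where every dataset is derived from the one before it. The field names (`gen_cost`, `store_rate`, `use_rate`), the example figures, and the assumption that the workflow's original input is always available are all illustrative and do not come from the source.

```python
from itertools import product

# Hypothetical linear IDG: each dataset is derived from the previous one.
# gen_cost:   cost to compute the dataset from its direct predecessor ($)
# store_rate: cost to keep the dataset in cloud storage ($ per day)
# use_rate:   how often the dataset is requested (uses per day)
DATASETS = [
    {"name": "d1", "gen_cost": 10.0, "store_rate": 1.0, "use_rate": 0.5},
    {"name": "d2", "gen_cost": 2.0, "store_rate": 4.0, "use_rate": 0.1},
    {"name": "d3", "gen_cost": 30.0, "store_rate": 0.5, "use_rate": 0.2},
]


def cost_rate(datasets, stored):
    """Return the daily cost of a strategy; stored[i] is True if dataset i is kept."""
    total = 0.0
    regen = 0.0  # cost to rebuild the current dataset from its nearest stored ancestor
    for dataset, keep in zip(datasets, stored):
        if keep:
            # A stored dataset costs only its storage fee and resets the chain.
            total += dataset["store_rate"]
            regen = 0.0
        else:
            # A deleted dataset must be regenerated on every use, which
            # includes regenerating any deleted predecessors first.
            regen += dataset["gen_cost"]
            total += regen * dataset["use_rate"]
    return total


def min_cost_strategy(datasets):
    """Try every store/delete combination and return the cheapest one with its cost."""
    candidates = product([False, True], repeat=len(datasets))
    best = min(candidates, key=lambda s: cost_rate(datasets, s))
    return best, cost_rate(datasets, best)


if __name__ == "__main__":
    strategy, cost = min_cost_strategy(DATASETS)
    for dataset, keep in zip(DATASETS, strategy):
        print(dataset["name"], "store" if keep else "delete")
    print("daily cost:", round(cost, 2))
```

With these made-up figures the cheapest strategy keeps d1 and d3 and deletes d2, which is cheap to regenerate but expensive to store. Exhaustive search grows exponentially with the number of datasets, so it only serves to make the cost model concrete on toy inputs.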

Bibliographic details

  • Source
    Journal of Parallel and Distributed Computing | 2011, Issue 2 | pp. 316-332 | 17 pages
  • Author affiliations

    Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne 3122, Victoria, Australia;

    Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne 3122, Victoria, Australia;

    Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne 3122, Victoria, Australia;

    Faculty of Information and Communication Technologies, Swinburne University of Technology, Hawthorn, Melbourne 3122, Victoria, Australia;

  • Original format: PDF
  • Language of text: English
  • Keywords

    dataset storage; scientific workflow; cloud computing; cost benchmarking;

