Journal: Future Generation Computer Systems

Scalable and efficient whole-exome data processing using workflows on the cloud

Abstract

Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications, and are an attractive alternative to low-level scripting. At the same time, workflow management systems (WFMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, which are common, for instance, in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution, based on the e-Science Central WFMS and deployed in the cloud, clearly outperforms the original HPC-based implementation, achieving up to 2.3× speed-up. However, we also describe how important it is, for delivering such performance, to optimise the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs, combined with the implicit use of local disk resources by e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple an understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating the parallelisation of complex pipelines are required.
