Fault-tolerant techniques for high performance computing and a bioinformatics application.



Abstract

Computational clusters have long provided a mechanism for the acceleration of high performance computing (HPC) applications. As today's supercomputers approach the petaflop scale, however, they are also exhibiting an increase in heterogeneity. This heterogeneity spans a range of technologies, from multiple operating systems to hardware accelerators and novel architectures. Because of the exceptional acceleration some of these heterogeneous architectures provide, they are being embraced as viable tools for HPC applications, particularly in the area of biological sequence analysis.

In this dissertation we study two of these challenges in detail. We begin with the HMMER sequence analysis suite. It uses a readily parallelizable algorithm based on profile hidden Markov models. However, to date HMMER has seen only limited use in the HPC setting due to its reliance on PVM for parallelization. We develop a more scalable distributed implementation of HMMER, called MPI-HMMER, and extend it to include the use of multiple FPGAs for greater acceleration.

The heterogeneous aspect of the acceleration brings to the forefront the second challenge studied in this dissertation: fault-tolerance and checkpointing for HPC systems. To address the challenges of HPC checkpointing, we develop a fault-tolerant MPI based on LAM/MPI with asynchronous replication along with checkpoint migration, eliminating the need for central or network storage and allowing for reconfigurable MPI topologies in the event of node failure. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system-based solution and show that they are not scalable. As a result, we show that our replication-based checkpointing/migration system is uniquely capable of handling the large amount of data generated by a supercomputing application's checkpoint.

As a first step towards supporting the checkpointing of heterogeneous systems, we then explore the idea of using virtualization for high performance computing. Using OpenVZ, we demonstrate that the checkpointing of virtualized computational clusters is indeed feasible with relatively low overhead. By adapting the idea of checkpoint replication to the virtual environment, we eliminate any need for network storage or centralized servers, and reduce the impact of checkpointing on non-participating cluster nodes and users.

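Bibliographic record

  • Author

    Walters, John Paul N.

  • Author affiliation

    Wayne State University, Computer Science

  • Degree-granting institution: Wayne State University, Computer Science
  • Subjects: Biology, Bioinformatics; Computer Science
  • Degree: Ph.D.
  • Year: 2007
  • Pages: 174 p.
  • Total pages: 174
  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: Automation technology, computer technology
  • Keywords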
