Computational clusters have long provided a mechanism for the acceleration of high performance computing (HPC) applications. As today's supercomputers approach the petaflop scale, however, they are also exhibiting an increase in heterogeneity. This heterogeneity spans a range of technologies, from multiple operating systems to hardware accelerators and novel architectures. Because of the exceptional acceleration some of these heterogeneous architectures provide, they are being embraced as viable tools for HPC applications, particularly in the area of biological sequence analysis.

In this dissertation we study in detail two challenges that arise in this setting. We begin with the HMMER sequence analysis suite, which uses a readily parallelizable algorithm based on profile hidden Markov models. To date, however, HMMER has seen only limited use in the HPC setting because it relies on PVM for parallelization. We develop a more scalable distributed implementation of HMMER, called MPI-HMMER, and extend it to use multiple FPGAs for greater acceleration.

The heterogeneous aspect of the acceleration brings to the forefront the second challenge studied in this dissertation: fault tolerance and checkpointing for HPC systems. To address the challenges of HPC checkpointing, we develop a fault-tolerant MPI based on LAM/MPI with asynchronous replication and checkpoint migration, eliminating the need for central or network storage and allowing MPI topologies to be reconfigured in the event of node failure. We evaluate centralized storage, SAN-based solutions, and a solution based on a commercial parallel file system, and show that they are not scalable. We then show that our replication-based checkpointing/migration system is uniquely capable of handling the large amount of data generated by a supercomputing application's checkpoint.

As a first step towards supporting the checkpointing of heterogeneous systems, we then explore the use of virtualization for high performance computing. Using OpenVZ, we demonstrate that checkpointing virtualized computational clusters is indeed feasible with relatively low overhead. By adapting the idea of checkpoint replication to the virtual environment, we eliminate any need for network storage or centralized servers and reduce the impact of checkpointing on non-participating cluster nodes and users.