Journal: International Journal of Parallel Programming

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive


Abstract

Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also made some improvements to the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
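The abstract does not say how the compressed k-mer storage is implemented. As an illustration only, one common way to compress k-mers is to pack each base into 2 bits, so that a k-mer of length k fits in a 2k-bit integer instead of a k-character string. The sketch below assumes that scheme; the function names `pack_kmer`, `unpack_kmer`, and `packed_kmers` are hypothetical and are not taken from BioPig.

```python
# Hypothetical 2-bit k-mer packing; an assumed scheme, not BioPig's actual code.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer over the alphabet ACGT into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string of length k from its packed integer form."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def packed_kmers(read: str, k: int):
    """Yield the packed k-mers of a read using a rolling 2-bit update.

    Windows that contain a non-ACGT character (for example N) are skipped.
    """
    mask = (1 << (2 * k)) - 1  # keeps only the most recent k bases
    value = 0
    valid = 0  # number of consecutive valid bases seen so far
    for base in read:
        code = _CODE.get(base)
        if code is None:
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield value
```

With this encoding, a 31-mer occupies 62 bits and fits in a single 64-bit word, which is one reason packed representations are popular for k-mer counting and comparison.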
