Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Han Lin; Zhichao Su; Xiandong Meng; Xu Jin; Zhong Wang; Wenting Han; Hong An; Mengxian Chi; Zheng Wu


Abstract

Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, ranging from several GBs to a few TBs. Analyzing metagenomics data includes both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomics sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytic toolkit with MPI to provide scalability with respect to both data and compute. We also made some improvements to the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
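
The abstract mentions compressed k-mer storage but does not describe the encoding. As an illustration only, the sketch below shows one common way to compress k-mers: packing each base into 2 bits so that a k-mer fits in a single integer rather than a k-character string. The 2-bit scheme and the names pack_kmer, unpack_kmer, and packed_kmers are assumptions made for this example; they are not taken from BioPig or from the paper.

```python
# Hypothetical 2-bit k-mer packing; the paper's actual storage format is not
# described in the abstract, so this is an assumed, illustrative scheme.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer of A/C/G/T into an integer, 2 bits per base."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | _CODE[base]  # raises KeyError on non-ACGT
    return value


def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its packed form; k must be supplied."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def packed_kmers(read: str, k: int):
    """Yield packed k-mers of a read using a rolling update.

    Windows that contain a non-ACGT character (for example N) are skipped.
    """
    mask = (1 << (2 * k)) - 1
    value = 0
    valid = 0  # number of consecutive valid bases seen so far
    for base in read.upper():
        code = _CODE.get(base)
        if code is None:
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield value
```

Under this assumed scheme a 31-mer occupies 62 bits, so it fits in one 64-bit word, compared with 31 bytes as an ASCII string. The length k has to be stored separately, because leading A bases encode as zero bits.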

Bibliographic details

Source
International Journal of Parallel Programming, 2018, Issue 4, pp. 762-775 (14 pages)
Authors
Han Lin; Zhichao Su; Xiandong Meng; Xu Jin; Zhong Wang; Wenting Han; Hong An; Mengxian Chi; Zheng Wu
Author affiliations

University of Science and Technology of China;

University of Science and Technology of China;

DOE Joint Genome Institute and Lawrence Berkeley National Laboratory;

University of Science and Technology of China;

DOE Joint Genome Institute and Lawrence Berkeley National Laboratory;

University of Science and Technology of China;

University of Science and Technology of China;

University of Science and Technology of China;

University of Science and Technology of China;

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Metagenomics; Hadoop; MPI; Optimization; Pig Latin; BioPig; Big data; Data-intensive; Compute-intensive;
