Published in: IEEE Transactions on Parallel and Distributed Systems

A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets



Abstract

DNA sequencing has moved into the realm of Big Data due to the rapid development of high-throughput, low-cost Next-Generation Sequencing (NGS) technologies. Sequential data compression solutions that were once sufficient to store and distribute this information efficiently are now falling behind. In this paper we introduce phyNGSC, a hybrid MPI-OpenMP strategy to speed up the compression of big NGS data by combining the features of both distributed and shared memory architectures. Our algorithm balances the workload among processes and threads, alleviates memory latency by exploiting locality, and accelerates I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, we introduce a novel timestamp-based file structure that allows us to write the compressed data in a distributed and non-deterministic fashion while retaining the capability of reconstructing the dataset in its original order. Our experimental results show that phyNGSC achieved compression times for big NGS datasets that were 45 to 98 percent faster than NGS-specific sequential compressors, with throughputs of up to 3 GB/s. Our theoretical analysis and experimental results suggest strong scalability, with some datasets yielding super-linear speedups and constant efficiency. We were able to compress 1 terabyte of data in under 8 minutes, compared to more than 5 hours taken by NGS-specific compression algorithms running sequentially. Compared to other parallel solutions, phyNGSC achieved up to 6x speedups while maintaining a higher compression ratio. The code for this implementation is available at https://github.com/pcdslab/PHYNGSC.
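The idea of writing compressed blocks in a non-deterministic order while still being able to recover the original order can be illustrated with a small sketch. This is not the phyNGSC file format, which the abstract does not describe. It assumes, purely for illustration, that each compressed block carries a logical timestamp equal to its position in the input, stored in a hypothetical 12-byte header (`HEADER`), and it uses a seeded shuffle to stand in for the unpredictable order in which parallel workers would finish. The function names `write_unordered` and `read_ordered` are likewise made up for this example.

```python
import random
import struct
import zlib

# Hypothetical record header: (logical timestamp, payload length), little-endian.
HEADER = struct.Struct("<QI")


def write_unordered(blocks, seed=0):
    """Compress each block and append it in an arbitrary order, tagged with its timestamp."""
    order = list(range(len(blocks)))
    # The shuffle stands in for the non-deterministic completion order of workers.
    random.Random(seed).shuffle(order)
    out = bytearray()
    for ts in order:
        payload = zlib.compress(blocks[ts])
        out += HEADER.pack(ts, len(payload))
        out += payload
    return bytes(out)


def read_ordered(container):
    """Scan every record, then decompress the blocks sorted by timestamp."""
    records = []
    offset = 0
    while offset < len(container):
        ts, length = HEADER.unpack_from(container, offset)
        offset += HEADER.size
        records.append((ts, container[offset:offset + length]))
        offset += length
    records.sort(key=lambda record: record[0])
    return b"".join(zlib.decompress(payload) for _, payload in records)


# Toy FASTQ-like records standing in for blocks of an NGS dataset.
reads = [f"@read{i}\nACGT{i}\n+\nIIII\n".encode() for i in range(8)]
container = write_unordered(reads, seed=42)
assert read_ordered(container) == b"".join(reads)
```

Because every record is self-describing, a writer in this sketch never has to wait for earlier blocks to be written, which is the property the abstract attributes to the timestamp-based structure. A real distributed implementation would also need a way to coordinate file offsets among processes, which this single-process sketch leaves out.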
