Conference: IEEE International Symposium on Parallel and Distributed Processing with Applications

A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets


Abstract

High-throughput Next-Generation Sequencing (NGS) techniques produce massive datasets whose storage, analysis, and transmission present significant challenges. One solution for storage and transmission is compression with specialized algorithms. Existing specialized algorithms scale poorly as dataset size increases, and the best available solutions can take hours to compress gigabytes of data. For petascale datasets, compression and decompression with these techniques are prohibitively expensive in time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application that uses a message-passing model and reduces the compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were observed, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms. The code is available free for academic use at https://github.com/PCDS/paraDSRC.
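To make the O(1/p) claim concrete, the sketch below shows the general data-parallel idea of giving each of p processing units a near-equal contiguous byte range of the input. It is an illustration only and not the paraDSRC implementation: the function names `partition_ranges` and `ideal_time` are hypothetical, and the idealized time ignores I/O, communication, and the need to align range boundaries with sequencing record boundaries.

```python
def partition_ranges(total_bytes: int, p: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into p contiguous, near-equal half-open ranges.

    Hypothetical helper: a real tool would also shift each boundary to the
    start of the next complete record before compressing its range.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    base, extra = divmod(total_bytes, p)
    ranges = []
    start = 0
    for rank in range(p):
        # The first `extra` ranks take one additional byte each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def ideal_time(serial_seconds: float, p: int) -> float:
    """Idealized per-unit compression time, assuming perfect load balance
    and no I/O or communication overhead (an assumption, not a measurement)."""
    if p <= 0:
        raise ValueError("p must be positive")
    return serial_seconds / p


if __name__ == "__main__":
    for rank, (lo, hi) in enumerate(partition_ranges(10, 3)):
        print(f"rank {rank}: bytes [{lo}, {hi})")
    print(ideal_time(120.0, 4))
```

Under these assumptions each unit handles roughly 1/p of the input, which is where the 1/p factor comes from; measured speedups depend on the cluster, the file system, and the dataset.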
