Conference: IEEE International Parallel and Distributed Processing Symposium

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Abstract

Throughput from sequencing instruments has been increasing at an unprecedented rate, leading to an explosion of next-generation sequencing (NGS) data and to challenges in storing, managing, and analyzing these datasets. Parallelism is key to handling large-scale data, and some progress has been made in parallelizing important steps such as sequence alignment. However, other major steps remain sequential, which limits the ability to handle massive datasets. In this paper, we focus on parallelizing algorithms from two areas. The first is efficient conversion among a wide variety of sequence data formats, which is important for cross-utilization of different analysis modules. The second is statistical analysis. Our parallel sequence data format converter allows sequence datasets in BAM/SAM format to be converted into multiple formats, including SAM/BAM, BED, FASTA, FASTQ, BEDGRAPH, JSON, and YAML, using both shared-memory and distributed-memory parallelism. The converter currently comprises three instances: a SAM format converter, a BAM format converter, and a preprocessing-optimized SAM format converter. The converter also supports partial format conversion, in which conversion is performed only on a specified chromosome region. The statistical analysis module includes a parallelized non-local means (NL-means) algorithm and false discovery rate (FDR) computation. Through extensive evaluation, we demonstrate high scalability of our framework.
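The abstract does not describe the converter's internals, so the following is only a minimal sketch of the general idea: split SAM alignment lines into chunks, convert each chunk to BED independently, and optionally keep only reads overlapping a chromosome region (the "partial conversion" case). The function names (`sam_line_to_bed`, `convert_chunk`, `sam_to_bed_parallel`) and the `(chrom, start, end)` region tuple are hypothetical and are not taken from the paper. A thread pool is used here purely to illustrate the data decomposition; in CPython a process pool or MPI would be needed for real CPU-bound speedup.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_line_to_bed(line, region=None):
    """Convert one SAM alignment line to a BED6 line, or None if skipped."""
    if line.startswith("@"):  # header line, no alignment to convert
        return None
    f = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = f[0], int(f[1]), f[2], int(f[3]), f[4], f[5]
    if flag & 0x4 or rname == "*" or cigar == "*":  # unmapped read
        return None
    start = pos - 1  # SAM is 1-based; BED is 0-based, half-open
    end = start + sum(int(n) for n, op in CIGAR_RE.findall(cigar)
                      if op in REF_CONSUMING)
    if region is not None:  # partial conversion: keep overlapping reads only
        chrom, lo, hi = region
        if rname != chrom or end <= lo or start >= hi:
            return None
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


def convert_chunk(args):
    """Convert one chunk of SAM lines; chunks are independent of each other."""
    lines, region = args
    converted = (sam_line_to_bed(ln, region) for ln in lines)
    return [bed for bed in converted if bed is not None]


def sam_to_bed_parallel(lines, workers=4, region=None):
    """Split the input into roughly equal chunks and convert them concurrently."""
    lines = list(lines)
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [(lines[i:i + size], region) for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in chunk order, so output order matches input.
        return [bed for part in pool.map(convert_chunk, chunks) for bed in part]
```

Because each alignment line is converted without reference to any other line, the chunks can be processed in any order and concatenated afterwards, which is what makes this step a natural candidate for both shared-memory and distributed-memory parallelism.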