Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Abstract

Throughput from sequencing instruments has been increasing at an unprecedented rate, leading to an explosion of next-generation sequencing (NGS) data and to challenges in storing, managing, and analyzing these datasets. Parallelism is key to handling large-scale data, and some progress has been made in parallelizing important steps such as sequence alignment. However, other major steps remain sequential, limiting the ability to handle massive datasets. In this paper, we focus on parallelizing algorithms from two areas. The first is efficient data format conversion among a wide variety of sequence data formats, which is important for cross-utilization of different analysis modules. The second is statistical analysis. Our parallel sequence data format converter allows sequence datasets in BAM/SAM format to be converted into multiple formats, including SAM/BAM, BED, FASTA, FASTQ, BEDGRAPH, JSON, and YAML, using both shared memory and distributed memory parallelism. The converter currently comprises three instances: a SAM format converter, a BAM format converter, and a preprocessing-optimized SAM format converter. Our converter also supports partial format conversion, in which only a specified chromosome region is converted. The statistical analysis module includes a parallelized non-local means (NL-means) algorithm and false discovery rate (FDR) computation. Through extensive evaluation, we demonstrate the high scalability of our framework.
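The chunked, region-restricted conversion described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`sam_line_to_bed`, `convert_chunk`, `parallel_sam_to_bed`), the BED6 output layout, and the `(chrom, start, end)` region tuple are assumptions made for illustration, and a thread pool stands in for the shared-memory workers (a process pool or MPI ranks would be the likelier choice for real speedups).

```python
import re
from concurrent.futures import ThreadPoolExecutor

# CIGAR operations; only M, D, N, = and X consume reference bases.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def sam_line_to_bed(line, region=None):
    """Convert one SAM alignment line to a BED6 line, or return None if skipped.

    region is a hypothetical (chrom, start, end) tuple in 0-based,
    half-open coordinates; alignments not overlapping it are skipped.
    """
    if line.startswith("@"):          # header line
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:              # malformed record
        return None
    qname, flag, rname = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or rname == "*" or cigar == "*":   # unmapped read
        return None
    start = pos - 1                   # SAM is 1-based, BED is 0-based
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    end = start + ref_len
    if region is not None:
        chrom, lo, hi = region
        if rname != chrom or end <= lo or start >= hi:
            return None
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


def convert_chunk(lines, region=None):
    """Convert a chunk of SAM lines, dropping skipped records."""
    converted = (sam_line_to_bed(line, region) for line in lines)
    return [bed for bed in converted if bed is not None]


def parallel_sam_to_bed(lines, workers=4, chunk_size=1000, region=None):
    """Split the input into chunks, convert them concurrently, keep input order."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: convert_chunk(c, region), chunks))
    return [bed for chunk in results for bed in chunk]
```

Because each alignment line is converted independently, the chunks need no communication, which is what makes this step straightforward to distribute. Passing `region=("chr1", 150, 250)` restricts the output to alignments overlapping that interval, mirroring the partial conversion mentioned above.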

Bibliographic details

Source
IEEE International Parallel Distributed Processing Symposium | 2014 | pp. 508-517 | 10 pages
Authors
Yi Wang; Agrawal G.; Ozer G.; Kun Huang
Original format: PDF
Keywords
data analysis; distributed memory systems; electronic data interchange; shared memory systems; statistical analysis; BAM format converter; BED format; BEDGRAPH format; FASTA format; FASTQ format; FDR computation; JSON format; NGS data analysis; SAM format converter; YAML format; data format conversion; distributed memory parallelism; false discovery rate computation; large-scale data handling; next-generation sequencing data analysis; parallelization sequence data format converter; parallelized NL-means algorithm; parallelized nonlocal means algorithm; parallelizing algorithms; partial format conversion; preprocessing-optimized SAM format converter; sequence alignment; shared memory parallelism; statistical analysis; Algorithm design and analysis; Bioinformatics; Genomics; Histograms; Program processors; Sequential analysis; Statistical analysis; Data Format Conversion; Next-Generation Sequencing; Parallelization; Statistical Analysis;
