Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

Tanveer Ahmad; Nauman Ahmed; Zaid Al-Ars; H. Peter Hofstee

Abstract

Background: Improvements in sequencing technologies have made it possible to produce large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data. This data must be processed efficiently for downstream analyses. For fast processing, computing systems need these large amounts of data close to the processor, with low latency. However, existing workflows depend heavily on disk storage and access, so processing this data incurs large disk I/O overheads. Previously, the cost, volatility and other physical constraints of DRAM made it infeasible to place large working data sets in memory. Recent developments in storage-class memory and non-volatile memory technologies now allow computing systems to hold very large data sets in memory and process them there directly, avoiding disk I/O bottlenecks. Exploiting such memory systems efficiently requires placing properly formatted data in memory and accessing it at high throughput, without (de)serialization and copy overheads between processes. For this purpose, we use the recently developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory, without accessing disk storage and without (de)serialization and copy overheads.

Implementation: We integrate an Apache Arrow based in-memory Sequence Alignment/Map (SAM) format and its shared memory object store library into widely used genomics high-throughput data processing applications such as BWA-MEM, Picard and GATK, allowing in-memory communication between these applications. This also allows us to exploit the cache locality of tabular data and the parallel processing capabilities offered by shared memory objects.

Results: Our implementation shows that adopting an in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses due to high cache locality, and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare against existing in-memory data placement and sharing techniques, namely ramDisk and Unix pipes, and show that the columnar in-memory data representation outperforms both. We achieve a speedup of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of variant calling workflows. Compared to the second fastest workflow, the speedup is 1.45x and 1.27x for these data sets, respectively. For some individual tools, particularly sorting, duplicate removal and base quality score recalibration, the speedup is even higher.

Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM .
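
To make the idea of a columnar SAM representation more concrete, the following is a minimal sketch using only the Python standard library. It is not the Apache Arrow API and not the ArrowSAM schema: the toy records, the helper names (`to_columns`, `coordinate_sort_order`, `take`) and the choice of fields are assumptions made for illustration. It shows how storing each SAM field as its own contiguous column lets a coordinate sort read only the RNAME and POS columns.

```python
from array import array

# Hypothetical toy records (QNAME, FLAG, RNAME, POS, SEQ); not real data.
ROWS = [
    ("read3", 0, "chr2", 150, "ACGT"),
    ("read1", 16, "chr1", 300, "TTGA"),
    ("read2", 0, "chr1", 100, "GGCA"),
]


def to_columns(rows):
    """Turn row-wise records into one sequence per SAM field."""
    qname, flag, rname, pos, seq = zip(*rows)
    return {
        "QNAME": list(qname),
        "FLAG": array("H", flag),  # unsigned 16-bit bitwise flag
        "RNAME": list(rname),
        "POS": array("l", pos),    # signed integer, at least 32 bits
        "SEQ": list(seq),
    }


def coordinate_sort_order(cols):
    """Return a row permutation ordered by (RNAME, POS).

    Only two columns are read. RNAME is compared as a plain string here,
    a simplification: real tools order references as listed in the header.
    """
    n = len(cols["POS"])
    return sorted(range(n), key=lambda i: (cols["RNAME"][i], cols["POS"][i]))


def take(cols, order):
    """Apply a row permutation to every column."""
    return {name: [col[i] for i in order] for name, col in cols.items()}


columns = to_columns(ROWS)
order = coordinate_sort_order(columns)
sorted_cols = take(columns, order)
print(sorted_cols["QNAME"])
```

In a row-oriented layout the same sort would have to step over every field of every record, including the long SEQ strings, to reach the two keys. The columnar layout keeps those keys contiguous, which is the cache locality argument made above.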

