Spark framework for transcriptomic trimming algorithm reduces cost of reading multiple input files

Abstract

In this paper, we investigate the feasibility and performance improvement of adapting a common standalone bioinformatics trimming tool for in-memory processing on a distributed Spark framework. The rapid and continuous rise of genomics technologies and applications demands fast and efficient genomic data processing pipelines. ADAM has emerged as a successful framework for handling large scientific datasets, and efforts are ongoing to expand its functionality in the bioinformatics pipeline. We hypothesize that executing as much of the pipeline as possible within the ADAM framework will improve the pipeline's time and disk requirements. We compare Trimmomatic, one of the most common raw read trimming algorithms, to our own simple Scala trimmer and show that the distributed framework allows our trimmer to suffer less overhead from increasing the number of input files. We conclude that executing Trimmomatic in Spark will improve performance with multiple file inputs. Future work will investigate the performance benefit of passing the distributed dataset directly to ADAM in memory rather than writing out an intermediate file to disk.
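The abstract does not specify how the authors' trimmer works, so the following is only a minimal sketch of the general idea, written in plain Python rather than the authors' Scala and without Spark so that it stays self-contained. It assumes a Trimmomatic-style trailing-quality trim (drop low-quality bases from the 3' end of each read) applied record by record across several inputs. The names `FastqRecord`, `parse_fastq`, `trim_trailing`, and `trim_files`, the Phred+33 encoding, and the threshold values are all illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    bases: str
    quals: str  # Phred+33 quality string (assumed), same length as bases


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group a stream of well-formed FASTQ lines into 4-line records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip stray blank lines between records
        bases = next(it)
        next(it)  # the '+' separator line
        quals = next(it)
        yield FastqRecord(header[1:], bases, quals)


def trim_trailing(rec: FastqRecord, min_qual: int = 20, offset: int = 33) -> FastqRecord:
    """Remove bases from the 3' end while their quality is below min_qual."""
    end = len(rec.bases)
    while end > 0 and ord(rec.quals[end - 1]) - offset < min_qual:
        end -= 1
    return rec._replace(bases=rec.bases[:end], quals=rec.quals[:end])


def trim_files(files: List[Iterable[str]], min_qual: int = 20, min_len: int = 1) -> List[FastqRecord]:
    """Trim every read from every input and drop reads shorter than min_len."""
    kept = []
    for lines in files:
        for rec in parse_fastq(lines):
            trimmed = trim_trailing(rec, min_qual)
            if len(trimmed.bases) >= min_len:
                kept.append(trimmed)
    return kept


if __name__ == "__main__":
    # Two tiny in-memory "files" stand in for multiple FASTQ inputs.
    file_a = ["@r1\n", "ACGTAC\n", "+\n", "IIII##\n"]
    file_b = ["@r2\n", "ACGT\n", "+\n", "####\n"]
    for rec in trim_files([file_a, file_b]):
        print(rec.name, rec.bases, rec.quals)
```

In a distributed setting, a per-record function like `trim_trailing` is the kind of step that could be mapped over a single dataset built from all input files, which is one plausible reading of why the number of inputs adds less overhead there than launching a standalone tool per file.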

Bibliographic details

Source
International Conference for Internet Technology and Secured Transactions, 2017, pp. 469-471 (3 pages)
Authors
Walter Blair; Aspen Olmsted; Paul Anderson
Original format: PDF
Keywords
Bioinformatics; Pipelines; Sparks; Genomics; Tools; Sequential analysis; Big Data;
