International Conference for Internet Technology and Secured Transactions

Spark framework for transcriptomic trimming algorithm reduces cost of reading multiple input files


Abstract

In this paper, we investigate the feasibility and performance improvement of adapting a common standalone bioinformatics trimming tool for in-memory processing on a distributed Spark framework. The rapid and continuous rise of genomics technologies and applications demands fast and efficient genomic data processing pipelines. ADAM has emerged as a successful framework for handling large scientific datasets, and efforts are ongoing to expand its functionality in the bioinformatics pipeline. We hypothesize that executing as much of the pipeline as possible within the ADAM framework will improve the pipeline's time and disk requirements. We compare Trimmomatic, one of the most common raw read trimming algorithms, to our own simple Scala trimmer and show that the distributed framework allows our trimmer to suffer less overhead from increasing the number of input files. We conclude that executing Trimmomatic in Spark will improve performance with multiple file inputs. Future work will investigate the performance benefit of passing the distributed dataset directly to ADAM in memory rather than writing out an intermediate file to disk.
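To make the idea of a simple per-read trimmer concrete, here is a minimal Python sketch. It is not the authors' Scala trimmer, whose details the abstract does not give. It assumes FASTQ input with Phred+33 qualities and a trailing-quality cutoff; the names `Read`, `parse_fastq`, `trim_trailing`, and `trim_many`, and the default thresholds, are hypothetical. In a Spark setting, a function like `trim_trailing` would be the piece mapped over the distributed collection of reads.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Read(NamedTuple):
    name: str   # FASTQ header without the leading '@'
    bases: str  # nucleotide sequence
    quals: str  # Phred+33 quality string, same length as bases


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from 4-line FASTQ records (header, bases, '+', qualities)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        bases = next(it).strip()
        next(it)  # separator line starting with '+'
        quals = next(it).strip()
        yield Read(header.strip()[1:], bases, quals)


def trim_trailing(read: Read, min_qual: int = 20, min_len: int = 4) -> Optional[Read]:
    """Cut low-quality bases off the 3' end; return None if too little remains.

    min_qual and min_len are illustrative defaults, not values from the paper.
    """
    end = len(read.bases)
    while end > 0 and ord(read.quals[end - 1]) - 33 < min_qual:
        end -= 1
    if end < min_len:
        return None
    return Read(read.name, read.bases[:end], read.quals[:end])


def trim_many(files: Iterable[Iterable[str]], min_qual: int = 20) -> List[Read]:
    """Treat several inputs as one logical dataset and trim every read."""
    return [
        trimmed
        for lines in files
        for read in parse_fastq(lines)
        if (trimmed := trim_trailing(read, min_qual)) is not None
    ]
```

Because the trimming function depends only on a single read, adding input files only lengthens the collection it is applied to. That is the property a distributed framework can exploit.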
