Conference: IEEE EMBS International Conference on Biomedical and Health Informatics

Kafka interfaces for composable streaming genomics pipelines



Abstract

Modern sequencing machines produce on the order of a terabyte of data per day, which must then pass through a complex processing pipeline. The conventional workflow begins with a few independent, shared-memory tools that communicate through intermediate files. Because it lacks robustness and scalability, this approach is ill-suited to exploiting the full potential of sequencing in healthcare, where large-scale, population-wide applications are the norm. In this work we propose adopting stream computing to simplify the genomic resequencing pipeline, boosting its performance and improving its fault tolerance. We decompose the first steps of genomic processing into two distinct, specialized modules (preprocessing and alignment) and compose them loosely through Kafka streams, which allows easy composability and integration into existing YARN-based pipelines. The proposed solution is experimentally validated on real data and shown to scale almost linearly.
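The loose coupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `InMemoryTopicBus` is a hypothetical in-memory stand-in for Kafka topics (it is not the Kafka client API), the topic names `reads.preprocessed` and `reads.aligned` are assumptions, and the exact-match `align` function is only a placeholder for a real aligner. The point is that each stage reads from one topic and writes to another, so neither stage needs to know about the other.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple


class InMemoryTopicBus:
    """Toy stand-in for a set of Kafka topics (hypothetical helper, not a Kafka API)."""

    def __init__(self) -> None:
        self._topics: Dict[str, Deque[Tuple[str, str]]] = defaultdict(deque)

    def produce(self, topic: str, key: str, value: str) -> None:
        # Append a keyed record to the end of the named topic.
        self._topics[topic].append((key, value))

    def consume(self, topic: str) -> Iterator[Tuple[str, str]]:
        # Drain the records currently in the topic, oldest first.
        queue = self._topics[topic]
        while queue:
            yield queue.popleft()


def preprocess(raw_reads: Iterable[Tuple[str, str]], bus: InMemoryTopicBus,
               out_topic: str = "reads.preprocessed") -> None:
    """Stage 1: normalise raw reads and publish them to a topic."""
    for read_id, bases in raw_reads:
        bus.produce(out_topic, read_id, bases.strip().upper())


def align(bus: InMemoryTopicBus, reference: str,
          in_topic: str = "reads.preprocessed",
          out_topic: str = "reads.aligned") -> None:
    """Stage 2: naive exact-match lookup, a placeholder for a real aligner."""
    for read_id, bases in bus.consume(in_topic):
        position = reference.find(bases)  # -1 means the read did not map
        bus.produce(out_topic, read_id, str(position))


bus = InMemoryTopicBus()
preprocess([("r1", "acgt\n"), ("r2", "ttga"), ("r3", "gggg")], bus)
align(bus, reference="CCACGTTTGA")
results = dict(bus.consume("reads.aligned"))
print(results)
```

In a real deployment the bus would be replaced by a Kafka producer and consumer, and the two stages could run as separate processes that scale independently. The sketch keeps everything in one process only so that it runs without a broker.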
