Source: U.S. National Institutes of Health literature, Genes

A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce

Abstract

Next-generation sequencing (NGS) technologies produce huge amounts of biological data, which demand long processing times and large amounts of memory. This research focuses on detecting single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. We propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments were performed: in the alignment phase the results are compared with the Bowtie and BWA aligners, and in the SNP calling phase with GATK, FaSD, SparkGA, Halvade, and Heap. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. On average, the proposed framework achieved a 22.46% better F-score and a consistent accuracy of 99.80%; its mean accuracy was also 0.21% higher in comparison. Moreover, SNP mining was performed to identify specific regions in genome sequences. All frameworks were run with the default memory management configuration, and the observations show that all workflows have approximately the same memory requirements. Future work is intended to display the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
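The abstract reports F-score and accuracy without giving their definitions. As a rough illustration only, the sketch below computes both from a set of called SNPs and a truth set using the standard textbook definitions; whether the paper uses exactly these definitions is an assumption. The function name `snp_metrics`, the `(chromosome, position)` representation, and the sample values are hypothetical and are not taken from the paper.

```python
def snp_metrics(called, truth, n_sites):
    """Compare called SNP sites against a truth set (hypothetical helper).

    called, truth: sets of (chromosome, position) tuples.
    n_sites: total number of genomic sites evaluated, needed for accuracy.
    """
    tp = len(called & truth)       # SNPs called and present in the truth set
    fp = len(called - truth)       # SNPs called but absent from the truth set
    fn = len(truth - called)       # true SNPs that were missed
    tn = n_sites - tp - fp - fn    # sites correctly left uncalled

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / n_sites
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Made-up toy data, for illustration only.
called = {("chr1", 100), ("chr1", 250), ("chr2", 40)}
truth = {("chr1", 100), ("chr2", 40), ("chr2", 90)}
metrics = snp_metrics(called, truth, n_sites=1000)
print(metrics)
```

Because true negatives dominate when most sites carry no SNP, accuracy tends to stay close to 1 even when the F-score differs noticeably between callers, which is one reason both numbers are usually reported together.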
