Conference: 2013 IEEE International Conference on Big Data

CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework



Abstract

Next-generation sequencing (NGS) technologies produce huge amounts of data. These data inevitably contain sequencing errors, which are one of the major obstacles to downstream analysis. As the literature shows, error correction is a critical step in NGS applications such as de novo genome assembly and DNA resequencing. However, it demands substantial computing time and memory. Designing an algorithm that improves data quality by making efficient use of on-demand computing resources in the cloud is a challenge for biologists and computer scientists. In this study, we present CloudRS, an error-correction algorithm for NGS data. CloudRS aims to emulate the error-correction approach of ALLPATHS-LG on the Hadoop/MapReduce framework. It corrects sequencing errors conservatively to avoid introducing false corrections, e.g., when dealing with reads from repetitive regions. We also describe several probabilistic measures introduced into CloudRS that make the algorithm more efficient without sacrificing its effectiveness. Running-time measurements on up to 80 instances, each with 8 computing units, show satisfactory speedup. In comparisons with other error-correction programs, CloudRS achieves a lower false positive rate on most evaluation benchmarks and higher sensitivity on the S. cerevisiae genome. We demonstrate that CloudRS significantly improves the quality of the resulting contigs on NGS de novo assembly benchmarks.
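The abstract does not spell out the correction procedure, so the following is only a rough illustration of the general idea, not CloudRS itself. It assumes a read-stack style of correction: reads that share a k-mer are stacked on that k-mer, each column is put to a majority vote, and a base is changed only when the consensus is strong and all stacks agree. The map, shuffle, and reduce steps are mimicked with plain Python functions rather than Hadoop, and every name and threshold here (`map_kmers`, `reduce_stack`, `correct_reads`, `k`, `min_support`, `min_ratio`) is hypothetical.

```python
from collections import Counter, defaultdict


def map_kmers(read_id, seq, k):
    """Map step: emit (k-mer, (read_id, offset)) for every k-mer in one read."""
    for offset in range(len(seq) - k + 1):
        yield seq[offset:offset + k], (read_id, offset)


def shuffle(pairs):
    """Group mapped values by k-mer, as a MapReduce shuffle phase would."""
    groups = defaultdict(list)
    for kmer, placement in pairs:
        groups[kmer].append(placement)
    return groups


def reduce_stack(placements, reads, min_support, min_ratio):
    """Reduce step: stack the reads sharing one k-mer and vote per column."""
    columns = defaultdict(list)
    for read_id, offset in placements:
        for pos, base in enumerate(reads[read_id]):
            # Column index is the position relative to the shared k-mer.
            columns[pos - offset].append((read_id, pos, base))
    for entries in columns.values():
        counts = Counter(base for _, _, base in entries)
        consensus, votes = counts.most_common(1)[0]
        # Conservative rule (assumed): skip thin or ambiguous columns,
        # which is where reads from repeats tend to disagree.
        if len(entries) < min_support or votes / len(entries) < min_ratio:
            continue
        for read_id, pos, base in entries:
            if base != consensus:
                yield (read_id, pos), consensus


def correct_reads(reads, k=4, min_support=3, min_ratio=0.75):
    """Run the toy map/shuffle/reduce pipeline and apply agreed corrections."""
    pairs = [p for rid, seq in enumerate(reads) for p in map_kmers(rid, seq, k)]
    suggestions = defaultdict(set)
    for placements in shuffle(pairs).values():
        for site, base in reduce_stack(placements, reads, min_support, min_ratio):
            suggestions[site].add(base)
    corrected = [list(seq) for seq in reads]
    for (read_id, pos), bases in suggestions.items():
        # Apply a change only when every stack proposes the same base.
        if len(bases) == 1:
            corrected[read_id][pos] = next(iter(bases))
    return ["".join(seq) for seq in corrected]


reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGTAGCA"]
print(correct_reads(reads))
# ['ACGTTGCA', 'ACGTTGCA', 'ACGTTGCA', 'ACGTTGCA']
```

In this toy input the last read differs from three agreeing reads at one position, so it is corrected. If two reads carried `T` and two carried `A` at that position, the vote would fall below `min_ratio` and nothing would change, which is the kind of conservative behavior the abstract describes for repetitive regions. Quality scores and the probabilistic measures mentioned in the abstract are not modeled here.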
