Published in: IEEE International Conference on Big Data

CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework

Abstract

Next-generation sequencing (NGS) technologies produce huge amounts of data. These data are unavoidably accompanied by sequencing errors, which constitute one of the major problems for downstream analyses. Error correction is a critical step for the success of NGS applications such as de novo genome assembly and DNA resequencing, as the literature illustrates. However, it demands substantial computing time and memory. Designing an algorithm that improves data quality by efficiently using on-demand computing resources in the cloud is a challenge for biologists and computer scientists. In this study, we present CloudRS, an error-correction algorithm for NGS data. CloudRS aims to emulate the error-correction approach of ALLPATHS-LG on the Hadoop/MapReduce framework. It corrects sequencing errors conservatively to avoid introducing false decisions, e.g., when dealing with reads from repetitive regions. We also describe several probabilistic measures that we introduce into CloudRS to make the algorithm more efficient without sacrificing its effectiveness. Running times on up to 80 instances, each with 8 computing units, show satisfactory speedup. Comparisons with other error-correction programs show that CloudRS achieves a lower false positive rate on most evaluation benchmarks and higher sensitivity on the S. cerevisiae genome. We demonstrate that CloudRS provides significant improvements in the quality of the resulting contigs on benchmarks of NGS de novo assembly.
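The abstract reports sensitivity and false positive rate but does not state how they are computed. As a minimal sketch, assuming the per-base definitions commonly used when evaluating read error correction against a known reference, the two measures could be calculated as below. The names `Counts`, `classify`, `sensitivity`, and `false_positive_rate` are hypothetical and are not taken from CloudRS.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # erroneous base restored to the true base
    fn: int = 0  # erroneous base still wrong after correction
    fp: int = 0  # correct base changed to a wrong base
    tn: int = 0  # correct base left unchanged


def classify(truth: str, raw: str, corrected: str) -> Counts:
    """Compare one read before and after correction, base by base.

    Assumes the three strings are aligned and of equal length
    (substitution errors only); this is an illustrative simplification.
    """
    counts = Counts()
    for t, r, c in zip(truth, raw, corrected):
        if r != t:  # the raw base was a sequencing error
            if c == t:
                counts.tp += 1
            else:
                counts.fn += 1
        else:  # the raw base was already correct
            if c != t:
                counts.fp += 1
            else:
                counts.tn += 1
    return counts


def sensitivity(c: Counts) -> float:
    """Fraction of erroneous bases that were fixed: TP / (TP + FN)."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def false_positive_rate(c: Counts) -> float:
    """Fraction of correct bases that were damaged: FP / (FP + TN)."""
    total = c.fp + c.tn
    return c.fp / total if total else 0.0


# Toy read: two sequencing errors, one fixed, one missed, one new error introduced.
example = classify(truth="ACGTACGT", raw="ACGAACTT", corrected="ACGTACTA")
```

Under these assumed definitions, a conservative corrector such as the one described would be expected to trade some sensitivity for a lower false positive rate, since it leaves ambiguous bases (for example, in repetitive regions) unchanged.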
