Conference: International Conference on Bioinformatics and Computational Biology

ParSECH: Parallel Sequencing Error Correction with Hadoop for Large-Scale Genome Sequences

Abstract

A scalable and accurate error correction tool is essential for all next-generation sequencing (NGS) projects, as high-throughput sequencing machines have started producing terabytes of data with significantly higher error rates than conventional Sanger sequencing. In this paper, we develop ParSECH, a scalable and fully distributed error correction tool based on k-mer spectrum analysis that does not require a reference genome. To achieve high scalability over terabytes of data and hundreds of cores, ParSECH utilizes two open-source big data frameworks: Hadoop and Hazelcast. To achieve high accuracy, unlike existing error correction tools that use a single k-mer coverage cutoff to detect errors, ParSECH determines the skewness of the k-mer coverage of each individual read and then corrects the errors in each read separately for low- and high-coverage regions of the genome. We demonstrate the scalability of ParSECH by correcting the errors of both simulated and real whole human genome data with coverage ranging from 2x to 40x. ParSECH can correct the largest dataset (a 452 GB human genome), which could not be handled by the existing error correction tools, in about 39 hours. For a small E. coli genome dataset, ParSECH demonstrates 94% accuracy, higher than the 90% accuracy of Quake.
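
The abstract does not give the formulas or the distributed design in detail, so the following is only a minimal single-machine sketch of the two ideas it names: building a k-mer spectrum from the reads, and measuring how skewed the k-mer coverage is within one read. The function names, the use of the standardized third moment as the skewness measure, and the fixed `cutoff` in `weak_kmer_positions` are all assumptions made for illustration; they are not taken from ParSECH, which runs on Hadoop and Hazelcast.

```python
from collections import Counter
from statistics import mean, pstdev


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_coverage_skewness(read, counts, k):
    """Skewness (standardized third moment) of one read's k-mer coverage.

    Assumption: the paper's exact skewness definition is not in the
    abstract; this is the textbook population skewness.
    """
    cov = [counts[km] for km in kmers(read, k)]
    if len(cov) < 2:
        return 0.0
    mu, sigma = mean(cov), pstdev(cov)
    if sigma == 0:
        return 0.0
    return sum(((c - mu) / sigma) ** 3 for c in cov) / len(cov)


def weak_kmer_positions(read, counts, k, cutoff):
    """Start positions of k-mers whose coverage falls below a cutoff.

    Hypothetical helper: a per-read cutoff would be chosen from the
    read's own coverage profile rather than passed in as a constant.
    """
    return [i for i, km in enumerate(kmers(read, k)) if counts[km] < cutoff]


if __name__ == "__main__":
    # Five identical reads plus one read with a single substituted base.
    reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
    spectrum = kmer_spectrum(reads, k=4)
    noisy = reads[-1]
    print(read_coverage_skewness(noisy, spectrum, k=4))
    print(weak_kmer_positions(noisy, spectrum, k=4, cutoff=2))
```

In this toy input, the k-mers that overlap the substituted base occur only once, so they stand out against the well-covered k-mer at the start of the read. A distributed version would shard the counting step across workers, which is the part the abstract attributes to Hadoop and Hazelcast.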
