Asia Pacific Bioinformatics Conference

Improving the sensitivity of long read overlap detection using grouped short k-mer matches

Abstract

Background: Single-molecule, real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio), produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assemblies, reveal structural variations, and characterize intra-species variation. It also holds promise for deciphering the structure of complex microbial communities, because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads that overlap. Because PacBio data has a higher sequencing error rate and lower coverage than popular short-read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps, or overlaps in which both reads have high error rates. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies.

Results: In this work, we designed and implemented GroupK, an overlap detection program for third-generation sequencing reads based on grouped k-mer hits. While several existing programs use k-mer hits to detect overlaps between reads, our method uses a group of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search; we are the first to apply them to long read overlap detection. Experimental results from applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets with low sequencing coverage.

Conclusions: GroupK is best used for detecting small overlaps in third-generation sequencing data. It is a useful supplement to existing tools for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
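
The idea of grouping short k-mer hits under distance constraints can be illustrated with a minimal Python sketch. This is not the GroupK implementation: the function names, the greedy chaining strategy, and the fixed thresholds (`max_gap`, `max_diag_shift`, `min_group_size`) are all assumptions made for illustration, whereas the abstract states that GroupK derives its distance constraints statistically.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def kmer_hits(read_a, read_b, k):
    """Return sorted (pos_a, pos_b) pairs where the two reads share a k-mer."""
    index_b = kmer_index(read_b, k)
    hits = []
    for i in range(len(read_a) - k + 1):
        for j in index_b.get(read_a[i:i + k], ()):
            hits.append((i, j))
    return sorted(hits)


def group_hits(hits, max_gap, max_diag_shift):
    """Greedily chain sorted hits into groups.

    A hit joins the first group whose last hit lies at most max_gap
    positions earlier on read A, precedes it on read B, and whose
    diagonal (pos_a - pos_b) differs by at most max_diag_shift.
    Both thresholds are illustrative, not statistically derived.
    """
    groups = []
    for hit in hits:
        placed = False
        for group in groups:
            last_a, last_b = group[-1]
            gap = hit[0] - last_a
            shift = abs((hit[0] - hit[1]) - (last_a - last_b))
            if 0 < gap <= max_gap and hit[1] > last_b and shift <= max_diag_shift:
                group.append(hit)
                placed = True
                break
        if not placed:
            groups.append([hit])
    return groups


def has_candidate_overlap(read_a, read_b, k=5, max_gap=20,
                          max_diag_shift=3, min_group_size=3):
    """Report a candidate overlap if any group has enough k-mer hits."""
    groups = group_hits(kmer_hits(read_a, read_b, k), max_gap, max_diag_shift)
    return any(len(group) >= min_group_size for group in groups)
```

In this sketch, a single shared k-mer is weak evidence of overlap, but several short hits that stay close to one diagonal and close to each other are much less likely to occur by chance. That is the intuition behind preferring a group of short hits over one long exact match when error rates are high.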