Conference: IEEE International Conference on Bioinformatics and Biomedicine

de novo repeat detection based on the third generation sequencing reads



Repetitive sequences are fragments that appear at more than one location in a genome. Numerous studies have shown that repetitive sequences play indispensable roles in the evolution, inheritance, variation, gene expression, transcriptional regulation, chromosome construction, and physiological metabolism of organisms. In many sequence and genome analyses, such as read alignment, de novo assembly, and genome annotation, repetitive sequences pose major challenges. Detecting and classifying repeats is therefore one of the main steps of genome sequence analysis in bioinformatics. However, most existing de novo detection methods struggle to mark repetitive regions satisfactorily in terms of both size and accuracy, because next-generation sequencing (NGS) reads are too short to identify long repeats and raw single-molecule sequencing (SMS) long reads have high error rates. In this study, we present a new de novo repeat detection method based on PacBio long reads, called DLR (Detection of Long Repeats). DLR first decomposes all long reads into unique k-mers of a fixed length and selects the k-mers that occur with high frequency. These high-frequency k-mers are then aligned to the long reads using multiple sequence alignment, and the regions of the long reads covered by high-frequency k-mers are recorded as high-frequency regions. Finally, recorded high-frequency regions that contain one another are merged to obtain the final repetitive sequences. The experimental results show that DLR achieves the best results in terms of effective size and accuracy among the existing algorithms compared.
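The three stages described in the abstract (k-mer counting, marking covered regions, merging contained regions) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It makes several simplifying assumptions: exact string matching stands in for the multiple sequence alignment step, so it would not tolerate the error rates of raw PacBio reads; "merging regions with inclusion relations" is interpreted as dropping any candidate that is a substring of a longer one; and the function names, the k-mer length `k`, and the threshold `min_count` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def covered_regions(read, frequent, k):
    """Return half-open (start, end) intervals covered by frequent k-mers.

    Assumption: exact matching replaces the alignment step of the paper.
    Overlapping or adjacent k-mer hits are fused into one interval.
    """
    regions = []
    for i in range(len(read) - k + 1):
        if read[i:i + k] in frequent:
            if regions and i <= regions[-1][1]:
                regions[-1][1] = i + k  # extend the current interval
            else:
                regions.append([i, i + k])  # start a new interval
    return [tuple(r) for r in regions]


def merge_contained(sequences):
    """Drop every candidate that is contained in a longer kept candidate."""
    kept = []
    # Longest first, ties broken alphabetically for a deterministic result.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in other for other in kept):
            kept.append(seq)
    return kept


def detect_repeats(reads, k, min_count):
    """Run the three stages and return candidate repeat sequences."""
    frequent = high_frequency_kmers(count_kmers(reads, k), min_count)
    candidates = []
    for read in reads:
        for start, end in covered_regions(read, frequent, k):
            candidates.append(read[start:end])
    return merge_contained(candidates)


# Toy error-free reads sharing the repeat GATTACAGG between unique flanks.
example_reads = [
    "TTCCGATTACAGGAAGC",
    "CGCAGATTACAGGTCTT",
    "AGAGGATTACAGGCCTA",
]
print(detect_repeats(example_reads, k=5, min_count=3))  # ['GATTACAGG']
```

In this toy input only the five 5-mers inside the shared fragment reach the threshold of three occurrences, so each read contributes the same covered region and the merge step reduces the three copies to one sequence. On real long reads the threshold would presumably have to scale with sequencing depth, and the exact-match step would need to be replaced by an error-tolerant alignment.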


