Informatica: An International Journal of Computing and Informatics

Alignment-free Sequence Searching over Whole Genomes Using 3D Random Plot of Query DNA Sequences



Abstract

Most genomic data studies rely on sequence comparison and search, and comparison models based on alignment algorithms are the most commonly used. Alignment is very accurate, but it is practical only when the query is short, on the order of kilobytes, because it requires quadratic time and space, O(n^2), where n is the length of the target and query sequences. With the development of Next Generation Sequencing techniques, whole-genome sequence data of megabyte size are being actively studied, and new comparison and search methods for large-scale sequence data are needed. We propose a new alignment-free sequence comparison and search method to overcome the limitations of the alignment-based model. In this graphical model, the sequence search problem on DNA strings is reduced to finding parts of a geometric object within a relatively small geometric space. When sequences of similar length are modified and then compared, the model accurately reflects their degree of similarity, which confirms that the comparison model is appropriate. When searching a 200 MB whole-genome sequence using the compressed coordinate information, the model was able to search 10 MB sequences in 22 s, a much shorter time than alignment requires. Although it cannot locate the exact position at base-pair resolution as an alignment result does, the model can be used as a preprocessing step to quickly search whole-genome sequences several hundred megabytes in size.
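The abstract does not spell out how a DNA string becomes a geometric object, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical mapping of each base to a fixed 3D step (`STEPS`), builds the cumulative walk, keeps every `stride`-th coordinate as a stand-in for "compressed coordinate information", and compares two equal-length sequences by the mean distance between their compressed walks. All names and parameters here are illustrative assumptions.

```python
from math import dist

# Hypothetical base-to-step mapping (an assumption, not taken from the paper):
# the four bases point to alternating cube corners (a tetrahedron), so no two
# single steps cancel each other out.
STEPS = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def walk_3d(seq):
    """Return the list of cumulative 3D coordinates visited by the sequence."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = STEPS[base]  # raises KeyError on non-ACGT symbols
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def compress(points, stride):
    """Keep every `stride`-th coordinate as a coarse outline of the walk."""
    return points[stride - 1::stride]


def walk_distance(seq_a, seq_b, stride=4):
    """Mean Euclidean distance between the compressed walks of two
    equal-length sequences; 0.0 means the outlines coincide."""
    a = compress(walk_3d(seq_a), stride)
    b = compress(walk_3d(seq_b), stride)
    if len(a) != len(b) or not a:
        raise ValueError("sequences must have equal length >= stride")
    return sum(dist(p, q) for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    original = "ACGTACGT"
    mutated = "ACGTACGA"  # last base changed
    print(walk_distance(original, original))
    print(walk_distance(original, mutated))
```

Comparing only the sampled coordinates costs time linear in the sequence length, which illustrates why a geometric summary can be far cheaper than filling an O(n^2) alignment table. As the abstract notes for the actual model, such a summary cannot report positions at base-pair resolution.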
