生物序列的k-mer频次统计是生物信息处理中一个非常基础且重要的问题.本文针对多序列在对齐模式下,不同偏移处一段长度范围内的k-mer频次统计问题进行了研究.提出了一种逆向遍历k-mer计数算法BTKC.该算法能够充分利用长度的k-mer统计信息,快速得到长度的k-mer统计信息,从而避免了统计任意长度的k-mer频次信息时都需要对所有序列进行遍历.算法的时间复杂度分析及实验结果表明,相比于传统的前向遍历 FTKC算法, BTKC算法性能提升非常明显,且其时间复杂度与k-mer长度的变化范围无关,非常适合于在k-mer长度变化范围较大的情况下使用.%K-mer counting of biological sequence is a fundamental and very important problem in biological information processing. This paper focuses on counting k-mers at each position of multiple sequences within aligned mode. We present a new backward traverse k-mer counting algorithm called BTKC. BTKC algorithm takes full advantage of the k+1-mer’s statistic information to obtain k-mer’s statistic information quickly. Thus, it’s no need to traverse the whole sequences when counting each single k-mer. Both the algorithm’s time complexity and experiment results show that BTKC gets an obvious improvement compared with forward traverse k-mer counting algorithm FTKC, and its time complexity was found not to be realted with the range of k-mer length.
展开▼