首页> 外文期刊>Bioinformatics >Compressed suffix tree - a basis for genome-scale sequence analysis
【24h】

Compressed suffix tree - a basis for genome-scale sequence analysis

机译:压缩后缀树-基因组规模序列分析的基础

获取原文
获取原文并翻译 | 示例
       

摘要

Suffix tree is one of the most fundamental data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, often the main memory size becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from alphabet Sigma = {A, C, G, T} can be stored in n log vertical bar Sigma vertical bar = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches factor 50. We provide an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log vertical bar Sigma vertical bar) bits, and supports all typical suffix tree operations with at most log n factor slowdown. Our experiments show that, e. g. on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of normal suffix tree. Typical operations are slowed down by factor 60. Availability: The C++ implementation under GNU license is available at http://www.cs.helsinki.fi/group/suds/cst/. An example program implementing a typical pattern discovery task is included. Experimental results in this note correspond to version 0.95. Contact: vmakinen@cs.helsinki.fi.
机译:后缀树是字符串算法和生物序列分析中最基本的数据结构之一。不幸的是,当要实现这些算法并将其应用于实际的基因组序列时,主内存大小通常会成为瓶颈。这很容易解释,因为可以将n个字母Sigma = {A,C,G,T}的长度为n的DNA序列存储在n log竖线Sigma竖线= 2n位中,但其后缀树占用O(n log n)位。实际上,大小差异很容易达到50倍。我们提供了Sadakane(计算系统理论,印刷中)最近提出的压缩后缀树的实现。压缩后缀树占用的空间与文本大小成正比,即O(n log垂直条Sigma垂直条)位,并支持所有典型的后缀树操作,最多可降低log n因子。我们的实验表明, G。在10 MB的DNA序列上,压缩后缀树占用了正常后缀树的10%的空间。典型操作的速度降低了60倍。可用性:可通过http://www.cs.helsinki.fi/group/suds/cst/获得GNU许可下的C ++实现。包括实现典型模式发现任务的示例程序。本说明中的实验结果与0.95版相对应。联系人:vmakinen@cs.helsinki.fi。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号