首页> 外文期刊>Journal of computational biology: A journal of computational molecular cell biology >A space-efficient construction of the Burrows-Wheeler transform for genomic data
【24h】

A space-efficient construction of the Burrows-Wheeler transform for genomic data

机译:基因组数据的Burrows-Wheeler转换的节省空间的构造

获取原文
获取原文并翻译 | 示例
       

摘要

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows - Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used ( 1+ 2 log(2) |Sigma|)( 1+ epsilon) bits per character ( where epsilon is a small fraction). The construction algorithm ran in as much as twice that space, in O(|Sigma| n log(n)) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses 1/2 ( 1+|Sigma|)( 1+ epsilon) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
机译:精确字符串匹配的算法在计算生物学中具有重要的应用。支持后继树和后缀数组等各种精确的字符串匹配查询的省时数据结构已应用于此类问题。随着序列数据库的增长,更节省空间的精确匹配方法变得越来越重要。一种这样的数据结构,即基于Burrows-Wheeler变换的压缩后缀数组(CSA),显示需要的存储空间几乎等于原始数据库的存储空间,同时又能有效地支持常见的查询问题。但是,从有效的空间和时间序列中构建CSA颇具挑战性。在2002年,提出了第一个节省空间的CSA构造算法。该实现每个字符使用(1+ 2 log(2)| Sigma |)(1+ epsilon)位(其中epsilon是一小部分)。构造算法在O(| Sigma | n log(n))时间内运行的空间是该空间的两倍。我们创建了一个也可以实现这些渐近边界的实现,但是对于小字母,每个字符仅使用1/2(1+ | Sigma |)(1+ epsilon)位,因此核苷酸字母的空间减少了2倍。我们提供了CSA构建的时间和空间结果,并在公开的基因组数据上查询了我们的实现方式,这证明了该方法的实用性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号