Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics

A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences

Abstract

Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support, and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only $2 + 8 \cdot 10^{-6}$ bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
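The abstract does not describe GtEncseq's internal layout or its API, so the following is only a minimal sketch of the general idea behind a figure close to 2 bits per character: pack each of A, C, G, T into 2 bits and keep runs of wildcards in a separate sorted list of ranges. Every name here (`PackedSeq`, `WildcardRange`, `packed_seq_new`, `packed_seq_get`, `packed_seq_delete`) is hypothetical and not taken from the library. The sketch also assumes a DNA alphabet, folds lowercase to uppercase, and reports every wildcard as `N`, which is simpler than the variable alphabets and character transformations the abstract mentions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; this is NOT the GtEncseq API. */

typedef struct {
  size_t start;  /* first position of a run of wildcards */
  size_t length; /* number of consecutive wildcards in the run */
} WildcardRange;

typedef struct {
  uint8_t *packed;       /* four characters per byte, 2 bits each */
  WildcardRange *ranges; /* sorted by start, non-overlapping */
  size_t num_ranges;
  size_t length;         /* total number of characters stored */
} PackedSeq;

/* Map a nucleotide to its 2-bit code; -1 marks a wildcard (assumption:
   anything outside A, C, G, T counts as a wildcard). */
static int base_code(char c)
{
  switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default: return -1;
  }
}

void packed_seq_delete(PackedSeq *ps)
{
  if (ps == NULL) return;
  free(ps->packed);
  free(ps->ranges);
  free(ps);
}

PackedSeq *packed_seq_new(const char *seq)
{
  size_t i, runs = 0, n = strlen(seq);
  PackedSeq *ps = malloc(sizeof *ps);
  if (ps == NULL) return NULL;
  /* First pass: count wildcard runs so the range table has an exact size. */
  for (i = 0; i < n; i++) {
    if (base_code(seq[i]) < 0 && (i == 0 || base_code(seq[i - 1]) >= 0))
      runs++;
  }
  ps->length = n;
  ps->num_ranges = 0;
  ps->packed = calloc(n / 4 + 1, 1); /* zeroed, so wildcard slots hold 0 */
  ps->ranges = malloc((runs + 1) * sizeof *ps->ranges);
  if (ps->packed == NULL || ps->ranges == NULL) {
    packed_seq_delete(ps);
    return NULL;
  }
  /* Second pass: pack bases and record wildcard runs. */
  for (i = 0; i < n; i++) {
    int code = base_code(seq[i]);
    if (code >= 0) {
      ps->packed[i / 4] |= (uint8_t) (code << (2 * (i % 4)));
    } else if (ps->num_ranges > 0 &&
               ps->ranges[ps->num_ranges - 1].start +
               ps->ranges[ps->num_ranges - 1].length == i) {
      ps->ranges[ps->num_ranges - 1].length++; /* extend the current run */
    } else {
      ps->ranges[ps->num_ranges].start = i;    /* open a new run */
      ps->ranges[ps->num_ranges].length = 1;
      ps->num_ranges++;
    }
  }
  return ps;
}

/* Random access: binary search over the wildcard runs, then unpack 2 bits.
   The caller must ensure pos < ps->length. */
char packed_seq_get(const PackedSeq *ps, size_t pos)
{
  size_t lo = 0, hi = ps->num_ranges;
  assert(pos < ps->length);
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    const WildcardRange *r = &ps->ranges[mid];
    if (pos < r->start)
      hi = mid;
    else if (pos >= r->start + r->length)
      lo = mid + 1;
    else
      return 'N'; /* the original wildcard symbol is not preserved here */
  }
  return "ACGT"[(ps->packed[pos / 4] >> (2 * (pos % 4))) & 3];
}
```

In this sketch, building a `PackedSeq` from `"ACGTNNNNacgtRA"` yields two wildcard runs (positions 4 to 7 and position 12), and `packed_seq_get` reads the sequence back as `ACGTNNNNACGTNA`. The cost above 2 bits per character depends only on the number of wildcard runs, not on the number of wildcard characters, which is why long runs of wildcards are cheap to store. For scale, plain 2-bit packing of 3.1 × 10^9 characters occupies about 775 MB.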
