A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences

Steinbiss Sascha; Kurtz Stefan

Abstract

Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support, and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only $2 + 8\cdot 10^{-6}$ bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
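The idea of storing DNA at roughly two bits per character while still supporting wildcards can be illustrated with a small sketch. This is not the GtEncseq implementation or its API (which is a C library); the class name `TwoBitSeq`, its fields, and the choice to keep wildcard runs in a sorted side table are assumptions made only for illustration. Any character other than A, C, G, T (including lowercase letters) is treated as a wildcard here.

```python
from bisect import bisect_right

# 2-bit codes for the four DNA bases; everything else counts as a wildcard.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


class TwoBitSeq:
    """Hypothetical 2-bit DNA store with wildcard runs kept in a side table."""

    def __init__(self, seq):
        self.length = len(seq)
        self.packed = bytearray((len(seq) + 3) // 4)  # four bases per byte
        self.run_starts = []  # start position of each wildcard run
        self.run_ends = []    # end position (exclusive) of each wildcard run
        self.run_chars = []   # wildcard character used in each run
        for i, ch in enumerate(seq):
            code = CODE.get(ch)
            if code is None:
                # Extend the previous run if it is adjacent and uses the same character.
                if (self.run_ends and self.run_ends[-1] == i
                        and self.run_chars[-1] == ch):
                    self.run_ends[-1] = i + 1
                else:
                    self.run_starts.append(i)
                    self.run_ends.append(i + 1)
                    self.run_chars.append(ch)
                code = 0  # placeholder bits; the side table takes precedence
            self.packed[i // 4] |= code << (2 * (i % 4))

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search for the last wildcard run that starts at or before i.
        k = bisect_right(self.run_starts, i) - 1
        if k >= 0 and i < self.run_ends[k]:
            return self.run_chars[k]
        return BASES[(self.packed[i // 4] >> (2 * (i % 4))) & 3]


seq = "ACGTNNNNACGTNA"
enc = TwoBitSeq(seq)
decoded = "".join(enc[i] for i in range(len(enc)))
```

In this sketch the per-character cost is two bits plus a side table whose size depends on the number of wildcard runs, not on the number of wildcard characters, which is one way long stretches of wildcards can be stored cheaply. As a rough check of the figure quoted in the abstract, taking 3.1 gigabases as 3.1e9 characters, 2 bits per character is 6.2e9 bits (775 MB in decimal units), and the additional 8e-6 bits per character amount to about 24,800 bits (roughly 3.1 kB).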

Bibliographic details

Source
Computational Biology and Bioinformatics, IEEE/ACM Transactions on | 2012, Issue 2 | pp. 345-357 | 13 pages
Authors
Steinbiss Sascha; Kurtz Stefan
Author affiliation
University of Hamburg, Hamburg
Original format: PDF
Language: English
Keywords
Data storage representations; biology and genetics; reusable libraries; software engineering

