Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences

Kouichi Kimura; Yutaka Suzuki; Sumio Sugano; Asako Koike

Abstract

We have developed efficient in-practice algorithms for computing rank and select functions on a binary string, based on a novel data structure, a hierarchical binary string with hierarchical accumulatives. It efficiently stores decomposed information on partial summations over various scales of subregions of a given binary string, so that the required space overhead ratio is only about 3.5% irrespective of the string length. Values of rank and select functions are computed hierarchically in (log₂n)/8 iterations, where n is the string length. For example, for an unbiased random binary string of 64G bits, each value of these functions can be computed in about a microsecond, on average, on a single 3.0-GHz CPU using 8+ GB of memory. We also present their applications to genome mapping problems for large-scale short-read DNA sequence data, especially produced by ultra-high-throughput new-generation DNA sequencers. The algorithms are applied to the binarization of the Burrows-Wheeler transform of the human genome DNA sequence. For the sake of high-speed performance, we adopted a somewhat stringent mapping condition that allows at most a single-base mismatch (either a substitution, insertion, or deletion of a single base) per query sequence. An experimentally implemented program mapped several thousands of sequences per second on a single 3.0-GHz CPU, several times faster than ELAND, a widely used mapping program with the Illumina-Solexa 1G analyser.
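For readers unfamiliar with the two functions, the following is a minimal Python sketch of rank and select on a binary string, using two levels of precomputed cumulative counts. It is not the authors' data structure: the abstract does not specify the layout of the hierarchical accumulatives, so the class name `RankSelect`, the block sizes `SMALL` and `PER_SUPER`, and the fixed two-level design are all assumptions made for illustration.

```python
from bisect import bisect_left


class RankSelect:
    """Bit vector with a two-level directory of cumulative 1-counts.

    Hypothetical illustration only; block sizes are arbitrary assumptions.
    """

    SMALL = 64      # bits per small block (assumed)
    PER_SUPER = 8   # small blocks per superblock (assumed)

    def __init__(self, bits):
        self.bits = list(bits)
        self.n = len(self.bits)
        big = self.SMALL * self.PER_SUPER
        self.super_cum = []  # 1s before each superblock
        self.small_cum = []  # 1s from superblock start to each small block
        total = 0
        in_super = 0
        for start in range(0, self.n, self.SMALL):
            if start % big == 0:
                self.super_cum.append(total)
                in_super = 0
            self.small_cum.append(in_super)
            ones = sum(self.bits[start:start + self.SMALL])
            total += ones
            in_super += ones
        self.total = total

    def rank1(self, i):
        """Number of 1s in bits[0:i], for 0 <= i <= n."""
        if not 0 <= i <= self.n:
            raise IndexError("rank position out of range")
        j = i // self.SMALL
        if j == len(self.small_cum):
            # i == n and n is a multiple of SMALL (or the string is empty)
            return self.total
        base = self.super_cum[j // self.PER_SUPER] + self.small_cum[j]
        # Finish with a short scan inside one small block.
        return base + sum(self.bits[j * self.SMALL:i])

    def select1(self, k):
        """Index of the k-th 1, with k counted from 1."""
        if not 1 <= k <= self.total:
            raise ValueError("k out of range")
        # Last superblock with fewer than k ones before it.
        s = bisect_left(self.super_cum, k) - 1
        rem = k - self.super_cum[s]
        lo = s * self.PER_SUPER
        hi = min(lo + self.PER_SUPER, len(self.small_cum))
        # Last small block in that superblock with fewer than rem ones before it.
        j = bisect_left(self.small_cum, rem, lo, hi) - 1
        rem -= self.small_cum[j]
        for pos in range(j * self.SMALL, min((j + 1) * self.SMALL, self.n)):
            if self.bits[pos]:
                rem -= 1
                if rem == 0:
                    return pos
        raise AssertionError("unreachable when the directories are consistent")


rs = RankSelect([1, 0, 1, 1, 0])
print(rs.rank1(3))    # 2: two 1s among the first three bits
print(rs.select1(3))  # 3: the third 1 sits at index 3
```

Each query reads one entry per directory level and then scans a single small block, which is the general idea behind computing the functions level by level. As a rough check on the abstract's figures, taking 64G bits as 2^36 bits gives (log₂n)/8 = 36/8 = 4.5, and the raw string occupies 8 GB, so a 3.5% overhead adds roughly 0.28 GB, in line with the stated 8+ GB of memory.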


Bibliographic details

Source
Journal of Computational Biology, 2009, Issue 11, pp. 1601-1613 (13 pages)
Authors
Kouichi Kimura; Yutaka Suzuki; Sumio Sugano; Asako Koike
Author affiliations

Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan.

Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.

Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.

Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.

Original format: PDF
Language: English

