首页> 外文期刊>Journal of Computational Biology >Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences
【24h】

Computation of Rank and Select Functions on Hierarchical Binary String and Its Application to Genome Mapping Problems for Short-Read DNA Sequences

机译:分级二进制字符串上的秩和选择函数的计算及其在短读DNA序列的基因组定位问题中的应用

获取原文
获取原文并翻译 | 示例
           

摘要

We have developed efficient in-practice algorithms for computing rank and select functions on a binary string, based on a novel data structure, a hierarchical binary string with hierarchical accumulatives. It efficiently stores decomposed information on partial summations over various scales of subregions of a given binary string, so that the required space overhead ratio is only about 3.5% irrespective of the string length. Values of rank and select functions are computed hierarchically in (log2n)/8 iterations, where n is the string length. For example, for an unbiased random binary string of 64G bits, each value of these functions can be computed in about a microsecond, on average, on a single 3.0-GHz CPU using 8+ GB of memory. We also present their applications to genome mapping problems for large-scale short-read DNA sequence data, especially produced by ultra-high-throughput new-generation DNA sequencers. The algorithms are applied to the binarization of the Burrows-Wheeler transform of the human genome DNA sequence. For the sake of high-speed performance, we adopted a somewhat stringent mapping condition that allows at most a single-base mismatch (either a substitution, insertion, or deletion of a single base) per query sequence. An experimentally implemented program mapped several thousands of sequences per second on a single 3.0-GHz CPU, several times faster than ELAND, a widely used mapping program with the Illumina-Solexa 1G analyser.
机译:我们已经开发了一种有效的实践算法,用于基于一种新颖的数据结构,具有层次累加的层次二进制字符串来计算二进制字符串上的等级和选择函数。它有效地存储了在给定二进制字符串的各个子区域的各个尺度上的部分求和的分解信息,因此所需的空间开销比率仅为3.5%,而与字符串长度无关。 rank和select函数的值是在(log 2 n)/ 8次迭代中分层计算的,其中n是字符串长度。例如,对于64G位的无偏随机二进制字符串,在使用8 GB以上内存的单个3.0 GHz CPU上,这些函数的每个值平均可在大约微秒内计算出来。我们还介绍了它们在大规模短读DNA序列数据,尤其是超高通量新一代DNA测序仪产生的基因组作图问题中的应用。该算法适用于人类基因组DNA序列的Burrows-Wheeler转换的二值化。为了实现高速性能,我们采用了某种严格的映射条件,每个查询序列最多允许单个碱基不匹配(单个碱基的替换,插入或删除)。通过实验实现的程序在单个3.0 GHz CPU上每秒可以映射数千个序列,比使用Illumina-Solexa 1G分析仪广泛使用的映射程序ELAND快几倍。

著录项

  • 来源
    《Journal of Computational Biology》 |2009年第11期|1601-1613|共13页
  • 作者单位

    Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan.;

    Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.;

    Department of Medical Genome Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.;

    Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.;

  • 收录信息
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号