首页> 外文学位 >All-pair comparison of billion-base genome sequences.
【24h】

All-pair comparison of billion-base genome sequences.

机译:十亿个碱基的基因组序列的全对比较。

获取原文
获取原文并翻译 | 示例

摘要

The LSH-ALL-PAIRS algorithm is an important method for comparison of genomic DNA sequences in order to find conserved genome features (i.e., subsequences of genomic sequence with d bases, called d-mers) across different species (e.g., Escherichia coli and Mycobacterium tuberculosis ). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH), which is first applied to all the d-mers. After the hashing step, d-mers with the same hash value are grouped together into a class and pair-wise comparison is performed to find similar d-mers up to a certain number of mismatched bases. Instead of performing pair-wise comparison on potentially very large number of the input d-mers, LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual group. However, since the computational complexity for pair-wise comparison within each class is still O(N2), where N is the number of d-mers in each class, the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequence with billions of bases. In this thesis, a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the message passing interface (MPI) and runs on high performance servers, computing clusters, and supercomputers. The PARALLEL-LSH-ALL-PAIRS algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then, d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally, each process performs pair-wise comparisons of the assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on the RPI's IBM Blue Gene/Q supercomputer.
机译:LSH-ALL-PAIRS算法是比较基因组DNA序列的重要方法,以便找到不同物种(例如大肠杆菌和分枝杆菌)的保守基因组特征(即具有d碱基的d碱基的基因组序列的子序列,称为d-mers)。结核)。该算法基于一种称为局部敏感哈希(LSH)的随机搜索技术,该技术首先应用于所有d-mers。在散列步骤之后,将具有相同散列值的d-mer分组到一个类中,并进行成对比较,以找到直到一定数目的不匹配碱基的相似d-mer。 LSH-ALL-PAIRS算法不是对可能非常大量的输入d-mer执行成对比较,而是将d-mer分为几类,并在每个单独的组上执行成对比较。但是,由于每个类中成对比较的计算复杂度仍为O(N2),其中N是每个类中d-mer的数目,因此顺序LSH-ALL-PAIRS算法无法处理数十亿个非常长的基因组序列基地。本文提出了LSH-ALL-PAIRS算法的并行实现。 PARALLEL-LSH-ALL-PAIRS算法基于消息传递接口(MPI),并且在高性能服务器,计算集群和超级计算机上运行。 PARALLEL-LSH-ALL-PAIRS算法首先使用过程池来并行评估数十亿个基因组子序列的哈希函数。然后,使用MPI通信将具有相同哈希值的d-mer分组在一起,并在所有进程之间重新分配。最后,每个过程对指定的子序列进行成对比较,并输出相似对的组。实验表明,在RPI的IBM Blue Gene / Q超级计算机上,随着内核数量的增加和输入数据大小的增加,PARALLEL-LSH-ALL-PAIRS算法实现了良好的可伸缩性。

著录项

  • 作者

    Li, Haiqiong.;

  • 作者单位

    Rensselaer Polytechnic Institute.;

  • 授予单位 Rensselaer Polytechnic Institute.;
  • 学科 Computer Science.;Biology Bioinformatics.
  • 学位 M.S.
  • 年度 2013
  • 页码 27 p.
  • 总页数 27
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号