All-pair comparison of billion-base genome sequences.

机译：十亿个碱基的基因组序列的全对比较。

获取原文

获取原文并翻译 | 示例

页面导航

摘要
著录项
相似文献
相关主题

摘要

The LSH-ALL-PAIRS algorithm is an important method for comparison of genomic DNA sequences in order to find conserved genome features (i.e., subsequences of genomic sequence with d bases, called d-mers) across different species (e.g., Escherichia coli and Mycobacterium tuberculosis ). The algorithm is based on a randomized search technique called locality-sensitive hashing (LSH), which is first applied to all the d-mers. After the hashing step, d-mers with the same hash value are grouped together into a class and pair-wise comparison is performed to find similar d-mers up to a certain number of mismatched bases. Instead of performing pair-wise comparison on potentially very large number of the input d-mers, LSH-ALL-PAIRS algorithm divides the d-mers into classes and performs pair-wise comparison on each individual group. However, since the computational complexity for pair-wise comparison within each class is still O(N2), where N is the number of d-mers in each class, the sequential LSH-ALL-PAIRS algorithm cannot process very long genomic sequence with billions of bases. In this thesis, a parallel implementation of the LSH-ALL-PAIRS algorithm is proposed. The PARALLEL-LSH-ALL-PAIRS algorithm is based on the message passing interface (MPI) and runs on high performance servers, computing clusters, and supercomputers. The PARALLEL-LSH-ALL-PAIRS algorithm first uses a pool of processes to evaluate the hashing function on billions of genomic subsequences in parallel. Then, d-mers with the same hash value are grouped together and redistributed among all the processes using MPI communication. Finally, each process performs pair-wise comparisons of the assigned subsequences and outputs groups of similar pairs. Experiments show that the PARALLEL-LSH-ALL-PAIRS algorithm achieves good scalability with an increasing number of cores and increasing sizes of the input data on the RPI's IBM Blue Gene/Q supercomputer.

机译：LSH-ALL-PAIRS算法是比较基因组DNA序列的重要方法，以便找到不同物种（例如大肠杆菌和分枝杆菌）的保守基因组特征（即具有d碱基的d碱基的基因组序列的子序列，称为d-mers）。结核）。该算法基于一种称为局部敏感哈希（LSH）的随机搜索技术，该技术首先应用于所有d-mers。在散列步骤之后，将具有相同散列值的d-mer分组到一个类中，并进行成对比较，以找到直到一定数目的不匹配碱基的相似d-mer。 LSH-ALL-PAIRS算法不是对可能非常大量的输入d-mer执行成对比较，而是将d-mer分为几类，并在每个单独的组上执行成对比较。但是，由于每个类中成对比较的计算复杂度仍为O（N2），其中N是每个类中d-mer的数目，因此顺序LSH-ALL-PAIRS算法无法处理数十亿个非常长的基因组序列基地。本文提出了LSH-ALL-PAIRS算法的并行实现。 PARALLEL-LSH-ALL-PAIRS算法基于消息传递接口（MPI），并且在高性能服务器，计算集群和超级计算机上运行。 PARALLEL-LSH-ALL-PAIRS算法首先使用过程池来并行评估数十亿个基因组子序列的哈希函数。然后，使用MPI通信将具有相同哈希值的d-mer分组在一起，并在所有进程之间重新分配。最后，每个过程对指定的子序列进行成对比较，并输出相似对的组。实验表明，在RPI的IBM Blue Gene / Q超级计算机上，随着内核数量的增加和输入数据大小的增加，PARALLEL-LSH-ALL-PAIRS算法实现了良好的可伸缩性。

著录项

作者
Li, Haiqiong.;
展开▼
作者单位

Rensselaer Polytechnic Institute.;

展开▼
授予单位 Rensselaer Polytechnic Institute.;
学科 Computer Science.;Biology Bioinformatics.
学位 M.S.
年度 2013
页码 27 p.
总页数 27
原文格式 PDF
正文语种 eng
中图分类
关键词

相似文献

外文文献
中文文献
专利

1. The 135 kbp mitochondrial genome of Agaricus bisporus is the largest known eukaryotic reservoir of group I introns and plasmid-related sequences. (Special Issue: Agaricus bisporus - beyond the genome.) [J] . Ferandon C., Xu JianPing, Barroso G. Fungal Genetics and Biology . 2013,第Null期

机译：双孢蘑菇的135 kbp线粒体基因组是第一类内含子和质粒相关序列的最大已知真核贮藏库。（特刊：双孢蘑菇-基因组之外。）
2. All-pairs multiple comparisons based on the Cucconi test [J] . Murakami Hidetoshi AStA Advances in statistical analysis . 2016,第3期

机译：基于Cucconi检验的全对多重比较
3. Accelerating the Smith-Waterman algorithm with interpair pruning and band optimization for the all-pairs comparison of base sequences [J] . Daiki Okada, Fumihiko Ino, Kenichi Hagihara BMC Bioinformatics . 2015,第1期

机译：通过对间修剪和带优化来加速Smith-Waterman算法，以便对碱基序列进行全对比较
4. Identification and characterization of accessory genomes in bacterial species based on genome comparison and metagenomic recruitment [C] . Mingjie Wang, Haixu Tang, Yuzhen Ye IEEE International Conference on Bioinformatics and Biomedicine . 2014

机译：基于基因组比较和宏基因组募集，鉴定和鉴定细菌物种中的辅助基因组
5. A computational tool for the prediction of small non-coding RNA in genome sequences. [D] . Yu, Ning. 2009

机译：用于预测基因组序列中小的非编码RNA的计算工具。
6. Accelerating the Smith-Waterman algorithm with interpair pruning and band optimization for the all-pairs comparison of base sequences [O] . Daiki Okada, Fumihiko Ino, Kenichi Hagihara 2015

机译：通过对间修剪和带优化来加速Smith-Waterman算法以对碱基序列进行全对比较
7. Development of a real-time PCR for detection of Staphylococcus pseudintermedius using a novel automated comparison of whole-genome sequences. [O] . Koen M Verstappen, Loes Huijbregts, Mirlin Spaninks, 2017

机译：使用全基因组序列的新型自动比较开发用于检测假葡萄球菌的实时pCR。

All-pair comparison of billion-base genome sequences.

摘要

著录项

相似文献

相关主题

期刊订阅