RCSI: Scalable similarity search in thousand(s) of genomes



Abstract

Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k to a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for one to-be-searched genome and compute only approximate solutions, i.e., they miss some k-approximate occurrences. We present RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that was never studied before at this scale.
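The two notions the abstract relies on, referential compression and k-approximate search, can be illustrated with a minimal Python sketch. This is not RCSI's implementation: the abstract does not describe its data structures, so the function names (`compress`, `decompress`, `k_approximate_search`), the entry layout, and the naive quadratic matching below are all assumptions chosen for readability. The search shown is the textbook dynamic-programming approach over an uncompressed text, not the paper's search over the reference plus individual differences.

```python
from typing import List, Tuple

# Hypothetical entry layout: (start in reference, match length, next literal character).
Entry = Tuple[int, int, str]


def compress(reference: str, genome: str) -> List[Entry]:
    """Greedy referential compression: encode genome as matches against reference.

    Naive O(n * m) matching for illustration only; a real system would use an index.
    """
    entries: List[Entry] = []
    i = 0
    while i < len(genome):
        best_pos, best_len = 0, 0
        for pos in range(len(reference)):
            length = 0
            # Stop one character early so every entry can carry a literal.
            while (pos + length < len(reference)
                   and i + length < len(genome) - 1
                   and reference[pos + length] == genome[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        entries.append((best_pos, best_len, genome[i + best_len]))
        i += best_len + 1
    return entries


def decompress(reference: str, entries: List[Entry]) -> str:
    """Rebuild the genome from its referential entries."""
    parts: List[str] = []
    for pos, length, literal in entries:
        parts.append(reference[pos:pos + length])
        parts.append(literal)
    return "".join(parts)


def k_approximate_search(text: str, query: str, k: int) -> List[int]:
    """Return end positions (exclusive) of substrings of text within edit distance k of query."""
    m = len(query)
    prev = list(range(m + 1))  # column for the empty text prefix
    hits: List[int] = []
    for j, c in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: an occurrence may start at any text position
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == c else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra text character
                         cur[i - 1] + 1)      # missing query character
        if cur[m] <= k:
            hits.append(j)
        prev = cur
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTTGA"
    genome = "ACGTACCTTTGA"  # differs from the reference by one substitution
    entries = compress(reference, genome)
    print(entries)
    print(decompress(reference, entries) == genome)
    print(k_approximate_search(genome, "ACGT", 1))
```

Because the two sequences differ in a single position, the genome collapses to a handful of entries pointing into the reference, which is the similarity the abstract says RCSI exploits. The search function reports every end position, so neighbouring positions around one occurrence typically appear as separate hits.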


Bibliographic details

Source
International Conference on Very Large Data Bases | 2013 | 12 pages
Authors
Sebastian Wandelt; Johannes Starlinger; Marc Bux; Ulf Leser
Original format: PDF
Chinese Library Classification: TP311.13
Similar literature

1. A Decentralized Method for Scaling Up Genome Similarity Search Services [J]. Zhou B.B., Wang C., Zomaya Albert Y. IEEE Transactions on Parallel and Distributed Systems, 2009, No. 3.
2. Local Similarity Search to Find Gene Indicators in Mitochondrial Genomes [J]. Martin Middendorf, Matthias Bernt, Ruby L. V. Moritz. Biology, 2014, No. 1.
3. Genome-wide similarity search for transcription factors and their binding sites in a metal-reducing prokaryote Geobacter sulfurreducens [J]. Bin Yana, Derek R. Lovley, Julia Krushkal. BioSystems, 2007, No. 2.
4. RCSI: Scalable similarity search in thousand(s) of genomes [C]. Sebastian Wandelt, Johannes Starlinger, Marc Bux. International Conference on Very Large Data Bases, 2013.
5. Learning Effective Binary Representation with Deep Hashing Technique for Large-Scale Multimedia Similarity Search [D]. Wu, Gengshen. 2020.
6. Large-Scale Comparison of Alternative Similarity Search Strategies with Varying Chemical Information Contents [O]. Oliver Laufkötter, Tomoyuki Miyao, Jürgen Bajorath. 2019.
7. Database-integrated genome screening (DIGS): exploring genomes heuristically using sequence similarity search tools and a relational database [O]. Henan Zhu, Tristan Dennis, Joseph Hughes. 2018.
