Quality-Based Similarity Search for Biological Sequence Databases

Abstract

Low-complexity regions (LCRs) of biological sequences are the main source of false positives in similarity searches over biological sequence databases. Identifying LCRs in a sequence is difficult, and existing tools for doing so produce many false positives and false negatives. We consider the problem of finding similar sequences when LCRs cannot be located precisely. We develop an LCR-based formulation that measures the quality of each letter in a sequence. We show that these quality values can be used in two fundamental approaches to the sequence search problem, significantly reducing the number of false positives each produces. The first approach finds optimal alignments; the second computes a suboptimal alignment. For the second, we also develop a randomized, memory-resident hash table that indexes k-grams probabilistically, which greatly reduces memory usage and CPU cost. We also show that this hash table can be used to reconstruct query sequences with negligible information loss, which eliminates the need to store these sequences. Our experiments on real data show that our quality-based similarity search algorithms drastically reduce the number of false positives. In addition, their running times were better than those of existing strategies.
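
As a rough illustration of the two ideas the abstract names (a per-letter quality value and a hash table that indexes k-grams probabilistically), here is a minimal Python sketch. It is not the authors' formulation, which this record does not give. The function names `letter_quality` and `build_kgram_index`, the use of windowed Shannon entropy as the quality measure, the window size, and the rule "keep a k-gram with probability equal to its mean letter quality" are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def letter_quality(seq, window=12):
    """Hypothetical per-letter quality in [0, 1].

    Assumption: quality is the Shannon entropy of the window centred on the
    letter, normalised by the maximum entropy possible for that window length.
    Repetitive (low-complexity) windows score near 0.
    """
    half = window // 2
    qualities = []
    for i in range(len(seq)):
        chunk = seq[max(0, i - half): i + half + 1]
        counts = Counter(chunk)
        entropy = -sum((c / len(chunk)) * math.log2(c / len(chunk))
                       for c in counts.values())
        max_entropy = math.log2(len(chunk)) if len(chunk) > 1 else 1.0
        qualities.append(entropy / max_entropy)
    return qualities


def build_kgram_index(seq, k=3, rng=None, window=12):
    """Hypothetical in-memory k-gram index with probabilistic insertion.

    Assumption: each k-gram is stored with probability equal to the mean
    quality of its letters, so k-grams from low-complexity stretches are
    rarely indexed.
    """
    rng = rng or random.Random(0)  # seeded so the sketch is repeatable
    quality = letter_quality(seq, window)
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        keep_probability = sum(quality[start:start + k]) / k
        if rng.random() < keep_probability:
            index[seq[start:start + k]].append(start)
    return index
```

Under these assumptions a homopolymer run such as `"A" * 30` has quality 0 everywhere and contributes nothing to the index, while a short stretch of distinct letters is always indexed. A real system would need a tuned quality measure and a way to compensate for the k-grams it skips.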

Bibliographic details

Source: International Conference on Bioinformatics and Computational Biology, 2007, 7 pages
Authors: Xuehui Li; Tamer Kahveci
Original format: PDF
CLC classification: TP3-53

Similar literature

1. Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment [J]. Jerome Gracy... Bioinformatics. 1998, no. 2
2. Similarity-based subsequence search in image sequence databases [J]. Sanghyun Park, Wesley W. Chu. International Journal of Image and Graphics. 2003, no. 1
3. Efficient processing of similarity search under time warping in sequence databases: an index-based approach [J]. Sang-Wook Kim, Sanghyun Park, Wesley W. Chu. Information Systems. 2004, no. 5
4. Quality-Based Similarity Search for Biological Sequence Databases [C]. Xuehui Li, Tamer Kahveci. International Conference on Bioinformatics and Computational Biology. 2007
5. Sequence and structure similarity search in biological and XML databases [D]. Aghili, S. Alireza. 2005
6. Using homology relations within a database markedly boosts protein sequence similarity search [O]. Jing Tong, Ruslan I. Sadreyev, Jimin Pei. 2015
7. Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment [O]. J. Gracy, P. Argos. 1998
