Low-Complexity Regions (LCRs) of biological sequences are the main source of false positives in similarity searches for such sequence databases. Identifying LCRs in a sequence is a difficult task. Existing tools for identifying LCRs incur large amounts of false positives and false negatives. We consider the problem of finding similar sequences when LCRs are not located precisely. We develop an LCR-based formulation to measure the quality of each letter in a sequence. We show that the quality values can be employed in two fundamental approaches to the sequence search problem to reduce the number of false positives produced by them significantly. The former finds optimal alignments and the latter computes a suboptimal alignment. For the latter one, we also develop a randomized memory-resident hash table that indexes k-grams probabilistically. As a result, memory usage and CPU cost are greatly reduced. We also show that this hash table can be used to reconstruct query sequences with negligible information loss. This eliminates the need to store these sequences. Our experiments on real data show that our quality-based similarity search algorithms reduce the number of false positives drastically. In addition, their running times were better than the existing strategies.
展开▼