Background: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of k-mers in reads and validating those with frequencies exceeding a threshold. In genomes with high repeat content, an erroneous k-mer may be observed frequently if it differs by only a few nucleotides from valid k-mers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and the problem remains challenging for genomes with high repeat content.
Results: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer the genomic frequencies of k-mers from their observed frequencies by analyzing the misread relationships among observed k-mers. We also propose a method to estimate the threshold used to validate k-mers whose estimated genomic frequency exceeds it. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model the position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.
Availability: The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".
Conclusions: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
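Illustration: The conventional approach described in the Background (count observed k-mer frequencies, then validate those above a threshold) can be sketched as follows. This is a minimal sketch of that baseline only, not the statistical method proposed here; the function names, the toy reads, and the values of k and the threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) observed across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def validate_kmers(counts, threshold):
    """Return the k-mers whose observed frequency exceeds the threshold."""
    return {kmer for kmer, n in counts.items() if n > threshold}


# Hypothetical toy data: three error-free copies of a read and one copy
# carrying a single substitution (G -> T at index 3).
reads = ["ACGGTCA"] * 3 + ["ACGTTCA"]

counts = count_kmers(reads, k=4)
valid = validate_kmers(counts, threshold=1)
```

With these toy reads, the four k-mers from the error-free read are each observed three times and pass the threshold, while the four k-mers spanning the substitution are each observed once and are rejected. In a repeat-rich genome this separation can fail: an erroneous k-mer that differs by a few nucleotides from a valid k-mer with many genomic occurrences may itself be observed often enough to exceed a fixed threshold, which is the situation the Results address.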