The unprecedented rate at which genomic data are accumulating underscores the need for highly efficient analytical capabilities. Traditionally, most post-sequencing effort has focused on identifying and annotating genes together with their promoters and regulatory elements. However, much of the vast sequence outside the gene space remains unexplored for lack of appropriate computational tools. Here, we propose a new approach for exploring and describing a genome without biasing the search towards already known structural entities. Our primary objective is to discover novel conserved patterns that typically fall outside the scope of current repeat-finding tools because of irregularities in their structure. The output is a hierarchy of patterns with arbitrary structural characteristics. This hierarchical representation captures the content of the genomic sequence at an abstract level and offers novel ways to examine the information it contains. Our approach is an information-theoretic search process that uses pattern-matching techniques to process the sequence data. A preliminary evaluation on the Drosophila genome found a number of irregular patterns. Discovering new patterns is an important problem in both whole-genome and comparative genomics applications. The proposed approach can provide an information-theoretic framework for pattern and knowledge discovery on genomic data.