Motivation: A major issue in computational biology is the reconstruction of pathways from several genomic datasets, such as expression data, protein interaction data and phylogenetic profiles. As a first step toward this goal, it is important to investigate how strongly these datasets are correlated. Method: We present new methods to measure the correlation between several heterogeneous datasets, and to extract sets of genes that share similarities with respect to multiple biological attributes. The originality of our approach lies in extending the concept of correlation to non-vectorial data, which is made possible by generalized kernel canonical correlation analysis (KCCA), and in the method we propose to extract the groups of genes responsible for the detected correlations. Moreover, two variants of KCCA are proposed for the case where more than two datasets are available. Result: These methods are successfully tested on their ability to recognize operons in the Escherichia coli genome, by comparing three datasets that correspond to functional relationships between genes in metabolic pathways, geometrical relationships along the chromosome, and co-expression relationships as observed in gene expression data.
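The abstract does not spell out the KCCA formulation, so the following is only a minimal sketch of the standard two-dataset regularized kernel CCA, not the authors' implementation and not their multi-dataset variants. It assumes each dataset has already been turned into a positive semidefinite kernel (similarity) matrix over the same genes. The function names, the regularization parameter `reg`, and the toy data are all illustrative assumptions. The per-gene scores it returns are one plausible basis for picking out genes that drive a detected correlation.

```python
import numpy as np


def center_kernel(K):
    """Center a kernel matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def regularized_kcca(K1, K2, reg=0.1):
    """First regularized canonical correlation between two kernel matrices.

    Maximizes a^T K1 K2 b subject to a^T (K1 + reg*I)^2 a = 1 and
    b^T (K2 + reg*I)^2 b = 1. With u = (K1 + reg*I) a and
    v = (K2 + reg*I) b, this reduces to the top singular triplet of
    R1 @ R2, where Ri = (Ki + reg*I)^{-1} Ki.
    """
    n = K1.shape[0]
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    I = np.eye(n)
    R1 = np.linalg.solve(K1c + reg * I, K1c)
    R2 = np.linalg.solve(K2c + reg * I, K2c)
    U, s, Vt = np.linalg.svd(R1 @ R2)
    # Map the leading singular vectors back to dual coefficients.
    alpha = np.linalg.solve(K1c + reg * I, U[:, 0])
    beta = np.linalg.solve(K2c + reg * I, Vt[0])
    # Per-gene scores on the first pair of canonical variates.
    return s[0], K1c @ alpha, K2c @ beta


# Toy data (assumed, not from the paper): 100 "genes" seen through two views.
rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)  # latent signal shared by both views
X = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([z + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y_indep = rng.normal(size=(n, 2))  # a view unrelated to X

# Linear kernels keep the toy case simple; any PSD kernel could be used.
rho_shared, scores1, scores2 = regularized_kcca(X @ X.T, Y @ Y.T)
rho_indep, _, _ = regularized_kcca(X @ X.T, Y_indep @ Y_indep.T)
print(rho_shared, rho_indep)
```

In this toy setting the views that share a latent signal should give a correlation close to 1, while the unrelated views should give a much smaller one. Because only kernel matrices enter the computation, the same routine would apply to non-vectorial data such as graphs or chromosome positions once a suitable kernel is chosen.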