Among the relevant annotations that can be attributed to a protein, domains occupy a key position. Protein domains are sequential and structural motifs that are found independently in different proteins and in different combinations. One of the most widely used domain schemes is the Pfam database, a collection of protein domains and families. Each family in Pfam is represented by a multiple sequence alignment and a Hidden Markov Model (HMM). When analyzing a new protein sequence, each Pfam HMM is used to compute a score measuring the similarity between the sequence and the domain. If the score is above a given threshold provided by Pfam, the presence of the domain can be asserted in the protein. However, when applied to proteins of organisms that are evolutionarily distant from classical model organisms, this strategy may miss several domains. We recently proposed a method, the Co-Occurrence Domain Detection approach (CODD), that improves the sensitivity of Pfam domain detection by exploiting the tendency of domains to appear preferentially with a few other favorite domains in a protein. Here, we propose to integrate domain exclusion information to prune false positive domains that are in conflict with other domains of the protein. Applied to P. falciparum and L. major proteins, we show that this strategy substantially reduces the proportion of false positives among the new domains predicted by CODD, while preserving the sensitivity of the approach as much as possible.
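The three ideas described above (threshold-based detection, rescue of weaker hits by co-occurrence, and pruning by exclusion) can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the published CODD procedure: the function names, the `floor` cutoff for candidate hits, the domain identifiers, and the hand-written pair sets are all hypothetical, and a real analysis would derive the favorable and exclusive pairs statistically from annotated proteins.

```python
from typing import Dict, FrozenSet, Set

Pair = FrozenSet[str]


def confident_domains(scores: Dict[str, float],
                      thresholds: Dict[str, float]) -> Set[str]:
    """Domains whose HMM score reaches the per-family threshold."""
    return {d for d, s in scores.items() if s >= thresholds[d]}


def rescue_by_cooccurrence(scores: Dict[str, float],
                           thresholds: Dict[str, float],
                           favorable: Set[Pair],
                           floor: float) -> Set[str]:
    """Below-threshold hits supported by a confident partner domain.

    `floor` is a hypothetical relaxed cutoff that limits which weak
    hits are considered at all.
    """
    confident = confident_domains(scores, thresholds)
    candidates = {d for d, s in scores.items()
                  if floor <= s < thresholds[d]}
    return {c for c in candidates
            if any(frozenset({c, k}) in favorable for k in confident)}


def prune_by_exclusion(rescued: Set[str],
                       confident: Set[str],
                       exclusive: Set[Pair]) -> Set[str]:
    """Drop rescued domains that conflict with any confident domain."""
    return {r for r in rescued
            if not any(frozenset({r, k}) in exclusive for k in confident)}


# Hypothetical scores for one protein; identifiers are placeholders,
# not real Pfam accessions.
scores = {"DOM_A": 30.0, "DOM_B": 12.0, "DOM_C": 11.0,
          "DOM_D": 2.0, "DOM_E": 25.0}
thresholds = {d: 20.0 for d in scores}
favorable = {frozenset({"DOM_A", "DOM_B"}), frozenset({"DOM_A", "DOM_C"})}
exclusive = {frozenset({"DOM_E", "DOM_C"})}

confident = confident_domains(scores, thresholds)
rescued = rescue_by_cooccurrence(scores, thresholds, favorable, floor=5.0)
kept = prune_by_exclusion(rescued, confident, exclusive)
print(sorted(confident), sorted(rescued), sorted(kept))
```

In this toy example, DOM_B and DOM_C are both rescued through their association with DOM_A, but DOM_C is then pruned because it is marked as exclusive with the confidently detected DOM_E.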