Protein secondary structure prediction takes a sequence of amino acids as input and predicts the secondary structure class of each position in the sequence. Our approach extracts protein words of a fixed length from protein sequences and then applies nearest-neighbor classification to predict the secondary structure class of the middle position in each word. We present a new formulation for learning a distance function on protein words based on position-dependent substitution scores on amino acids. These substitution scores are learned by solving a large linear programming problem on examples of words with known secondary structures. We evaluated this approach on a database of 5519 proteins with a total length of approximately 3,000,000 amino acids. In a test using words of length 23, the method achieved an accuracy of 65.2%, averaged uniformly over word positions. The average accuracy was 63.1% for alpha-classified words, 56.6% for beta-classified words, and 71.6% for coil-classified words.
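The word-extraction and nearest-neighbor steps described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy sequence, and the labels H/E/C for alpha/beta/coil are all assumptions, and the placeholder scores (0 for a match, 1 for a mismatch at every position) stand in for the substitution scores that the paper learns by linear programming.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def extract_words(sequence, labels, length):
    """Return (word, label of the middle position) for every full window."""
    half = length // 2
    return [(sequence[i - half:i + half + 1], labels[i])
            for i in range(half, len(sequence) - half)]


def placeholder_scores(length):
    """Hypothetical stand-in for learned scores: one table per word position."""
    table = {(x, y): 0.0 if x == y else 1.0
             for x in AMINO_ACIDS for y in AMINO_ACIDS}
    return [dict(table) for _ in range(length)]


def word_distance(a, b, scores):
    """Sum the position-dependent substitution scores over aligned residues."""
    return sum(scores[p][(x, y)] for p, (x, y) in enumerate(zip(a, b)))


def predict_middle(word, training, scores, k=1):
    """Majority label among the k training words nearest to `word`."""
    nearest = sorted(training,
                     key=lambda item: word_distance(word, item[0], scores))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data (invented for illustration), using words of length 3.
training = extract_words("AEELKVVG", "HHHHEEEC", 3)
scores = placeholder_scores(3)
prediction = predict_middle("KVV", training, scores)
```

With position-dependent tables, a mismatch near the middle of a word can be penalized differently from one at its edge, which is what distinguishes this distance from a single substitution matrix applied uniformly.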