This paper compares machine learning techniques and pattern discovery algorithms for the prediction of human single nucleotide polymorphisms (SNPs). We selected six pattern discovery algorithms (YMF, Projection, Weeder, MotifSampler, AlignACE and ANN-Spec) and two machine learning techniques (Random Forests and K-Nearest Neighbours) and applied them to the DNA sequences flanking non-coding SNPs on human chromosome 21. We compared the similarity of the patterns found by the different methods and validated the predictions using known SNPs on chromosome 22. Parameterization of both the machine learning and the pattern discovery algorithms was critical to their performance. Memory usage was broadly constant across the pattern discovery algorithms, but CPU running time differed significantly between deterministic and probabilistic pattern discovery methods: on average, probabilistic methods run 19 times slower than deterministic methods. This is the first demonstration of SNP prediction, as well as the first comparison of machine learning and pattern discovery algorithms in SNP prediction studies.