This paper presents a methodology for speaker-independent automated recognition of isolated words. It employs a feature vector combining the first three formant frequencies of the vocal tract with the mean zero-crossing rate (ZCR) of the audio signal. Formant frequencies are estimated by modeling the vocal tract as an LPC filter and computing its resonant frequencies. ZCR is computed by partitioning the audio signal into segments and counting the number of times the signal crosses the zero amplitude level within each segment. A neural network (multi-layer perceptron) classifies the spoken word. The network is trained on a set of specific words uttered by nine speakers (both male and female) and tested on the same words uttered by a different set of speakers. The resulting accuracies indicate that this feature set outperforms comparable approaches reported in the literature.
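The ZCR feature described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sampling rate, frame length, and synthetic test tone are assumed values chosen for demonstration, and the LPC-based formant estimation step is not shown.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (one zero crossing)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

def frame_signal(signal, frame_len):
    """Split a signal into non-overlapping frames; any partial final frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

# Illustrative parameters (not taken from the paper): 8 kHz sampling, 25 ms frames.
FS = 8000
FRAME_LEN = 200

# Synthetic 440 Hz tone, 1 second long; a small phase offset avoids samples
# landing exactly on zero. For a pure tone of frequency f, the per-sample ZCR
# should approach 2 * f / FS.
signal = [math.sin(2 * math.pi * 440 * n / FS + 0.1) for n in range(FS)]
rates = [zero_crossing_rate(f) for f in frame_signal(signal, FRAME_LEN)]
mean_zcr = sum(rates) / len(rates)  # the mean ZCR used as a feature component
```

In the paper's pipeline this mean ZCR would be concatenated with the first three formant frequencies (obtained from the roots of the LPC polynomial) to form the feature vector fed to the multi-layer perceptron.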