Machines which automatically recognize patterns from a stream of acoustic events, for example a spoken command, would have great utility in both communications and data processing. This paper reviews two applications of an elementary recognizer to the problem of actuating certain logical functions, and indicates how more ambitious recognizers might be utilized. In this regard, the automatic measurement of a talker's voice pitch and voicing dynamics appears fundamental to speech analysis, and hence to many recognition schemes. Visual inspection of spectral data taken from different speakers supports this contention. Segmentation of speech into discrete units suitable for recognition, including the possibility of overlapping elements, is discussed. There is reason to expect that such segments will span several elementary speech sounds (phonemes). To illustrate this approach, a set of rules is presented for associating visual spectral displays (sound spectrograms) with the perception evoked by the corresponding utterances. These rules are specifically tailored for a limited vocabulary consisting of ten spoken numbers, and were validated by naive subjects who used them to identify the utterances of 33 people. In a further experiment, spectrograms of the same material from 14 talkers were simplified by reducing them to binary elements. It was found that master patterns for each number, compiled from the ensemble of talkers, could identify the utterances with over 99% success. These results emphasize a “diversity” approach to speech recognition which operates on relations between gross spectral features and does not depend exclusively on any one property.