Audio-to-visual synchronization is important for multimedia applications involving talking humans, either natural or synthetic. A close correlation exists between the acoustic speech signal and the visible lip movements, and it can be exploited to develop real-time audio-to-visual conversion. In this article, we apply ART2 together with a multi-audio-frame technique to derive the lip-movement sequence from its corresponding audio speech stream. The training process of ART2 is fast, and the network can learn new patterns without necessarily forgetting those learned in the past. For multi-user adaptation, we propose a system that uses one user's ART2 model as the reference model, together with audio-adaptation and visual-learning mechanisms, to adapt to a new user: the audio adaptation maps the new user's audio features onto the reference model's audio features, and the visual learning lets the reference ART2 model learn the new user's speech characteristics. Experimental results show that the proposed ART2-based method is both fast and effective in both single-user and multi-user settings.
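To illustrate the stability-plasticity property the abstract attributes to ART2 (learning new categories without erasing old ones), here is a minimal, hypothetical sketch of an ART-style clustering step over audio feature frames. The function name, vigilance value, and learning rate are illustrative assumptions, not the paper's actual implementation; a full ART2 network additionally includes normalization and noise-suppression layers omitted here.

```python
import numpy as np

def art_cluster(frames, vigilance=0.9, lr=0.2):
    """Cluster audio feature frames with an ART-style vigilance test.

    A frame is assigned to its closest prototype if their cosine
    similarity passes the vigilance threshold (resonance); otherwise a
    new category is committed, so earlier categories are never erased.
    This mimics ART2's fast, incremental learning; it is a sketch, not
    the paper's exact network.
    """
    prototypes = []   # learned category prototypes (unit vectors)
    labels = []       # category index assigned to each input frame
    for x in frames:
        x = np.asarray(x, dtype=float)
        x = x / (np.linalg.norm(x) + 1e-12)
        if prototypes:
            sims = [float(p @ x) for p in prototypes]
            best = int(np.argmax(sims))
            if sims[best] >= vigilance:
                # resonance: nudge the winning prototype toward the input
                p = prototypes[best] + lr * (x - prototypes[best])
                prototypes[best] = p / (np.linalg.norm(p) + 1e-12)
                labels.append(best)
                continue
        # mismatch: commit a new category for this novel input
        prototypes.append(x.copy())
        labels.append(len(prototypes) - 1)
    return prototypes, labels
```

In the article's setting, each input would be a multi-audio-frame feature vector (several consecutive audio frames stacked together), and each resulting category could be associated with a representative lip-shape parameter vector for the audio-to-visual mapping.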