This paper summarizes our work at Queen's University and ATR Laboratories on cross-modal speech perception and production. Our approach has been to study these two sides of speech together and to use multi-modal speech production data to parameterize and control audiovisual animation systems. Two approaches to production-based facial animation have been pursued: one statistical and the other physical. In both cases, realistic talking-head animations are generated from continuous input of production data. The statistical animation method of AV synthesis extends our multi-linear techniques, developed for the analysis of orofacial motion and speech acoustics, to include the correlation between measured 3D positions on the face and deformation coefficients of the facial surface. In the physical approach, the dynamic form of the animation is determined by the biophysical characteristics of the animated object. The physical model consists of multiple structural layers: model skull and jaw surfaces, an orofacial muscle layer, and a three-layer polygon model of the soft tissue. In a series of studies using these animation approaches, we have examined the conditions under which speech perception in noise is enhanced by simultaneous visual presentation. Our data show a distinction between visual prosody and segmental perception, as well as demonstrating that our animated stimuli produce natural increases in speech intelligibility.
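As a minimal sketch of the statistical idea described above — correlating measured 3D facial positions with deformation coefficients via a linear mapping — the following Python example fits one block of such a model with least squares. All names, dimensions, and the synthetic data are illustrative assumptions, not the paper's actual parameterization.

```python
# Hypothetical sketch: learn a linear map from flattened 3D marker
# positions to facial-surface deformation coefficients, then use it
# to drive animation coefficients from a new frame of marker data.
import numpy as np

rng = np.random.default_rng(0)

n_frames, n_markers, n_coeffs = 200, 18, 10          # assumed sizes
X = rng.normal(size=(n_frames, n_markers * 3))        # marker positions per frame
W_true = rng.normal(size=(n_markers * 3, n_coeffs))   # synthetic ground-truth map
Y = X @ W_true + 0.01 * rng.normal(size=(n_frames, n_coeffs))  # noisy coefficients

# Least-squares estimate of the linear mapping (one mode of a multi-linear model)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict deformation coefficients for an unseen frame of marker data
x_new = rng.normal(size=(1, n_markers * 3))
coeffs = x_new @ W
print(coeffs.shape)  # one row of n_coeffs deformation coefficients
```

In practice such a mapping would be estimated from recorded production data rather than synthetic arrays, and extended across modes (motion, acoustics) in the multi-linear framework the abstract mentions.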