Synthesizing realistic facial animation remains one of the most challenging topics in the graphics community because of the complexity of the deformations of a moving face and our inherent sensitivity to the subtleties of human facial motion. The central goal of this dissertation is data-driven facial animation synthesis that captures the dynamics, naturalness, and personality of facial motion while human subjects speak with emotion. The solution is to synthesize realistic 3D talking faces by learning from facial motion capture data. This dissertation addresses three critical parts of realistic talking-face synthesis: realistic eye motion synthesis, natural head motion synthesis, and expressive speech animation synthesis.

A texture-synthesis-based approach is presented to synthesize realistic eye gaze and blink motion simultaneously, accounting for possible correlations between the two. The quality of the statistical modeling and the introduction of gaze-eyelid coupling are improvements over previous work, and the synthesized eye motion is hard to distinguish from actual captured eye motion.

Two different approaches (sample-based and model-based) are presented to synthesize appropriate head motion. Given aligned training pairs of audio features and head motion, the sample-based approach uses a K-nearest-neighbors-based dynamic programming algorithm to search for the optimal head motion samples for novel speech input. The model-based approach trains Hidden Markov Models (HMMs) to capture the temporal relation between acoustic prosodic features and head motion and uses them to synthesize natural head motion.

This dissertation also presents two different approaches (model-based and sample-based) to generate novel expressive speech animation for new speech input. The model-based approach learns speech co-articulation models and expression eigen spaces from facial motion data, and then synthesizes novel expressive speech animation by applying these generative co-articulation models and sampling from the constructed expression eigen spaces. The sample-based system (eFASE) automatically generates expressive speech animation by concatenating captured facial motion frames while animators establish constraints and goals (novel phoneme-aligned speech input and its emotion modifiers). Users can also edit the processed facial motion database via a novel phoneme-Isomap interface.
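To make the sample-based head motion approach concrete, the following is a minimal sketch of a K-nearest-neighbors shortlist followed by a Viterbi-style dynamic program. The database layout, Euclidean distances, and parameters such as `k` and `w_smooth` are illustrative assumptions, not the dissertation's exact formulation.

```python
# A minimal sketch, assuming per-frame audio features and head poses are stored
# as row-aligned arrays (db_audio, db_pose); names like k and w_smooth are
# hypothetical parameters for exposition.
import numpy as np

def synthesize_head_motion(novel_audio, db_audio, db_pose, k=10, w_smooth=1.0):
    """For each input frame, shortlist the k nearest training frames by audio
    distance, then run a Viterbi-style dynamic program that trades off audio
    match against pose continuity between consecutive chosen samples."""
    T = len(novel_audio)
    # K-nearest-neighbor shortlist and matching cost per frame.
    cand_idx, obs_cost = [], []
    for t in range(T):
        d = np.linalg.norm(db_audio - novel_audio[t], axis=1)
        idx = np.argsort(d)[:k]
        cand_idx.append(idx)
        obs_cost.append(d[idx])
    # Dynamic programming over candidate choices.
    cost = obs_cost[0].copy()
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        # Transition cost: pose discontinuity between consecutive candidates.
        trans = np.linalg.norm(
            db_pose[cand_idx[t]][None, :, :] - db_pose[cand_idx[t - 1]][:, None, :],
            axis=2)
        total = cost[:, None] + w_smooth * trans + obs_cost[t][None, :]
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(k)]
    # Backtrack the optimal candidate sequence and return its head poses.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return np.stack([db_pose[cand_idx[t][j]] for t, j in enumerate(path)])
```

The smoothness weight balances fidelity to the input prosody against continuity of the concatenated head motion; the actual system may use richer features and costs than this sketch.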
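The expression eigen spaces used by the model-based speech animation approach can be read as a principal component analysis of expressive motion frames. The sketch below assumes flattened per-frame marker data and Gaussian sampling of eigen-coefficients; both are illustrative choices rather than the dissertation's exact procedure.

```python
# A minimal sketch, assuming expressive motion frames are stacked row-wise in
# `frames` (one flattened marker configuration per row); n_components and the
# Gaussian coefficient sampling are hypothetical choices.
import numpy as np

def build_expression_eigenspace(frames, n_components=10):
    """PCA via SVD on mean-centered frames; returns the mean, the retained
    eigen-basis, and per-component standard deviations used to bound sampling."""
    mean = frames.mean(axis=0)
    u, s, vt = np.linalg.svd(frames - mean, full_matrices=False)
    basis = vt[:n_components]                      # principal directions
    stddev = s[:n_components] / np.sqrt(len(frames) - 1)
    return mean, basis, stddev

def sample_expression(mean, basis, stddev, rng=None):
    """Draw eigen-coefficients within the observed variation and reconstruct
    a plausible expressive facial configuration."""
    rng = rng or np.random.default_rng()
    coeffs = rng.normal(0.0, stddev)               # one coefficient per component
    return mean + coeffs @ basis
```

Sampled expression offsets of this kind can then be combined with the output of the co-articulation models to produce expressive, rather than neutral, speech animation.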