A good front end for visual feature extraction is an important element of audio-visual speech recognition systems. We propose a new visual feature representation that combines both geometric- and pixel-based features. Using our previously developed contour-based lip-tracking algorithm, geometric features including the height and width of the lips are automatically extracted. Lip boundary tracking allows accurate determination of a region of interest from which we construct pixel-based features that are robust to variation in scale and translation. Motivated by computational considerations, we selected a subset of the pixels in the center of the inner mouth area that was found to capture sufficient details of the appearance of the teeth and tongue for assisting in the discrimination of spoken words. We show the advantage of the combination of these visual features for visual-only and audio-visual speech recognition of isolated digits.