This thesis presents Bayesian joint audio-visual tracking of the 3D locations of multiple people and of the current speaker(s) in a real conference environment. To achieve this objective, it addresses several research topics: acoustic-feature detection, visual-feature detection, a tracking framework, data association, and sensor fusion. For acoustic-feature detection, time-difference-of-arrival (TDOA) estimation is used to detect multiple acoustic sources, and localization performance using TDOAs is analyzed for different microphone configurations. For visual-feature detection, Viola-Jones face detection initializes the locations of an unknown number of people; motion detection using corner features, seeded by the Viola-Jones detections, then follows these non-rigid frontal faces, face profiles, and upper bodies in normal tracking mode. Simple point-to-line correspondences between multiple cameras, computed with fundamental matrices, determine which features are more robust. For data association and sensor fusion, a Monte-Carlo joint probabilistic data association filter (JPDAF) and data association with IPPF (DA-IPPF) are implemented within a particle-filtering framework. The proposed algorithms and framework are applied to three tracking scenarios: acoustic source tracking, visual source tracking, and joint acoustic-visual source tracking. Finally, the implementation of the joint acoustic-visual tracking system using cameras and microphones is described in two parts: system implementation and real-time processing.
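To illustrate the TDOA-based acoustic detection mentioned above, the following is a minimal sketch of delay estimation between two microphone signals using GCC-PHAT, a common TDOA estimator; it is not claimed to be the thesis's exact method, and the signal lengths, sampling rate, and 5-sample delay are assumptions chosen for the demonstration.

```python
# Minimal GCC-PHAT sketch for TDOA estimation between two microphone channels.
# All names and the synthetic 5-sample delay below are illustrative assumptions.
import numpy as np

def gcc_phat_tdoa(sig, ref, fs=1.0):
    """Estimate the delay of `sig` relative to `ref` (in seconds at rate fs)."""
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center zero lag
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic check: a white-noise burst delayed by 5 samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref))[:1024]
print(gcc_phat_tdoa(sig, ref))  # prints 5.0 (the inserted 5-sample delay)
```

In practice, such pairwise TDOAs from several microphone pairs would be combined (e.g., by intersecting the hyperboloids each TDOA defines) to localize sources, which is where the microphone-configuration analysis mentioned above becomes relevant.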