A single audio-visual automated speech recognition model (200) for transcribing speech from audio-visual data (204) includes an encoder frontend (260) and a decoder (280). The encoder frontend includes an attention mechanism (270) configured to receive an audio track (210) and a video portion (220) of the audio-visual data. The video portion includes a plurality of video face tracks (230), each associated with a face of a respective person. For each video face track, the attention mechanism is configured to determine a confidence score indicating a likelihood that the face of the respective person associated with the video face track includes a speaking face of the audio track. The decoder is configured to process the audio track and the video face track of the plurality of video face tracks associated with the highest confidence score to determine a speech recognition result (248) of the audio track.
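The selection step performed by the attention mechanism can be illustrated with a minimal sketch. The abstract does not state how the confidence scores are computed, so the dot-product scoring, the softmax normalization, the fixed-length embeddings, and the function names below are all assumptions made for illustration, not the claimed implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_speaking_face(
    audio_embedding: Sequence[float],
    face_track_embeddings: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Score each video face track against the audio track.

    Assumption: the audio track and each face track have already been
    encoded as equal-length vectors, and a dot product followed by a
    softmax stands in for the attention mechanism's confidence score.
    Returns the index of the highest-confidence track and all confidences.
    """
    if not face_track_embeddings:
        raise ValueError("at least one video face track is required")
    raw_scores = [
        sum(a * f for a, f in zip(audio_embedding, face))
        for face in face_track_embeddings
    ]
    confidences = softmax(raw_scores)
    best_index = max(range(len(confidences)), key=confidences.__getitem__)
    return best_index, confidences


# Hypothetical embeddings: one audio track and three candidate face tracks.
audio = [1.0, 0.0, 0.5]
faces = [
    [0.1, 0.9, 0.0],
    [0.9, 0.1, 0.4],
    [0.0, 0.2, 0.1],
]
best_index, confidences = select_speaking_face(audio, faces)
```

In this sketch, the face track at `best_index` is the one that would be passed to the decoder together with the audio track. A hard argmax matches the "highest confidence score" wording of the abstract; a trained model might instead weight the face tracks by their confidences, which the abstract does not address.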