END-TO-END MULTI-SPEAKER AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION

Abstract

A single audio-visual automated speech recognition model (200) for transcribing speech from audio-visual data (204) includes an encoder frontend (260) and a decoder (280). The encoder frontend includes an attention mechanism (270) configured to receive an audio track (210) and a video portion (220) of the audio-visual data. The video portion includes a plurality of video face tracks (230), each associated with a face of a respective person. For each video face track, the attention mechanism is configured to determine a confidence score indicating a likelihood that the face of the respective person associated with the video face track includes a speaking face of the audio track. The decoder is configured to process the audio track and the video face track of the plurality of video face tracks associated with the highest confidence score to determine a speech recognition result (248) of the audio track.
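
The abstract does not say how the attention mechanism (270) computes each confidence score. The sketch below is a minimal illustration under stated assumptions, not the patented method. It assumes that the audio track and each video face track have already been summarized as embeddings in a shared vector space, scores the tracks with a scaled dot-product attention followed by a softmax, and picks the highest-scoring track. The hypothetical `fuse_for_decoder` helper then concatenates per-frame audio features with the selected track's per-frame features as one plausible decoder input. All function names and the scoring rule are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_face_tracks(audio_embedding: Vector,
                      face_track_embeddings: Sequence[Vector]) -> List[float]:
    """Hypothetical attention step: one confidence score per video face track.

    The audio embedding acts as the query and each face-track embedding as a key.
    Scores are a softmax over scaled dot products, so they sum to 1.
    """
    if not face_track_embeddings:
        raise ValueError("at least one video face track is required")
    d = len(audio_embedding)
    logits = [dot(audio_embedding, f) / math.sqrt(d) for f in face_track_embeddings]
    return softmax(logits)


def select_speaking_face(audio_embedding: Vector,
                         face_track_embeddings: Sequence[Vector]) -> Tuple[int, List[float]]:
    """Return the index of the highest-confidence face track and all scores."""
    scores = score_face_tracks(audio_embedding, face_track_embeddings)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def fuse_for_decoder(audio_frames: Sequence[Vector],
                     face_track_frames: Sequence[Sequence[Vector]],
                     track_index: int) -> List[List[float]]:
    """Hypothetical decoder input: per-frame concatenation of audio features
    with the features of the selected face track (frame counts must match)."""
    selected = face_track_frames[track_index]
    if len(selected) != len(audio_frames):
        raise ValueError("audio and face track must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, selected)]


if __name__ == "__main__":
    audio = [1.0, 0.0, 1.0]
    faces = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [-1.0, 0.0, -1.0]]
    best, scores = select_speaking_face(audio, faces)
    print(best, [round(s, 3) for s in scores])
```

In this toy input, the second face track points in the same direction as the audio embedding, so it receives the highest score. A soft-attention variant would instead pass a score-weighted mixture of all face tracks to the decoder. The abstract describes using the single track with the highest confidence score, which is what the hard `max` selection reflects here.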

Bibliographic data

Publication number: WO2021076349A1

Publication date: 2021-04-22

Original format: PDF
Applicant/assignee: GOOGLE LLC

Application number: WO2020US54162
Inventor(s): BRAGA OTAVIO

Application date: 2020-10-02
Classification: G10L15/25; G10L15/16
Country: US
Added to database: 2022-08-24 18:22:09