MULTI-MODAL SPEECH LOCALIZATION

Abstract

Multi-modal speech localization is achieved using image data captured by one or more cameras and audio data captured by a microphone array. The audio data captured by each microphone of the array is transformed into a frequency-domain representation that is discretized into a plurality of frequency intervals. The image data captured by each camera is used to determine the positioning of each human face. Input data is then provided to a previously trained audio source localization classifier. This input includes the frequency-domain representation of the audio data captured by each microphone and the positioning of each human face captured by each camera, where the positioning of each face represents a candidate audio source. Based on this input data, the classifier indicates an identified audio source: the human face estimated to be the one from which the audio data originated.

机译：使用由一个或多个摄像机捕获的图像数据和由麦克风阵列捕获的音频数据实现的多模模式语音定位。阵列的每个麦克风捕获的音频数据被变换以获得以多个频率间隔离散化的频域表示。每个摄像机捕获的图像数据用于确定每个人脸的定位。输入数据被提供给先前训练的音频源定位分类器，包括：由每个麦克风捕获的音频数据的频域表示，以及每个人面的每个摄像机捕获的每个人脸的定位候选音频源。识别的音频源由分类器基于估计是来自音频数据源自的人脸的输入数据来指示。
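
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the data flow under loudly stated assumptions, not the patented method. It assumes a two-microphone array, a single camera whose optical axis coincides with the array's broadside direction, and faces reduced to hypothetical bounding-box centers. Most importantly, the previously trained classifier is replaced by a hand-crafted stand-in: a GCC-PHAT steered-response score evaluated at each face's expected inter-microphone delay. All names (`Face`, `dft`, `face_azimuth`, `localize`, `run_demo`) are hypothetical illustrations.

```python
import cmath
import math
import random
from dataclasses import dataclass

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


@dataclass
class Face:
    """Hypothetical face detection: horizontal pixel center of its bounding box."""
    face_id: str
    center_x: float


def dft(frame):
    """Naive DFT returning bins 0..N/2, i.e. the discrete frequency intervals."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def face_azimuth(face, image_width, hfov_rad):
    """Pinhole-camera azimuth of a face (radians), assuming the camera's optical
    axis coincides with the microphone array's broadside direction."""
    focal_px = (image_width / 2) / math.tan(hfov_rad / 2)
    return math.atan((face.center_x - image_width / 2) / focal_px)


def localize(mic1, mic2, faces, fs, mic_spacing, image_width, hfov_rad):
    """Stand-in for the trained classifier (NOT the patent's model): scores each
    face (candidate source) with a GCC-PHAT steered response at its expected delay."""
    n = len(mic1)
    spec1, spec2 = dft(mic1), dft(mic2)
    # PHAT-weighted cross-spectrum; DC and Nyquist bins are skipped.
    cross = []
    for k in range(1, n // 2):
        r = spec1[k] * spec2[k].conjugate()
        cross.append((k, r / (abs(r) + 1e-12)))
    scores = {}
    for face in faces:
        theta = face_azimuth(face, image_width, hfov_rad)
        # Expected delay of mic 2 relative to mic 1 (assumed geometry: a positive
        # azimuth means the wavefront reaches mic 1 first).
        tau = mic_spacing * math.sin(theta) / SPEED_OF_SOUND
        scores[face.face_id] = sum(
            (r * cmath.exp(-2j * math.pi * (k * fs / n) * tau)).real
            for k, r in cross
        )
    best = max(faces, key=lambda f: scores[f.face_id])
    return best, scores


def run_demo(delay_samples=5, seed=0):
    """Synthetic scene: white noise that reaches mic 2 `delay_samples` late."""
    random.seed(seed)
    fs, n, spacing, width, hfov = 16000, 256, 0.2, 640, math.radians(90)
    mic1 = [random.gauss(0.0, 1.0) for _ in range(n)]
    mic2 = [mic1[(t - delay_samples) % n] for t in range(n)]  # circular delay
    focal_px = (width / 2) / math.tan(hfov / 2)
    theta_true = math.asin(delay_samples * SPEED_OF_SOUND / (fs * spacing))
    faces = [
        Face("left", width / 2 + focal_px * math.tan(math.radians(-30))),
        Face("center", width / 2),
        Face("speaker", width / 2 + focal_px * math.tan(theta_true)),
    ]
    best, scores = localize(mic1, mic2, faces, fs, spacing, width, hfov)
    return best.face_id, scores


if __name__ == "__main__":
    best_id, scores = run_demo()
    print(best_id, {k: round(v, 1) for k, v in scores.items()})
```

In the synthetic scene, the face placed at the azimuth matching the true delay collects a score close to the number of scored bins (127), while the other candidates score near zero, so `run_demo()` selects `"speaker"`. A trained classifier, as the abstract describes, would replace the fixed scoring rule in `localize` and could also combine several cameras; the sketch does not attempt either.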

Bibliographic data

Publication number: EP3791391A1
Publication date: 2021-03-17
Original document format: PDF
Applicant/patentee: MICROSOFT TECHNOLOGY LICENSING LLC
Application number: EP20190723577
Inventors: KRUPKA EYAL; XIAO XIONG
Filing date: 2019-04-30
Classification: G10L17/10; G10L17; G06K9; H04N7/15; H04N5/232
Country: EP
Date added to database: 2024-06-14 21:22:20