Journal: Advanced Robotics: The International Journal of the Robotics Society of Japan

A real-time super-resolution robot audition system that improves the robustness of simultaneous speech recognition



Abstract

This study presents a framework for a robot audition system, including sound source localization (SSL) and sound source separation (SSS), that can robustly recognize simultaneous speech in a real environment. Because SSL estimates not only the locations of speakers but also the number of speakers, a robust SSL framework is essential for simultaneous speech recognition. Moreover, improving the performance of SSS is crucial for simultaneous speech recognition because the robot has to recognize each individual speech source. For simultaneous speech recognition, current robot audition systems mainly require noise robustness, high resolution, and real-time implementation. Multiple signal classification (MUSIC) based on standard eigenvalue decomposition (SEVD) and geometric-constrained high-order decorrelation-based source separation (GHDSS) are microphone-array processing techniques used for SSL and SSS, respectively. To enhance the noise robustness of SSL while detecting simultaneous speech, we improved SEVD-MUSIC by incorporating generalized eigenvalue decomposition (GEVD). However, GEVD-based MUSIC (GEVD-MUSIC) and GHDSS have two main issues: (1) the resolution of the pre-measured transfer functions (TFs) determines the resolution of SSL and SSS, and (2) their computational cost is too high for real-time processing. For the first issue, we propose a TF-interpolation method that integrates time-domain-based and frequency-domain-based interpolation. The interpolation achieves super-resolution robot audition, which has a higher resolution than that of the pre-measured TFs. For the second issue, we propose two methods for SSL: MUSIC based on generalized singular value decomposition (GSVD-MUSIC) and hierarchical SSL (H-SSL). GSVD-MUSIC drastically reduces the computational cost while maintaining noise robustness for localization. In addition, H-SSL reduces the computational cost by introducing a hierarchical search algorithm instead of using a greedy search for localization. These techniques are integrated into a robot audition system using a robot-embedded microphone array. The preliminary experiments for each technique showed the following: (1) the proposed interpolation achieved approximately 1-degree resolution in both SSL and SSS although the TFs were measured only at 30-degree intervals; (2) GSVD-MUSIC attained 46.4% and 40.6% of the computational cost compared to that of SEVD-MUSIC and GEVD-MUSIC, respectively; (3) H-SSL reduced the computational cost of localizing a single speaker by 71.7%. Finally, the robot audition system, including super-resolution SSL and SSS, was applied to robustly recognize four sources of speech occurring simultaneously in a real environment. The proposed system showed considerable performance improvements of up to 7% in the average word correct rate during simultaneous speech recognition, especially when the TFs were at more than 30-degree intervals.
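
The abstract does not give the details of H-SSL, so the following is only a minimal sketch of the general coarse-to-fine idea behind a hierarchical direction search, not the authors' algorithm. It assumes a smooth, single-peaked spatial spectrum and a single source; `toy_spectrum`, `coarse_to_fine_peak`, `exhaustive_peak`, and all parameter values are hypothetical names and numbers chosen for illustration. The sketch compares how many spectrum evaluations a two-stage search needs against a scan of every 1-degree direction.

```python
import math


def toy_spectrum(angle_deg, source_deg=47.0, width_deg=40.0):
    """Hypothetical stand-in for a MUSIC spatial spectrum (assumption:
    smooth and single-peaked, with the source at source_deg)."""
    # Wrapped angular difference in the range [-180, 180).
    diff = (angle_deg - source_deg + 180.0) % 360.0 - 180.0
    return math.exp(-((diff / width_deg) ** 2))


def exhaustive_peak(spectrum, step=1):
    """Evaluate every direction on the fine grid; return (angle, evaluations)."""
    best_angle, best_value, evaluations = 0, -math.inf, 0
    for angle in range(0, 360, step):
        value = spectrum(angle)
        evaluations += 1
        if value > best_value:
            best_angle, best_value = angle, value
    return best_angle, evaluations


def coarse_to_fine_peak(spectrum, coarse_step=30, fine_step=1):
    """Two-stage search: coarse scan of the full circle, then a fine scan
    restricted to the neighbourhood of the best coarse direction."""
    evaluations = 0
    best_angle, best_value = 0, -math.inf

    # Stage 1: coarse grid over the whole circle.
    for angle in range(0, 360, coarse_step):
        value = spectrum(angle)
        evaluations += 1
        if value > best_value:
            best_angle, best_value = angle, value

    # Stage 2: fine grid strictly between the two neighbouring coarse points.
    centre = best_angle
    for offset in range(-coarse_step + fine_step, coarse_step, fine_step):
        if offset == 0:
            continue  # the centre was already evaluated in stage 1
        angle = (centre + offset) % 360
        value = spectrum(angle)
        evaluations += 1
        if value > best_value:
            best_angle, best_value = angle, value

    return best_angle, evaluations


if __name__ == "__main__":
    print(exhaustive_peak(toy_spectrum))      # (47, 360)
    print(coarse_to_fine_peak(toy_spectrum))  # (47, 70)
```

On this toy spectrum both searches find the same direction, while the two-stage search evaluates far fewer directions. The counts apply to this toy grid only and are not meant to reproduce the figures reported in the abstract. A coarse first stage can also miss a peak that is narrower than the coarse spacing, so a real system would need a spectrum, or a coarse-stage resolution, for which that assumption holds.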
