Journal: Computer Speech and Language

Comparing human and automatic speech recognition in simple and complex acoustic scenes

Abstract

Earlier comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) have shown that humans outperform ASR systems in nearly all speech recognition tasks. However, recent progress in ASR has led to substantial improvements in recognition accuracy, and it is therefore unclear how large the task-dependent human-machine gap remains. This paper investigates the gap between HSR and ASR based on deep neural networks (DNNs) in different acoustic conditions, with the aim of comparing differences and identifying processing strategies that should be considered in ASR. We find that DNN-based ASR reaches human performance for single-channel, small-vocabulary tasks in the presence of speech-shaped noise and in multi-talker babble noise, which is an important difference from previous human-machine comparisons: the speech reception threshold (SRT), i.e., the signal-to-noise ratio at a 50% word recognition rate, is about −7 to −8 dB for both HSR and ASR. However, in more complex spatial scenes with diffuse noise and moving talkers, the SRT gap amounts to approximately 12 dB. Based on cross comparisons that use oracle knowledge (e.g., the speakers’ true position), incorrect responses are attributed to localization errors or to missing pitch information needed to distinguish between speakers of different gender. In terms of the SRT, localization errors and missing spectral information account for 2.1 and 3.2 dB, respectively. The comparison hence identifies specific components in ASR that can profit from learning from auditory signal processing.
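The abstract describes the speech reception threshold as the signal-to-noise ratio at which 50% of words are recognized, and reports human-machine differences as a gap between two such thresholds. The sketch below illustrates that definition only. It assumes recognition rates measured at a few fixed SNRs and uses plain linear interpolation to find the 50% crossing; the abstract does not state how the authors estimated their thresholds, so this is not their method. The function name `estimate_srt` and all numbers are hypothetical and invented for illustration; they are not results from the paper.

```python
from typing import Sequence


def estimate_srt(
    snrs_db: Sequence[float],
    word_rates: Sequence[float],
    target: float = 0.5,
) -> float:
    """Return the SNR (dB) at which the word recognition rate crosses `target`.

    Linear interpolation between the two neighbouring measurement points
    (an assumed, simplified estimator; a fitted psychometric function is
    another common choice).
    """
    points = sorted(zip(snrs_db, word_rates))
    for (snr_lo, rate_lo), (snr_hi, rate_hi) in zip(points, points[1:]):
        if rate_lo <= target <= rate_hi and rate_hi > rate_lo:
            frac = (target - rate_lo) / (rate_hi - rate_lo)
            return snr_lo + frac * (snr_hi - snr_lo)
    raise ValueError("target rate is not bracketed by the measurements")


# Hypothetical measurements, invented for illustration (NOT data from the paper).
snrs = [-12.0, -9.0, -6.0, -3.0]
listener_rates = [0.10, 0.40, 0.80, 0.95]  # made-up human listener scores
system_rates = [0.05, 0.20, 0.60, 0.90]    # made-up recognizer scores

listener_srt = estimate_srt(snrs, listener_rates)
system_srt = estimate_srt(snrs, system_rates)

# A positive gap means the recognizer needs a higher SNR than the listener
# to reach the same 50% word recognition rate.
srt_gap_db = system_srt - listener_srt
print(f"listener SRT: {listener_srt:.2f} dB, system SRT: {system_srt:.2f} dB, gap: {srt_gap_db:.2f} dB")
```

A lower SRT means better recognition in noise, so the gap is computed as the system's threshold minus the listener's.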
