An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech



Abstract

We present an information fusion approach to the robust recognition of multi-microphone speech. It is based on a deep learning framework with a large deep neural network (DNN) consisting of subnets designed from different perspectives. Multiple knowledge sources are then reasonably integrated via an early fusion of normalized noisy features from multiple beamforming techniques, enhanced speech features, speaker-related features, and other auxiliary features, concatenated as the input to each subnet to compensate for imperfect front-end processing. Furthermore, a late fusion strategy is utilized to leverage the complementary nature of the different subnets by combining the outputs of all subnets to produce a single output set. In our empirical study on the CHiME-3 task of recognizing microphone array speech, we demonstrate that the different information sources complement each other and that both early and late fusions provide significant performance gains, with an overall word error rate (WER) of 10.55% when combining 12 systems. Moreover, by utilizing an improved beamforming technique and a powerful recurrent neural network (RNN)-based language model for rescoring, a WER of 9.08% can be achieved with one-pass decoding for the best single DNN system among all of the systems submitted to the CHiME-3 challenge.
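To make the two fusion stages concrete, here is a minimal Python/NumPy sketch of the idea. It is not the authors' implementation: the feature names, dimensionalities, and the simple averaging of subnet posteriors are illustrative assumptions only.

    # Minimal sketch of early fusion (feature concatenation) and late fusion
    # (subnet output combination); names and sizes are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    num_frames = 100

    # Early fusion: concatenate per-frame streams into one input vector per frame.
    beamformed   = rng.standard_normal((num_frames, 40))   # features from beamformed audio
    enhanced     = rng.standard_normal((num_frames, 40))   # features from enhanced speech
    speaker_feat = rng.standard_normal((num_frames, 100))  # speaker-related, e.g. an i-vector tiled per frame
    auxiliary    = rng.standard_normal((num_frames, 3))    # other auxiliary features
    dnn_input = np.concatenate([beamformed, enhanced, speaker_feat, auxiliary], axis=1)
    print(dnn_input.shape)  # (100, 183): one concatenated feature vector per frame

    # Late fusion: combine per-frame posteriors produced by several subnets.
    num_senones = 2000
    def subnet_posteriors(x):
        # Stand-in for one trained subnet: random logits followed by a softmax.
        logits = rng.standard_normal((x.shape[0], num_senones))
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    outputs = [subnet_posteriors(dnn_input) for _ in range(3)]
    combined = np.mean(outputs, axis=0)  # simple average; weights could also be tuned

In the actual framework each subnet is a trained DNN and the combined posteriors feed the decoder; the plain averaging above only stands in for whatever system-combination scheme is used.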

Bibliographic record

  • Source
    Computer Speech and Language | 2017, Issue 11 | pp. 517-534 | 18 pages
  • Author affiliations

    University of Science and Technology of China, Hefei, Anhui, China;

    University of Science and Technology of China, Hefei, Anhui, China;

    University of Science and Technology of China, Hefei, Anhui, China;

    University of Science and Technology of China, Hefei, Anhui, China;

    University of Science and Technology of China, Hefei, Anhui, China;

    Georgia Institute of Technology, Atlanta, GA, United States;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI);
  • Format: PDF
  • Language: eng
  • Chinese Library Classification:
  • Keywords

    CHiME challenge; Deep learning; Information fusion; Microphone array; Robust speech recognition;

  • Date added: 2022-08-18 02:11:09
