
Multi-task deep neural network acoustic models with model adaptation using discriminative speaker identity for whisper recognition



Abstract

This paper presents a study on whispered large vocabulary continuous speech recognition (wLVCSR). wLVCSR makes it possible to use ASR equipment in public places without concern for disturbing others or leaking private information. However, wLVCSR is much more challenging than normal LVCSR because of the absence of pitch, which not only makes the signal-to-noise ratio (SNR) of whispers much lower than that of normal speech but also leads to flatness and formant shifts in whisper spectra. Furthermore, far less whisper data is available for training than normal speech. In this paper, multi-task deep neural network (DNN) acoustic models are deployed to address these problems. Moreover, model adaptation is performed on the multi-task DNN to normalize speaker and environmental variability in whispers, based on discriminative speaker identity information. On a Mandarin whisper dictation task with 55 hours of whisper data, the proposed SI multi-task DNN model achieves a 56.7% character error rate (CER) improvement over a baseline Gaussian mixture model (GMM) discriminatively trained on the whisper data alone. Moreover, the proposed model reaches 15.2% CER on normal speech, close to the performance of a state-of-the-art DNN trained on one thousand hours of speech data. From this baseline, the model-adapted DNN gains a further 10.9% CER reduction over the generic model.
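The multi-task architecture the abstract describes can be pictured as a shared DNN trunk feeding two softmax heads, with a speaker identity vector appended to the acoustic input for adaptation. The following is a minimal NumPy forward-pass sketch, not the paper's implementation: all layer sizes, the splicing window, and the 64-dimensional speaker vector are hypothetical placeholders, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: a 40-dim filterbank frame spliced over 11 frames,
# plus a 64-dim speaker identity vector appended as the adaptation input.
FEAT_DIM, SPK_DIM, HID, SENONES, N_SPK = 40 * 11, 64, 256, 3000, 100

# Shared hidden layers (random here; trained jointly in practice).
W1 = rng.standard_normal((FEAT_DIM + SPK_DIM, HID)) * 0.01
W2 = rng.standard_normal((HID, HID)) * 0.01
# Task-specific heads: senone posteriors and speaker classification.
W_senone = rng.standard_normal((HID, SENONES)) * 0.01
W_spk = rng.standard_normal((HID, N_SPK)) * 0.01

def forward(frames, spk_vec):
    """Multi-task forward pass: shared trunk, two softmax heads."""
    x = np.concatenate([frames, np.tile(spk_vec, (len(frames), 1))], axis=1)
    h = relu(relu(x @ W1) @ W2)
    return softmax(h @ W_senone), softmax(h @ W_spk)

frames = rng.standard_normal((8, FEAT_DIM))   # 8 whispered frames
spk_vec = rng.standard_normal(SPK_DIM)        # speaker identity vector
senone_post, spk_post = forward(frames, spk_vec)
print(senone_post.shape, spk_post.shape)      # (8, 3000) (8, 100)
```

The auxiliary speaker head and the input speaker vector correspond, loosely, to the two roles of speaker identity in the abstract: a secondary training target for the multi-task model and a conditioning signal for model adaptation.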
