IEEE/ACM Transactions on Audio, Speech, and Language Processing

Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection


Abstract

Voice activity detection (VAD) is an important topic in audio signal processing. Contextual information is important for improving VAD performance at low signal-to-noise ratios. Here we explore contextual information with machine learning methods at three levels. At the top level, we employ an ensemble learning framework, named multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block takes as input the concatenation of the predictions of its lower building blocks and an expansion of the raw acoustic feature over a given window (called a resolution). At the middle level, we describe the base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions for a single frame from its different contexts using only one DNN, and then aggregates these base predictions into a better prediction of the frame; unlike computationally expensive boosting methods, it does not train an ensemble of classifiers to obtain the multiple base predictions. At the bottom level, we employ the multi-resolution cochleagram feature, which incorporates contextual information by concatenating cochleagram features at multiple spectrotemporal resolutions. Experimental results show that the MRS-based VAD outperforms other VADs by a considerable margin. Moreover, when trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD shows surprisingly good generalization to unseen test scenarios, approaching the performance achieved with noise-dependent training.
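The bDNN idea described in the abstract — a single network emits one base prediction for every frame inside a context window, and the overlapping base predictions that different windows produce for the same frame are then aggregated — can be sketched in numpy. This is a minimal illustration under our own assumptions: `expand_frames`, `bdnn_predict`, the stub model, and the plain-average aggregation are hypothetical names and simplifications, not code or the exact aggregation rule from the paper.

```python
import numpy as np

def expand_frames(X, w):
    """Concatenate each frame with its +/- w neighbors.

    X: (T, d) per-frame features. Edge frames are padded by repeating
    the first/last frame. Returns (T, d * (2w + 1)) expanded inputs.
    """
    T, d = X.shape
    Xp = np.pad(X, ((w, w), (0, 0)), mode="edge")
    return np.stack([Xp[t:t + 2 * w + 1].reshape(-1) for t in range(T)])

def bdnn_predict(X, model, w):
    """bDNN-style prediction with one model and multiple base predictions.

    `model` maps the expanded input of frame t to 2w + 1 scores, one for
    each frame in t's window. Every frame u therefore receives base
    predictions from up to 2w + 1 overlapping windows; here they are
    aggregated by simple averaging into the final per-frame score.
    """
    T, _ = X.shape
    Z = model(expand_frames(X, w))          # shape (T, 2w + 1)
    scores = np.zeros(T)
    counts = np.zeros(T)
    for t in range(T):
        for j, offset in enumerate(range(-w, w + 1)):
            u = t + offset                  # frame that Z[t, j] predicts
            if 0 <= u < T:
                scores[u] += Z[t, j]
                counts[u] += 1
    return scores / counts
```

A constant stub model makes the aggregation easy to check: if every window predicts 0.5 for every frame it covers, the averaged per-frame scores are all 0.5. In the paper's setting the model would be a trained DNN and the scores would be thresholded into speech/non-speech decisions.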
