Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Leveraging Frequency-Dependent Kernel and DIP-Based Clustering for Robust Speech Activity Detection in Naturalistic Audio Streams



Abstract

Speech activity detection (SAD) is the front end of most speech systems, e.g., speaker verification and speech recognition. Supervised SAD typically leverages machine learning models trained on annotated data. For applications such as zero-resource speech processing and the NIST-OpenSAT-2017 public safety communications task, it might not be feasible to collect SAD annotations. SAD is challenging for naturalistic audio streams that contain multiple simultaneous noise sources. We propose novel frequency-dependent kernel (FDK) based SAD features. FDK provides an enhanced spectral decomposition from which several statistical descriptors are derived. The FDK statistical descriptors are combined by principal component analysis into one-dimensional FDK-SAD features. We further propose two decision backends: first, a variable model-size Gaussian mixture model (VMGMM); and second, Hartigan dip-based robust feature clustering (DipSAD). While VMGMM is a model-based approach, DipSAD is nonparametric. We use both backends for comparative evaluations in two phases: first, standalone SAD performance; and second, the effect of SAD on text-dependent speaker verification using RedDots data. The NIST-OpenSAD-2015 and NIST-OpenSAT-2017 corpora are used for standalone SAD evaluations. We establish two Center for Robust Speech Systems (CRSS) corpora, namely CRSS-PLTL-II and the CRSS long-duration naturalistic noise corpus. The CRSS corpora facilitate standalone SAD evaluations on naturalistic audio streams. We perform comparative studies of the proposed approaches against multiple baselines, including SohnSAD, rSAD, a semisupervised Gaussian mixture model, and Gammatone spectrogram features.
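
The general shape of the pipeline described in the abstract (per-frame statistical descriptors, reduced by principal component analysis to a one-dimensional feature, then clustered without labels into speech and non-speech) can be illustrated with a minimal sketch. This is not the authors' method: the descriptors below are synthetic stand-ins for the FDK descriptors, a fixed two-component Gaussian mixture replaces the variable model-size VMGMM, and the dip-based backend is not shown. All function names are hypothetical, and the sketch assumes column 0 is an energy-like descriptor that is larger for speech.

```python
import math
import random


def first_principal_component(rows, iters=200):
    """Project each row onto the first principal axis (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Start from a uniform unit vector; assumes it is not orthogonal
    # to the dominant eigenvector of the covariance matrix.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def fit_two_gaussians(x, iters=50):
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    lo, hi = min(x), max(x)
    mu = [lo, hi]
    var = [((hi - lo) / 4) ** 2 + 1e-6] * 2
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component per sample.
        resp = []
        for xi in x:
            p = [weight[k] * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = (p[0] + p[1]) or 1e-300
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weight[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = sum(r[k] * (xi - mu[k]) ** 2
                         for r, xi in zip(resp, x)) / nk + 1e-6
    return resp, mu


def detect_speech(rows):
    """Return one boolean per frame: True where the frame looks like speech."""
    z = first_principal_component(rows)
    # Assumption: column 0 is energy-like and larger for speech, so the
    # 1-D feature is oriented to correlate positively with it.
    m0 = sum(r[0] for r in rows) / len(rows)
    if sum(zi * (r[0] - m0) for zi, r in zip(z, rows)) < 0:
        z = [-zi for zi in z]
    resp, mu = fit_two_gaussians(z)
    speech_k = 0 if mu[0] > mu[1] else 1  # higher-mean component = speech
    return [r[speech_k] > 0.5 for r in resp]


# Synthetic stand-in descriptors: 100 noise frames, then 100 speech frames.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.3) for _ in range(3)] for _ in range(100)]
speech = [[rng.gauss(3.0, 0.3) for _ in range(3)] for _ in range(100)]
labels = detect_speech(noise + speech)
print(sum(labels), "of", len(labels), "frames labelled speech")
```

A fixed two-component mixture is the simplest unsupervised decision rule for a one-dimensional feature; the variable model size and the dip-based clustering named in the abstract presumably address cases where two Gaussians fit the feature distribution poorly, which this sketch does not attempt.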
