Journal: Circuits, Systems, and Signal Processing

An Experimental Study on the Significance of Variable Frame-Length and Overlap in the Context of Children's Speech Recognition


Abstract

It is well known that the recognition performance of an automatic speech recognition (ASR) system is affected by intra-speaker as well as inter-speaker variability. Differences among speakers in the geometry of the vocal organs, pitch, and speaking rate are examples of inter-speaker variability that affect recognition performance. A mismatch between the training and test data with respect to any of these factors leads to increased error rates. An example of acoustically mismatched ASR is the task of transcribing children's speech on a system trained on adult data. A large number of earlier studies present techniques for addressing the acoustic mismatch arising from differences in pitch and in the dimensions of the vocal organs. In contrast, only a few works on speaking-rate adaptation employing timescale modification have been reported. Furthermore, those studies were performed on ASR systems developed using Gaussian mixture models. Motivated by these facts, this work explores speaking-rate adaptation in the context of a children's ASR system employing deep neural network-based acoustic modeling. Speaking-rate adaptation is performed by changing the frame-length and overlap during front-end feature extraction. This adaptation yields significant reductions in errors. In addition, we study the effect of combining speaking-rate adaptation with vocal-tract length normalization and explicit pitch modification. In both cases, additive improvements are obtained. To summarize, relative improvements of 15-20% over the baselines are obtained by varying the frame-length and frame-overlap.
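To make the idea of varying frame-length and overlap concrete, the following is a minimal sketch of the framing step in a generic front end. It is not the authors' implementation: the function name `frame_signal`, the Hamming window, and the specific millisecond values are assumptions chosen for illustration, since the abstract does not state the settings used in the paper.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms, overlap_ms):
    """Split a 1-D signal into Hamming-windowed frames (hypothetical helper)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    overlap = int(round(sample_rate * overlap_ms / 1000.0))
    hop = frame_len - overlap  # frame shift in samples
    if hop <= 0:
        raise ValueError("overlap must be shorter than the frame length")
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

sr = 16000
# One second of noise as a stand-in for a speech waveform.
x = np.random.default_rng(0).standard_normal(sr)

# Illustrative settings only (assumed, not taken from the paper).
baseline = frame_signal(x, sr, frame_ms=25, overlap_ms=15)  # 10 ms shift
adapted = frame_signal(x, sr, frame_ms=20, overlap_ms=15)   # 5 ms shift

print(baseline.shape, adapted.shape)
```

With the overlap held fixed, a shorter frame also means a shorter frame shift, so the same utterance produces more frames. The acoustic model then sees the utterance as a longer feature sequence, which is the sense in which framing parameters can stand in for a change in speaking rate.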
