Conference: IEEE Automatic Speech Recognition and Understanding Workshop

A Unified Endpointer Using Multitask and Multidomain Training

Abstract

In speech recognition systems, endpointers generally play different roles for long-form speech and for voice queries: speech detection in the former and query endpoint detection in the latter. Speech detection is useful for segmentation and pre-filtering in long-form speech processing. Query endpoint detection, on the other hand, predicts when to stop listening and send the audio received so far for action. It thus determines system latency and is an essential component of interactive voice systems. For both tasks, the endpointer needs to be robust in challenging environments, including noisy conditions, reverberant environments, and environments with background speech, and it has to generalize well to different domains with different speaking styles and rhythms. This work investigates building a unified endpointer by folding the separate speech detection and query endpoint detection tasks into a single neural network model through multitask learning. A categorical domain representation is further incorporated into the model to encourage learning domain-specific information. Compared with simply pooling all the data together, the final unified model improves latency by around 100 ms (18% relative) for near-field voice queries and 150 ms (21% relative) for far-field voice queries; compared with a standalone speech detection model, it reduces the frame error rate on long-form speech by 7% relative. The proposed approach also shows good robustness to noisy environments and yields a 180 ms latency improvement on voice queries from an unseen domain.
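The multitask idea described in the abstract can be pictured with a small sketch. This is not the authors' architecture, which the abstract does not specify. It assumes a single shared hidden layer, a one-hot domain vector concatenated to each acoustic frame, two softmax heads (speech detection and end-of-query detection), and a weighted sum of two cross-entropy losses. The domain names, layer sizes, class layouts, and the `eoq_weight` parameter are all hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical domain inventory; the abstract does not list the actual categories.
DOMAINS = ["near_field_query", "far_field_query", "long_form"]


def one_hot(domain):
    """Encode a categorical domain label as a one-hot vector (an assumed encoding)."""
    return [1.0 if d == domain else 0.0 for d in DOMAINS]


def make_layer(n_in, n_out, rng):
    """Small random weights (one row per output unit) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(x, layer):
    """Compute W x + b for one input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class UnifiedEndpointerSketch:
    """Shared encoder with two task heads; all sizes are illustrative assumptions."""

    def __init__(self, n_features=4, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # The shared layer sees the acoustic frame plus the domain one-hot vector.
        self.shared = make_layer(n_features + len(DOMAINS), n_hidden, rng)
        self.speech_head = make_layer(n_hidden, 2, rng)  # [non-speech, speech]
        self.eoq_head = make_layer(n_hidden, 2, rng)     # [keep listening, end of query]

    def forward(self, frame, domain):
        x = list(frame) + one_hot(domain)
        h = [math.tanh(v) for v in affine(x, self.shared)]
        return softmax(affine(h, self.speech_head)), softmax(affine(h, self.eoq_head))


def multitask_loss(speech_probs, eoq_probs, speech_label, eoq_label, eoq_weight=1.0):
    """Weighted sum of the two per-task cross-entropy losses for one frame."""
    speech_ce = -math.log(speech_probs[speech_label])
    eoq_ce = -math.log(eoq_probs[eoq_label])
    return speech_ce + eoq_weight * eoq_ce


# Usage: one forward pass on a made-up 4-dimensional frame.
model = UnifiedEndpointerSketch()
speech_p, eoq_p = model.forward([0.2, -0.1, 0.5, 0.0], "far_field_query")
loss = multitask_loss(speech_p, eoq_p, speech_label=1, eoq_label=0)
```

Under these assumptions, both heads share the same hidden representation, so gradients from either task would update the shared layer during training, while the domain vector lets the shared layer condition on where the audio came from. A real system would use a sequence model over many frames and trained weights rather than the random initialization shown here.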
