Venue: IEEE Automatic Speech Recognition and Understanding Workshop

A Unified Endpointer Using Multitask and Multidomain Training

Abstract

In speech recognition systems, we generally differentiate the role of endpointers between long-form speech and voice queries, where they are responsible for speech detection and query endpoint detection, respectively. Detection of speech is useful for segmentation and pre-filtering in long-form speech processing. Query endpoint detection, on the other hand, predicts when to stop listening and send the audio received so far for action. It thus determines system latency and is an essential component of interactive voice systems. For both tasks, the endpointer needs to be robust in challenging environments, including noisy conditions, reverberant environments and environments with background speech, and it has to generalize well to different domains with different speaking styles and rhythms. This work investigates building a unified endpointer by folding the separate speech detection and query endpoint detection tasks into a single neural network model through multitask learning. A categorical domain representation is further incorporated into the model to encourage learning domain-specific information. Compared with simply pooling all the data together, the final unified model achieves a latency improvement of around 100 ms (18% relative) for near-field voice queries and 150 ms (21% relative) for far-field voice queries. For long-form speech, it achieves a 7% relative frame error rate reduction compared to a standalone speech detection model. The proposed approach also shows good robustness to noisy environments and yields a 180 ms latency improvement on voice queries from an unseen domain.
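The abstract does not specify the network architecture, so the following is only a minimal sketch of the general idea: a shared layer feeds two task heads (speech detection and query endpoint detection), a one-hot domain vector is appended to each frame's features as the categorical domain representation, and the two per-task losses are combined into one multitask objective. The class `UnifiedEndpointerSketch`, the domain names, the layer sizes, the class counts, and the loss weight are all illustrative assumptions, not details from the paper.

```python
import math
import random

# Hypothetical domain labels; the paper's actual domain inventory is not given here.
DOMAINS = ["near_field", "far_field", "long_form"]


def one_hot(index, size):
    """Categorical domain representation as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class UnifiedEndpointerSketch:
    """Assumed toy model: shared layer plus two task-specific heads."""

    def __init__(self, n_features, n_hidden=8, n_speech=2, n_endpoint=2, seed=0):
        rng = random.Random(seed)
        n_in = n_features + len(DOMAINS)  # frame features + domain one-hot
        self.shared = init_layer(rng, n_hidden, n_in)
        self.speech_head = init_layer(rng, n_speech, n_hidden)
        self.endpoint_head = init_layer(rng, n_endpoint, n_hidden)

    def forward(self, frame, domain):
        """Return (speech-detection posteriors, endpoint posteriors) for one frame."""
        x = list(frame) + one_hot(DOMAINS.index(domain), len(DOMAINS))
        hidden = [math.tanh(h) for h in linear(*self.shared, x)]
        speech = softmax(linear(*self.speech_head, hidden))
        endpoint = softmax(linear(*self.endpoint_head, hidden))
        return speech, endpoint


def multitask_loss(speech_probs, endpoint_probs, speech_label, endpoint_label,
                   speech_weight=0.5):
    """Weighted sum of the two per-task cross-entropies (weight is an assumption)."""
    speech_ce = -math.log(speech_probs[speech_label])
    endpoint_ce = -math.log(endpoint_probs[endpoint_label])
    return speech_weight * speech_ce + (1.0 - speech_weight) * endpoint_ce
```

A single frame can be scored with `UnifiedEndpointerSketch(n_features=4).forward([0.1, -0.2, 0.3, 0.0], "far_field")`. Because both heads read the same hidden layer, gradients from either task would update the shared weights during training, which is the sense in which the two tasks are folded into one model.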