Journal: Computer Speech and Language

An improved two-stage mixed language model approach for handling out-of-vocabulary words in large vocabulary continuous speech recognition

Abstract

This paper presents a two-stage mixed language model technique for detecting and recognizing words that are not included in the vocabulary of a large vocabulary continuous speech recognition system. The main idea is, in the first stage, to spot the out-of-vocabulary (OOV) words and transcribe them in terms of subword units with the help of a mixed word/subword language model and, in the second stage, to convert the subword transcriptions to word hypotheses by means of a look-up table. The performance of the proposed approach is compared to that of the state-of-the-art hybrid method reported in the literature, on both in-domain and out-of-domain Dutch spoken material, where the term 'domain' refers to the ensemble of topics covered in the material from which the lexicon and language model were derived. It turns out that the proposed approach is at least as effective as the hybrid approach when it comes to recognizing in-domain material, and significantly more effective when applied to out-of-domain data. This shows that the proposed approach is easily adaptable to new domains and to new words (e.g. proper names) in the same domain. On the out-of-domain recognition task, the word error rate could be reduced by 12% relative to a baseline system incorporating a 100k word vocabulary and a basic garbage OOV word model.
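
The second stage described in the abstract, converting subword transcriptions to word hypotheses through a look-up table, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the syllable-like subword units, the table entries, the `(token, is_subword)` format of the first-stage output, and the fallback to the nearest table entry by edit distance when no exact match exists. Consecutive subword units are treated as a single OOV word, which is a simplification.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical look-up table: subword-unit sequence -> word.
# Units and entries are invented for illustration only.
LOOKUP: Dict[Tuple[str, ...], str] = {
    ("am", "ster", "dam"): "Amsterdam",
    ("rot", "ter", "dam"): "Rotterdam",
}


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def subwords_to_word(units: Sequence[str],
                     table: Dict[Tuple[str, ...], str]) -> str:
    """Map one subword transcription to a word hypothesis."""
    key = tuple(units)
    if key in table:
        return table[key]
    # Assumption: with no exact match, pick the closest table entry.
    nearest = min(table, key=lambda k: edit_distance(k, key))
    return table[nearest]


def second_stage(first_pass: List[Tuple[str, bool]],
                 table: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace each run of subword units with a word from the table.

    first_pass holds (token, is_subword) pairs, a hypothetical format
    for the output of the mixed word/subword first stage.
    """
    words: List[str] = []
    pending: List[str] = []
    for token, is_subword in first_pass:
        if is_subword:
            pending.append(token)
            continue
        if pending:
            words.append(subwords_to_word(pending, table))
            pending = []
        words.append(token)
    if pending:
        words.append(subwords_to_word(pending, table))
    return words


hypothesis = [("ik", False), ("woon", False), ("in", False),
              ("am", True), ("ster", True), ("dam", True)]
print(second_stage(hypothesis, LOOKUP))
```

Because new words enter only through the table, adapting such a system to a new domain would amount to adding table entries, which is consistent with the adaptability the abstract reports.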
