Journal: Procedia Computer Science
Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition

Abstract

The language model is an important module in many applications that produce natural language text, in particular speech recognition. Training language models requires large amounts of textual data that match the target domain. The selection of target-domain (or in-domain) data has been investigated in the past. For example, [1] proposed a criterion based on the difference in cross-entropy between models representing in-domain and non-domain-specific data. However, evaluations were conducted using only two sources of data: one corresponding to the in-domain data, and a generic one from which sentences are selected. In the scope of transcription systems for broadcast news and TV shows, language models are built by interpolating several language models estimated from various data sources. This paper investigates the data selection process in this context of building interpolated language models for speech transcription. Results show that, in the selection process, the choice of the language models representing in-domain and non-domain-specific data is critical. Moreover, it is better to apply data selection to only some of the data sources. Applied this way, the selection process leads to an improvement of 8.3 in terms of perplexity and 0.2% in terms of word-error rate on the French broadcast transcription task.
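The cross-entropy difference criterion mentioned in the abstract can be illustrated with a minimal sketch. This is not the implementation from [1] or from this paper: it assumes add-one-smoothed unigram models in place of real n-gram language models, and the function names, the toy sentences, and the `threshold` parameter are all hypothetical. Each candidate sentence is scored by its per-word cross-entropy under the in-domain model minus its per-word cross-entropy under the generic model, and sentences with a score below the threshold are kept.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Return add-one-smoothed unigram log-probabilities over `vocab`."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def cross_entropy(sentence, logprob):
    """Per-word cross-entropy (in nats) of `sentence` under a unigram model."""
    words = sentence.split()
    return -sum(logprob[w] for w in words) / len(words)


def select(candidates, in_domain, generic, threshold=0.0):
    """Keep candidates whose cross-entropy difference is below `threshold`.

    The score is H_in(s) - H_generic(s); a lower score means the sentence
    looks more like in-domain data than like generic data.
    """
    candidates = [s for s in candidates if s.split()]  # skip empty lines
    vocab = {w for s in in_domain + generic + candidates for w in s.split()}
    lp_in = train_unigram(in_domain, vocab)
    lp_gen = train_unigram(generic, vocab)
    scored = [
        (cross_entropy(s, lp_in) - cross_entropy(s, lp_gen), s)
        for s in candidates
    ]
    return [s for score, s in sorted(scored) if score < threshold]


# Toy data, invented purely for illustration.
in_domain = [
    "the minister announced new measures today",
    "the president spoke on television tonight",
]
generic = [
    "buy cheap shoes online now",
    "click here to win a free phone",
    "the best cheap phone deals online",
]
candidates = [
    "the president announced measures on television",
    "win cheap shoes online now",
]

selected = select(candidates, in_domain, generic)
print(selected)
```

In the setting the abstract describes, such a selection would be run separately on each data source chosen for filtering, and the resulting per-source models would then be interpolated. The abstract's finding that the choice of the two reference models is critical corresponds here to what is passed as `in_domain` and `generic`.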
