IEEE Transactions on Audio, Speech, and Language Processing

An Iterative Relative Entropy Minimization-Based Data Selection Approach for n-Gram Model Adaptation

Abstract

Performance of statistical n-gram language models depends heavily on the amount of training text available and the degree to which the training text matches the domain of interest. The language modeling community is showing a growing interest in using large collections of text (obtainable, for example, from a diverse set of resources on the Internet) to supplement sparse in-domain resources. However, in most cases the style and content of text harvested from the web differ significantly from the specific nature of these domains. In this paper, we present a relative entropy-based method for selecting subsets of sentences whose n-gram distribution matches the domain of interest. We present results on language model adaptation using two speech recognition tasks: a medium-vocabulary medical-domain doctor-patient dialog system and a large-vocabulary transcription system for European parliamentary plenary speeches (EPPS). We show that the proposed subset selection scheme leads to performance improvements over state-of-the-art speech recognition systems in terms of both speech recognition word error rate (WER) and language model perplexity (PPL).
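The abstract describes the selection criterion only at a high level. Purely as an illustrative sketch of the general idea, and not the authors' actual algorithm, the following Python fragment greedily accepts a candidate sentence only when adding its n-grams reduces the relative entropy between the in-domain n-gram distribution and the distribution of the selected pool. The function names, the additive smoothing, and the single-pass greedy structure are all assumptions of this sketch.

```python
import math
from collections import Counter

def ngram_counts(sentences, n=2):
    """Aggregate n-gram counts over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

def kl_to_target(target, pool, smooth=1.0):
    """Relative entropy KL(target || pool), with additive smoothing so
    n-grams unseen in the pool still get nonzero probability."""
    t_total = sum(target.values())
    p_total = sum(pool.values()) + smooth * len(target)
    kl = 0.0
    for gram, t_count in target.items():
        p = t_count / t_total
        q = (pool.get(gram, 0) + smooth) / p_total
        kl += p * math.log(p / q)
    return kl

def greedy_select(in_domain, candidates, n=2):
    """Single greedy pass: a candidate sentence is kept only if adding its
    n-grams moves the selected pool closer (in KL) to the in-domain data."""
    target = ngram_counts(in_domain, n)
    pool, selected = Counter(), []
    best = kl_to_target(target, pool)
    for tokens in candidates:
        trial = pool + ngram_counts([tokens], n)
        score = kl_to_target(target, trial)
        if score < best:  # this sentence reduces the divergence; keep it
            pool, best = trial, score
            selected.append(tokens)
    return selected
```

A driver would tokenize both corpora, call greedy_select, and could approximate the iterative aspect by repeating the pass over a reshuffled candidate list until the KL reduction stalls; the paper's actual iterative procedure and its smoothing details are more involved than this sketch.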
