Computer Speech and Language

Random forests and the data sparseness problem in language modeling


Abstract

Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The solution of these aspects is hindered by the data sparseness problem. Application of random forests (RFs) to language modeling deals with the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This new method is complementary to many existing ones dealing with the data sparseness problem. We study our RF approach in the context of n-gram type language modeling in which n − 1 words are present in a history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our RF language models are superior to the best known smoothing technique, the interpolated Kneser-Ney smoothing, in reducing both the perplexity (PPL) and word error rate (WER) in large vocabulary state-of-the-art speech recognition systems. In particular, we will show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach only to one of its many language models.
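The abstract only outlines the approach, so the following is a minimal sketch of the core idea under stated assumptions: each randomly grown decision tree maps an (n − 1)-word history to an equivalence class, word probabilities are estimated per class, and the forest averages the per-tree estimates. The names (RandomHistoryTree, RandomForestLM), the random-projection stand-in for the paper's actual DT-growing procedure, and the add-one smoothing placeholder are all illustrative assumptions, not the paper's algorithms.

import random
from collections import defaultdict

class RandomHistoryTree:
    """One 'tree', collapsed here to a random projection of the history.

    A real DT would recursively split histories by asking questions about
    individual history positions; this stand-in simply keeps a random subset
    of positions as the equivalence-class key.
    """
    def __init__(self, order, rng):
        positions = list(range(order - 1))
        k = rng.randint(1, order - 1)          # how many positions to keep
        self.kept = sorted(rng.sample(positions, k))
        self.counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)
        self.vocab = set()

    def classify(self, history):
        # Map an (order-1)-word history to its equivalence class.
        return tuple(history[i] for i in self.kept)

    def train(self, sentences, order):
        for sent in sentences:
            padded = ["<s>"] * (order - 1) + sent + ["</s>"]
            for i in range(order - 1, len(padded)):
                h = self.classify(tuple(padded[i - order + 1:i]))
                w = padded[i]
                self.counts[h][w] += 1
                self.class_totals[h] += 1
                self.vocab.add(w)

    def prob(self, history, word):
        h = self.classify(history)
        v = len(self.vocab)
        # Add-one smoothing as a placeholder for the paper's DT smoothing.
        return (self.counts[h][word] + 1) / (self.class_totals[h] + v)

class RandomForestLM:
    def __init__(self, order=3, n_trees=10, seed=0):
        rng = random.Random(seed)               # each tree draws different positions
        self.order = order
        self.trees = [RandomHistoryTree(order, rng) for _ in range(n_trees)]

    def train(self, sentences):
        for t in self.trees:
            t.train(sentences, self.order)

    def prob(self, history, word):
        # The forest estimate is the average of the per-tree estimates.
        return sum(t.prob(history, word) for t in self.trees) / len(self.trees)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
lm = RandomForestLM(order=3, n_trees=5)
lm.train(corpus)
print(lm.prob(("the", "cat"), "sat"))

Because each tree clusters histories differently, a history unseen by one tree's classes may still be well covered by another's; averaging over the trees is what gives the forest its potential to generalize to unseen data, as the abstract claims for histories longer than four words.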