
Random forests and the data sparseness problem in language modeling.



Abstract

Language modeling is the problem of predicting words from histories of words already seen. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation; the data sparseness problem in language modeling arises from both. Although work has been done on each aspect separately, few solutions address the two jointly.

We explore the use of Random Forests (RFs) in language modeling to deal with these two key aspects together. The goal of this work is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs) and to apply the resulting RF language models to automatic speech recognition. The new technique is complementary to many existing techniques for the data sparseness problem.

After presenting our approach to efficient DT construction, we study the RF approach in the context of n-gram language modeling, in which a history consists of the preceding n-1 words. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories contain more than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser-Ney smoothing, in reducing both perplexity (PPL) and word error rate (WER) in large-vocabulary speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.

The new technique developed in this work is general. We show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).
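The core idea summarized above, several randomly grown decision trees that each cluster histories into equivalence classes, with the forest averaging the trees' probability estimates, can be illustrated with a toy bigram model. The sketch below is not the dissertation's algorithm (which grows DTs by randomized, likelihood-driven question selection); the purely random history split and add-one smoothing are simplifications for illustration, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def train_tree(bigrams, vocab, seed):
    """Grow one toy 'decision tree' over one-word histories: a random
    two-way partition of history words into equivalence classes, with an
    add-one-smoothed estimate of P(word | class)."""
    rng = random.Random(seed)
    cls = {h: rng.randint(0, 1) for h in vocab}      # random history split
    counts = defaultdict(lambda: defaultdict(int))   # class -> word -> count
    for h, w in bigrams:
        counts[cls[h]][w] += 1
    vocab_size = len(vocab)

    def prob(h, w):
        c = counts[cls[h]]
        return (c.get(w, 0) + 1) / (sum(c.values()) + vocab_size)
    return prob

def rf_prob(trees, h, w):
    """Random-forest estimate: average the individual trees' probabilities."""
    return sum(t(h, w) for t in trees) / len(trees)

def perplexity(trees, test_bigrams):
    """PPL = exp(average negative log-probability) over held-out pairs."""
    nll = sum(-math.log(rf_prob(trees, h, w)) for h, w in test_bigrams)
    return math.exp(nll / len(test_bigrams))
```

Each tree's distribution sums to one over the vocabulary, so the averaged forest estimate is itself a proper distribution; averaging many differently clustered trees is what lets the forest assign sensible probabilities to histories that any single tree's clustering handles poorly.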

Bibliographic details

  • Author: Xu, Peng
  • Affiliation: The Johns Hopkins University
  • Degree grantor: The Johns Hopkins University
  • Subject: Engineering, Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2005
  • Pages: 119 p.
  • Total pages: 119
  • Format: PDF
  • Language: eng
  • CLC classification: Radio electronics and telecommunications
  • Keywords
