
Random forests and the data sparseness problem in language modeling.



Abstract

Language modeling is the problem of predicting words from histories of words already seen. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation; the data sparseness problem in language modeling arises from both. Although work has been done on each aspect separately, few solutions address the two jointly.

We explore the use of Random Forests (RFs) in language modeling to deal with these two key aspects together. The goal of this work is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs) and to apply the resulting RF language models to automatic speech recognition. The new technique is complementary to many existing techniques for the data sparseness problem.

After presenting our approach to efficient DT construction, we study the RF approach in the context of n-gram language modeling, in which a history consists of the preceding n-1 words. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories contain more than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser-Ney smoothing, in reducing both perplexity (PPL) and word error rate (WER) in large-vocabulary speech recognition systems. In particular, we show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.

The new technique developed in this work is general. We show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).
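The core idea summarized above, several randomly grown decision trees that each cluster histories into equivalence classes, with the forest averaging the trees' probability estimates, can be illustrated with a toy bigram model. The sketch below is not the dissertation's algorithm (which grows DTs by randomized, likelihood-driven question selection); the purely random history split and add-one smoothing are simplifications for illustration, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def train_tree(bigrams, vocab, seed):
    """Grow one toy 'decision tree' over one-word histories: a random
    two-way partition of history words into equivalence classes, with an
    add-one-smoothed estimate of P(word | class)."""
    rng = random.Random(seed)
    cls = {h: rng.randint(0, 1) for h in vocab}      # random history split
    counts = defaultdict(lambda: defaultdict(int))   # class -> word -> count
    for h, w in bigrams:
        counts[cls[h]][w] += 1
    vocab_size = len(vocab)

    def prob(h, w):
        c = counts[cls[h]]
        return (c.get(w, 0) + 1) / (sum(c.values()) + vocab_size)
    return prob

def rf_prob(trees, h, w):
    """Random-forest estimate: average the individual trees' probabilities."""
    return sum(t(h, w) for t in trees) / len(trees)

def perplexity(trees, test_bigrams):
    """PPL = exp(average negative log-probability) over held-out pairs."""
    nll = sum(-math.log(rf_prob(trees, h, w)) for h, w in test_bigrams)
    return math.exp(nll / len(test_bigrams))
```

Each tree's distribution sums to one over the vocabulary, so the averaged forest estimate is itself a proper distribution; averaging many differently clustered trees is what lets the forest assign sensible probabilities to histories that any single tree's clustering handles poorly.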

Bibliographic details

  • Author: Xu, Peng
  • Affiliation: The Johns Hopkins University
  • Degree grantor: The Johns Hopkins University
  • Subject: Engineering, Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2005
  • Pages: 119 p.
  • Total pages: 119
  • Format: PDF
  • Language: eng
  • CLC classification: Radio electronics and telecommunications
  • Keywords
