Computer Speech and Language

Random forests and the data sparseness problem in language modeling


Abstract

Language modeling is the problem of predicting words based on histories containing words already hypothesized. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The solution of these aspects is hindered by the data sparseness problem. Application of random forests (RFs) to language modeling deals with the two aspects simultaneously. We develop a new smoothing technique based on randomly grown decision trees (DTs) and apply the resulting RF language models to automatic speech recognition. This new method is complementary to many existing ones dealing with the data sparseness problem. We study our RF approach in the context of n-gram type language modeling in which n − 1 words are present in a history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories are longer than four words. We show that our RF language models are superior to the best known smoothing technique, the interpolated Kneser-Ney smoothing, in reducing both the perplexity (PPL) and word error rate (WER) in large vocabulary state-of-the-art speech recognition systems. In particular, we will show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach only to one of its many language models.
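The abstract only outlines the approach, so the following is a minimal sketch of the core idea under stated assumptions: each randomly grown decision tree maps an (n − 1)-word history to an equivalence class, word probabilities are estimated per class, and the forest averages the per-tree estimates. The names (RandomHistoryTree, RandomForestLM), the random-projection stand-in for the paper's actual DT-growing procedure, and the add-one smoothing placeholder are all illustrative assumptions, not the paper's algorithms.

import random
from collections import defaultdict

class RandomHistoryTree:
    """One 'tree', collapsed here to a random projection of the history.

    A real DT would recursively split histories by asking questions about
    individual history positions; this stand-in simply keeps a random subset
    of positions as the equivalence-class key.
    """
    def __init__(self, order, rng):
        positions = list(range(order - 1))
        k = rng.randint(1, order - 1)          # how many positions to keep
        self.kept = sorted(rng.sample(positions, k))
        self.counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)
        self.vocab = set()

    def classify(self, history):
        # Map an (order-1)-word history to its equivalence class.
        return tuple(history[i] for i in self.kept)

    def train(self, sentences, order):
        for sent in sentences:
            padded = ["<s>"] * (order - 1) + sent + ["</s>"]
            for i in range(order - 1, len(padded)):
                h = self.classify(tuple(padded[i - order + 1:i]))
                w = padded[i]
                self.counts[h][w] += 1
                self.class_totals[h] += 1
                self.vocab.add(w)

    def prob(self, history, word):
        h = self.classify(history)
        v = len(self.vocab)
        # Add-one smoothing as a placeholder for the paper's DT smoothing.
        return (self.counts[h][word] + 1) / (self.class_totals[h] + v)

class RandomForestLM:
    def __init__(self, order=3, n_trees=10, seed=0):
        rng = random.Random(seed)               # each tree draws different positions
        self.order = order
        self.trees = [RandomHistoryTree(order, rng) for _ in range(n_trees)]

    def train(self, sentences):
        for t in self.trees:
            t.train(sentences, self.order)

    def prob(self, history, word):
        # The forest estimate is the average of the per-tree estimates.
        return sum(t.prob(history, word) for t in self.trees) / len(self.trees)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
lm = RandomForestLM(order=3, n_trees=5)
lm.train(corpus)
print(lm.prob(("the", "cat"), "sat"))

Because each tree clusters histories differently, a history unseen by one tree's classes may still be well covered by another's; averaging over the trees is what gives the forest its potential to generalize to unseen data, as the abstract claims for histories longer than four words.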