International Conference on Advances in Natural Language Processing

Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study



Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to attempt to combine these two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has however been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
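The core idea of the indirect method, sampling training sentences from a grammar, can be sketched as follows. This is a minimal, self-contained Python illustration with a hypothetical toy grammar; it is not the paper's Regulus grammar or its actual generation code. PCFG generation weights each production by a probability, whereas plain CFG generation would treat all productions as equally likely.

```python
import random

# Toy PCFG (hypothetical, for illustration only): each nonterminal maps
# to a list of (right-hand side, probability) pairs.
PCFG = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 1.0)],
    "VP": [(["V", "NP"], 0.6), (["V"], 0.4)],
    "N":  [(["light"], 0.5), (["fan"], 0.5)],
    "V":  [(["dim"], 0.5), (["stop"], 0.5)],
}

def sample(symbol="S", rng=random):
    """Expand `symbol` top-down, choosing productions by probability.

    CFG-style generation would correspond to ignoring the weights
    (uniform choice over productions).
    """
    if symbol not in PCFG:          # terminal symbol: emit the word
        return [symbol]
    rhss, probs = zip(*PCFG[symbol])
    rhs = rng.choices(rhss, weights=probs, k=1)[0]
    words = []
    for sym in rhs:
        words.extend(sample(sym, rng))
    return words

# Generate an artificial corpus, which would then be used to train
# a statistical language model.
corpus = [" ".join(sample()) for _ in range(1000)]
```

The sampled sentences follow the grammar's distribution, so frequent constructions appear proportionally often in the artificial corpus, which is the property that plausibly explains why PCFG generation outscored CFG generation in the paper's evaluation.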

