Advances in Natural Language Processing

Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study

Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to combine the two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has, however, been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the open-source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
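To make the PCFG-versus-CFG distinction concrete, here is a minimal sketch of probability-weighted sentence generation of the kind the abstract describes. The toy grammar and symbol names are hypothetical illustrations, not the paper's actual Regulus-compiled grammars; the point is only that each rule carries a probability, so frequent constructions dominate the generated training corpus.

```python
# Sketch: sampling SLM training sentences from a toy PCFG.
# Hypothetical grammar for illustration; not the paper's Regulus grammar.
import random

# Each nonterminal maps to a list of (right-hand side, probability);
# probabilities for each nonterminal sum to 1.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "light"), 0.6), (("the", "fan"), 0.4)],
    "VP": [(("is", "on"), 0.5), (("is", "off"), 0.5)],
}

def sample(symbol="S"):
    """Expand a symbol top-down, picking each rule with its PCFG probability."""
    if symbol not in PCFG:           # terminal symbol: emit the word itself
        return [symbol]
    rules, probs = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=probs, k=1)[0]
    words = []
    for sym in rhs:
        words.extend(sample(sym))
    return words

# Generate a synthetic corpus on which an n-gram SLM could be trained.
corpus = [" ".join(sample()) for _ in range(1000)]
print(corpus[:3])
```

Plain CFG generation would correspond to enumerating or sampling the same rules uniformly, so rare and common constructions appear equally often in the training data; the weighted sampling above is what lets PCFG generation approximate realistic word frequencies, consistent with the abstract's finding that it heavily outscores CFG generation.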
