Advances in Natural Language Processing

Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study

Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to combine the two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has, however, been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the open-source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
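To make the PCFG-versus-CFG distinction concrete, here is a minimal sketch of probability-weighted sentence generation of the kind the abstract describes. The toy grammar and symbol names are hypothetical illustrations, not the paper's actual Regulus-compiled grammars; the point is only that each rule carries a probability, so frequent constructions dominate the generated training corpus.

```python
# Sketch: sampling SLM training sentences from a toy PCFG.
# Hypothetical grammar for illustration; not the paper's Regulus grammar.
import random

# Each nonterminal maps to a list of (right-hand side, probability);
# probabilities for each nonterminal sum to 1.
PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("the", "light"), 0.6), (("the", "fan"), 0.4)],
    "VP": [(("is", "on"), 0.5), (("is", "off"), 0.5)],
}

def sample(symbol="S"):
    """Expand a symbol top-down, picking each rule with its PCFG probability."""
    if symbol not in PCFG:           # terminal symbol: emit the word itself
        return [symbol]
    rules, probs = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=probs, k=1)[0]
    words = []
    for sym in rhs:
        words.extend(sample(sym))
    return words

# Generate a synthetic corpus on which an n-gram SLM could be trained.
corpus = [" ".join(sample()) for _ in range(1000)]
print(corpus[:3])
```

Plain CFG generation would correspond to enumerating or sampling the same rules uniformly, so rare and common constructions appear equally often in the training data; the weighted sampling above is what lets PCFG generation approximate realistic word frequencies, consistent with the abstract's finding that it heavily outscores CFG generation.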
