International Conference on Advances in Natural Language Processing

Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study



Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to attempt to combine these two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has however been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
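The core idea of the indirect method, sampling training sentences from a grammar, can be sketched as follows. This is a minimal, self-contained Python illustration with a hypothetical toy grammar; it is not the paper's Regulus grammar or its actual generation code. PCFG generation weights each production by a probability, whereas plain CFG generation would treat all productions as equally likely.

```python
import random

# Toy PCFG (hypothetical, for illustration only): each nonterminal maps
# to a list of (right-hand side, probability) pairs.
PCFG = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 1.0)],
    "VP": [(["V", "NP"], 0.6), (["V"], 0.4)],
    "N":  [(["light"], 0.5), (["fan"], 0.5)],
    "V":  [(["dim"], 0.5), (["stop"], 0.5)],
}

def sample(symbol="S", rng=random):
    """Expand `symbol` top-down, choosing productions by probability.

    CFG-style generation would correspond to ignoring the weights
    (uniform choice over productions).
    """
    if symbol not in PCFG:          # terminal symbol: emit the word
        return [symbol]
    rhss, probs = zip(*PCFG[symbol])
    rhs = rng.choices(rhss, weights=probs, k=1)[0]
    words = []
    for sym in rhs:
        words.extend(sample(sym, rng))
    return words

# Generate an artificial corpus, which would then be used to train
# a statistical language model.
corpus = [" ".join(sample()) for _ in range(1000)]
```

The sampled sentences follow the grammar's distribution, so frequent constructions appear proportionally often in the artificial corpus, which is the property that plausibly explains why PCFG generation outscored CFG generation in the paper's evaluation.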

