Conference: Speech Technology and Human-Computer Dialogue, 2009. SpeD '09

Text conditioning and statistical language modeling for Romanian language

Abstract

In this paper we present a synthesis of the theoretical fundamentals and some practical aspects of statistical (n-gram) language modeling, which is a main component of a large-vocabulary statistical speech recognition system. We present unigram, bigram, and trigram language models, as well as the Katz back-off smoothing algorithm based on the Good-Turing estimator. We also describe perplexity, the measure used to evaluate a language model. The practical experiments were carried out on a Romanian constitution corpus, and we describe the text normalization steps applied before language model generation. The results are ARPA-MIT format language models for the Romanian language. The models were tested and compared using the perplexity measure. Finally, some comparisons are made between Romanian and English language modeling, and conclusions are drawn.
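To make the n-gram estimation and perplexity evaluation mentioned in the abstract concrete, the following is a minimal Python sketch of a bigram model. It is not the authors' implementation. It assumes pre-tokenized sentences and a closed vocabulary, and it substitutes add-one (Laplace) smoothing for the Good-Turing based Katz back-off named in the abstract, purely to keep the example short. The function names (`train_bigram`, `bigram_prob`, `perplexity`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = [BOS] + tokens + [EOS]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one estimate of P(word | prev). Assumption: NOT Katz back-off."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentences, unigrams, bigrams):
    """Perplexity = 2 ** (-average log2 probability per predicted token)."""
    # Predictable words are every observed type except the start marker.
    vocab_size = len(unigrams) - 1
    log_prob, n_tokens = 0.0, 0
    for tokens in sentences:
        padded = [BOS] + tokens + [EOS]
        for prev, word in zip(padded, padded[1:]):
            p = bigram_prob(prev, word, unigrams, bigrams, vocab_size)
            log_prob += math.log2(p)
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# Toy usage with made-up, already normalized sentences.
train = [["a", "b"], ["a", "c"]]
unigrams, bigrams = train_bigram(train)
print(perplexity([["a", "b"]], unigrams, bigrams))
```

A lower perplexity on held-out text indicates that the model assigns higher probability to that text, which is how models of different orders or smoothing methods can be compared.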