Performance of Czech Speech Recognition with Language Models Created from Public Resources

Abstract

In this paper, we investigate the usability of publicly available n-gram corpora for building language models (LMs) for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private collection of newspaper and broadcast texts gathered by a Czech media mining company. The LMs were analyzed and compared from a statistical point of view (mainly via their perplexity) and from a performance point of view when employed in large-vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization, cannot compete with those built from smaller but more consistent corpora. Experiments on large test data also illustrate the impact of Czech, as a highly inflective language, on perplexity, out-of-vocabulary (OOV) rate, and recognition accuracy.
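The two statistical measures named in the abstract, perplexity and OOV rate, can be illustrated with a minimal sketch. It is not the paper's method: it assumes a toy add-one smoothed unigram model rather than the 5-gram LMs studied, and the function names `oov_rate` and `unigram_perplexity` and the tiny word lists are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one smoothed unigram model on test_tokens.

    Assumption: all unseen words share a single <unk> type, so the
    smoothed probabilities sum to one.
    """
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # +1 for the shared <unk> type
    total = len(train_tokens)
    log_prob = 0.0
    for token in test_tokens:
        p = (counts.get(token, 0) + 1) / (total + vocab_size)
        log_prob += math.log2(p)
    # Perplexity is 2 raised to the average negative log2 probability.
    return 2 ** (-log_prob / len(test_tokens))


# Toy data: inflected forms of one Czech noun. A form that never occurs
# in training ("hradem") counts as OOV even though its lemma was seen.
train = ["hrad", "hradu", "hrad", "hradu"]
test = ["hrad", "hradem"]

print(oov_rate(test, set(train)))
print(unigram_perplexity(train, test))
```

In a highly inflective language, each lemma spreads over many surface forms, so a fixed-size vocabulary covers less of the test text and both numbers tend to rise.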
