Performance of Czech Speech Recognition with Language Models Created from Public Resources

Abstract

In this paper, we investigate the usability of publicly available n-gram corpora for creating language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from a statistical point of view (mainly via their perplexity rates) and from a performance point of view when employed in large-vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those built from smaller but more consistent corpora. The experiments, carried out on large test data, also illustrate the impact of Czech as a highly inflective language on the perplexity, out-of-vocabulary (OOV), and recognition accuracy rates.

Bibliographic record

Authors: Prochazka V.; Pollak P.; Zdansky J.; Nouza J.
Year: 2011
Original format: PDF
Text language: English (en)

Similar literature

1. Performance of Czech Speech Recognition with Language Models Created from Public Resources [J]. V. Prochazka, P. Pollak, J. Zdansky, Radioengineering. 2011, No. 4
2. Comparison of Performance of Enhanced Morpheme-based Language Model with Different Word-based Language Models for Improving the Performance of Tamil Speech Recognition System [J]. S. Saraswathi, T.V. Geetha, ACM Transactions on Asian Language Information Processing. 2007, No. 3
3. Modeling under-resourced languages for speech recognition [J]. Kurimo Mikko, Enarvi Seppo, Tilk Ottokar, Language Resources and Evaluation. 2017, No. 4
4. Language Model Support for Continuous Speech Recognition in Czech Language [C]. Dana Nejedlova, Jan Nouza, IASTED (the International Association of Science and Technology for Development) International Conference on Signal Processing, Pattern Recognition, and Application, Jun 25-28, 2002, Crete, Greece. 2002
5. Speech-Language Services in Public Schools: How Ambiguity in IDEA Eligibility Criteria Impacts Speech-Language Pathologists in a Litigious and Resource Constrained Environment [D]. Sylvan, Lesley. 2013
6. Retrospective Analysis of Clinical Performance of an Estonian Speech Recognition System for Radiology: Effects of Different Acoustic and Language Models [O]. A. Paats, T. Alumäe, E. Meister, 2018
7. Modeling under-resourced languages for speech recognition [O]. Kurimo, Mikko; Enarvi, Seppo; Tilk, Ottokar, 2016
