Conference: International Conference on Text, Speech and Dialogue

Downdating Lexicon and Language Model for Automatic Transcription of Czech Historical Spoken Documents


Abstract

This paper addresses the adaptation of an existing Czech large-vocabulary continuous speech recognition (LVCSR) system to the language used in earlier historical periods (before 1990). The goal is to adapt its lexicon and language model (LM) so that the system can be used for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts in electronic form from the 1945-1990 period. The only available source of sufficient size is the digitized copies of Rude Pravo, the newspaper of the former Communist Party of Czechoslovakia, the de facto ruling body of the state. The newspaper has been scanned and converted into text with OCR software. However, the number of OCR errors is very high, so we have to apply several text pre-processing techniques to obtain a corpus suitable for 'downdating' the lexicon and language model (i.e., adapting them to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings from 8.5 to 6.4 million, b) to identify 6.7 thousand history-conditioned word candidates to be added to the lexicon, and c) to build a more appropriate LM. The adapted LVCSR system was evaluated on broadcast news from 1969-1989, where its word error rate (WER) decreased from 17.05% to 14.33%.
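
The abstract reports its gains as absolute figures. As a minimal sketch of how they translate into relative terms, the snippet below computes the relative reduction for the two quoted pairs of numbers. The helper name `relative_reduction` is illustrative and does not come from the paper; the inputs are only the values stated in the abstract. Under these figures the WER drops by about 16% relative, and the out-of-vocabulary string count by about 25%.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after` as a fraction."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# Values quoted in the abstract (WER in percent, OOV strings in millions).
wer_before, wer_after = 17.05, 14.33
oov_before, oov_after = 8.5, 6.4

wer_rel = relative_reduction(wer_before, wer_after)
oov_rel = relative_reduction(oov_before, oov_after)

print(f"Relative WER reduction: {wer_rel:.1%}")
print(f"Relative OOV reduction: {oov_rel:.1%}")
```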
