Conference: International conference on image analysis and processing
Dealing with Bilingualism in Automatic Transcription of Historical Archive of Czech Radio

Abstract

One of the biggest challenges in the automatic transcription of the historical audio archive of Czech and Czechoslovak radio is bilingualism. Two closely related languages, Czech and Slovak, are mixed in many archive documents. Both were official languages of former Czechoslovakia (1918-1992) and both were used in the media. The two languages are considered similar, although they differ in more than 75% of their lexical inventories, which complicates automatic speech-to-text conversion. In this paper, we present and objectively measure the difference between the two languages. We then propose a method for automatically identifying two acoustically and lexically similar languages. It is based on two size-optimized parallel lexicons and language models. On large test data, we show that the two languages can be distinguished with almost 99% accuracy. Moreover, the language identification module can be easily incorporated into a two-pass decoding scheme at almost negligible additional computational cost. The proposed method has been employed in a project aimed at the disclosure of Czech and Czechoslovak oral cultural heritage.
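The general idea of identifying a language by scoring a hypothesis against two parallel language models can be illustrated with a minimal sketch. This is not the authors' system: the abstract does not describe the lexicons, models, or decoder in detail, so the example below assumes toy add-alpha smoothed unigram models, and the training sentences and the helper names `train_unigram` and `identify` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab_size, alpha=1.0):
    """Return a function giving add-alpha smoothed unigram log-probabilities."""
    counts = Counter(tokens)
    total = len(tokens)

    def logprob(word):
        # Unseen words get count 0 and still receive a small probability.
        return math.log((counts[word] + alpha) / (total + alpha * vocab_size))

    return logprob


def identify(hypothesis, models):
    """Pick the language whose model gives the highest total log-probability."""
    scores = {lang: sum(lp(w) for w in hypothesis) for lang, lp in models.items()}
    return max(scores, key=scores.get), scores


# Toy training text (hypothetical, not taken from the paper's data).
czech = "co je to a kde je ten dům který jsme viděli".split()
slovak = "čo je to a kde je ten dom ktorý sme videli".split()
vocab = set(czech) | set(slovak)

models = {
    "cs": train_unigram(czech, len(vocab)),
    "sk": train_unigram(slovak, len(vocab)),
}

# A first-pass hypothesis would be scored by both models; the better one wins.
lang, scores = identify("čo sme videli".split(), models)
print(lang, scores)
```

In a two-pass setting, such a decision could be taken on the first-pass output and used to select the lexicon and language model for the second pass; how the paper actually does this is not specified in the abstract.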