Journal: IEEE Transactions on Audio, Speech, and Language Processing

Bilingual Experiments on Automatic Recovery of Capitalization and Punctuation of Automatic Speech Transcripts


Abstract

This paper focuses on recovering capitalization and punctuation marks in texts that lack this information, such as the spoken transcripts produced by automatic speech recognition systems. These two practical rich transcription tasks were performed using the same discriminative approach, based on maximum entropy, which is suitable for on-the-fly usage. The reported experiments were conducted on both Portuguese and English broadcast news data. Both force-aligned and automatic transcripts were used, making it possible to measure the impact of speech recognition errors. Capitalized words and named entities are intrinsically related and are influenced by time variation effects. For that reason, the so-called language dynamics have been addressed for the capitalization task. Language adaptation results indicate, for both languages, that capitalization performance is affected by the temporal distance between the training and testing data. Regarding the punctuation task, this paper covers the three most frequent punctuation marks: full stop, comma, and question mark. Two methods were explored for improving the baseline results for full stop and comma. The first uses punctuation information extracted from large written corpora. The second applies different levels of linguistic structure, including lexical, prosodic, and speaker-related features. Comma detection improved significantly with the first method, indicating that it depends more on lexical features. The second method provided even better results for both languages and both punctuation marks, with the best results achieved mainly for full stop. As for question marks, there is a small gain, but the differences are not very significant because of the relatively small number of question marks in the corpora.
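
The abstract does not give the feature set or toolkit used, so the following is only a minimal sketch of the general idea: a maximum entropy classifier (equivalently, multinomial logistic regression) that decides, for the slot after each word, whether to insert nothing, a comma, or a full stop. The feature templates, label names, hyperparameters, and toy training sentences are all assumptions made for illustration, and only lexical features are used; prosodic and speaker-related features are left out.

```python
import math

# Hypothetical label set: what follows each word in the transcript.
LABELS = ["NONE", "COMMA", "FULL_STOP"]

def features(words, i):
    """Assumed lexical features for the slot after words[i]."""
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    return ["bias", "w=" + words[i], "next=" + nxt,
            "pair=" + words[i] + "_" + nxt]

def predict_proba(weights, feats):
    """Maxent model: softmax over summed feature weights per label."""
    scores = [sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(sentences, epochs=200, lr=0.5):
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for words, labels in sentences:
            for i, gold in enumerate(labels):
                feats = features(words, i)
                probs = predict_proba(weights, feats)
                for y, p in zip(LABELS, probs):
                    # Gradient: observed minus expected feature count.
                    grad = (1.0 if y == gold else 0.0) - p
                    for f in feats:
                        weights[(f, y)] = weights.get((f, y), 0.0) + lr * grad
    return weights

def tag(weights, words):
    """Pick the most probable label for the slot after each word."""
    tags = []
    for i in range(len(words)):
        probs = predict_proba(weights, features(words, i))
        tags.append(LABELS[probs.index(max(probs))])
    return tags

# Toy training data, invented for this sketch (not from the paper).
TRAIN = [
    ("good evening this is the news".split(),
     ["NONE", "COMMA", "NONE", "NONE", "NONE", "FULL_STOP"]),
    ("the president arrived today".split(),
     ["NONE", "NONE", "NONE", "FULL_STOP"]),
    ("however the talks failed".split(),
     ["COMMA", "NONE", "NONE", "FULL_STOP"]),
]

weights = train(TRAIN)
print(tag(weights, "however the president arrived".split()))
```

Because each decision depends only on the current word and the next one, a classifier of this shape can label words as they arrive from the recognizer, which is one way to read the abstract's "on-the-fly" remark. A capitalization tagger could reuse the same training loop with a different label set.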
