
On Development of Consistently Punctuated Speech Corpora



Abstract

Punctuation of automatically recognized speech is important for enhancing the readability of transcripts and for aiding downstream NLP processing. This paper is concerned with issues involved in developing training and test corpora for automatic punctuation systems. Punctuation annotation in speech transcripts is difficult because there are numerous cases for which no standard punctuation rules exist. Special punctuation annotation guidelines tailored to spoken language were developed. Using these guidelines, almost 100 hours of broadcast news and conversation data in English and French have been punctuated by trained annotators. Measures of inter-annotator agreement are provided for both languages, and differences between languages and genres are analyzed and discussed, along with some of the most frequent disagreements between annotators. Overall, using the guidelines, the annotation consistency has been significantly improved.
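The abstract does not say which inter-annotator agreement measure the paper reports. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, over per-word punctuation labels from two annotators. The label set (`NONE`, `COMMA`, `PERIOD`), the function name `cohen_kappa`, and the sample annotations are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sequence of slots.

    Each slot is the position after a word; the label is the punctuation
    mark the annotator placed there (hypothetical label set).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)

    # Observed agreement: fraction of slots where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of the same eight word boundaries.
annotator_1 = ["NONE", "COMMA", "NONE", "PERIOD", "NONE", "NONE", "COMMA", "PERIOD"]
annotator_2 = ["NONE", "NONE", "NONE", "PERIOD", "NONE", "COMMA", "COMMA", "PERIOD"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 6 of 8 slots (observed agreement 0.75), while chance agreement is 0.375, so kappa comes out at about 0.6. Because most word boundaries in real transcripts carry no punctuation, raw percentage agreement tends to look high, which is one reason a chance-corrected measure can be more informative.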
