Conference: 9th International Conference on Language Resources and Evaluation

Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic Speech

Abstract

This paper proposes a framework for long audio alignment of conversational Arabic speech. Accurate alignments help in many speech processing tasks, such as audio indexing, acoustic model (AM) training for speech recognizers, and audio summarization and retrieval. We collected more than 1,400 hours of conversational Arabic together with the corresponding human-generated, non-aligned transcriptions. Automatic audio segmentation is performed using a split-and-merge approach. A biased language model (LM) is trained on the corresponding text after a pre-processing stage. Because non-standard Arabic dominates conversational speech, a graphemic pronunciation model (PM) is used. Alignment is performed in two passes. In the first pass, a generic standard Arabic AM is used with the biased LM and the graphemic PM for fast speech recognition. In the second pass, a more restricted LM is generated for each audio segment, and unsupervised acoustic model adaptation is applied. The recognizer output is aligned with the processed transcriptions using the Levenshtein algorithm. The approach yields an initial alignment accuracy of 97.8-99.0%, depending on the amount of disfluencies. A confidence scoring metric is proposed to accept or reject the aligner output. Using the confidence scores, the majority of misaligned segments could be rejected, raising alignment accuracy to 99.0-99.8%, depending on the speech domain and the amount of disfluencies.
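The Levenshtein alignment step mentioned in the abstract can be illustrated with a minimal word-level sketch. It assumes unit costs for substitution, insertion, and deletion, which the abstract does not specify. The names `align_words` and `match_ratio` are hypothetical, and `match_ratio` is only a simple stand-in for an accept/reject score, not the confidence metric proposed in the paper.

```python
from typing import List, Optional, Tuple

# One aligned position: (transcript word, recognizer word); None marks a gap.
Pair = Tuple[Optional[str], Optional[str]]


def align_words(ref: List[str], hyp: List[str]) -> List[Pair]:
    """Align two word sequences with Levenshtein dynamic programming.

    Assumption: substitution, insertion, and deletion all cost 1.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # transcript word not recognized
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # extra recognizer word
            j -= 1
    pairs.reverse()
    return pairs


def match_ratio(pairs: List[Pair]) -> float:
    """Hypothetical score: fraction of aligned positions with identical words."""
    if not pairs:
        return 0.0
    matches = sum(1 for r, h in pairs if r is not None and r == h)
    return matches / len(pairs)


if __name__ == "__main__":
    transcript = ["w1", "w2", "w3", "w4"]  # placeholder transcript tokens
    recognized = ["w1", "w3", "w4"]        # placeholder recognizer output
    aligned = align_words(transcript, recognized)
    print(aligned)
    print(match_ratio(aligned))
```

A segment could then be accepted when its score exceeds a chosen threshold and rejected otherwise. The threshold, like the score itself, is an assumption made for illustration.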
