Journal: Information

The Usefulness of Imperfect Speech Data for ASR Development in Low-Resource Languages

Abstract

When the National Centre for Human Language Technology (NCHLT) Speech corpus was released, it created various opportunities for speech technology development in the 11 official, but critically under-resourced, languages of South Africa. Since then, the substantial improvements in acoustic modeling that deep architectures have achieved for well-resourced languages have ushered in a new data requirement: developing such models requires hundreds of hours of speech. A suitable strategy for enlarging the speech resources of the South African languages is therefore required. The first possibility is to look for data that has already been collected but has not been included in an existing corpus. During the NCHLT project, more data was collected than was released: the official corpus contains only a curated, but limited, subset of it. In this paper, we first analyze the additional resources that could be harvested from the auxiliary NCHLT data. We also measure the effect of this data on acoustic modeling. The analysis incorporates recent factorized time-delay neural networks (TDNN-F). These models significantly reduce phone error rates for all languages. In addition, data augmentation and cross-corpus validation experiments for a number of the datasets illustrate the utility of the auxiliary NCHLT data.
