Computer Speech and Language

Synthesised bigrams using word embeddings for code-switched ASR of four South African language pairs



Abstract

Code-switching is the phenomenon whereby multilingual speakers spontaneously alternate between two or more languages during discourse, and it is widespread in multilingual societies. Current state-of-the-art automatic speech recognition (ASR) systems are optimised for monolingual speech, but performance degrades severely when they are presented with multiple languages. We address ASR of speech containing switches between English and four South African Bantu languages. No comparable study on code-switched speech for these languages has been conducted before, and consequently no directly applicable benchmarks exist. A new and unique corpus containing 14.3 hours of spontaneous speech extracted from South African soap operas was used to perform our study. The varied nature of the code-switching in this data presents many challenges to ASR. We focus specifically on how the language model can be improved to better model bilingual language switches for English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. Code-switching examples in the corpus transcriptions were extremely sparse, with the majority of code-switched bigrams occurring only once. Furthermore, differences in language typology between English and the Bantu languages, and among the Bantu languages themselves, contribute further challenges. We propose a new method that uses word embeddings trained on out-of-domain, monolingual text to synthesise artificial bilingual code-switched bigrams, augmenting the sparse language modelling training data. This technique has the particular advantage of not requiring any additional training data that includes code-switching. We show that the proposed approach is able to synthesise valid code-switched bigrams not seen in the training set.
We also show that, by augmenting the training set with these bigrams, we are able to achieve notable reductions in the overall perplexity for all language pairs, and particularly substantial reductions in the perplexity calculated across a language-switch boundary (between 5% and 31%). We demonstrate that the proposed approach is able to reduce the unseen code-switched bigram types in the test sets by up to 20.5%. Finally, we show that the augmented language models achieve reductions in the word error rate for three of the four language pairs considered. The gains were larger for language pairs with disjunctive orthography than for those with conjunctive orthography. We conclude that augmenting language model training data with code-switched bigrams synthesised using word embeddings trained on out-of-domain monolingual text is a viable means of improving the performance of ASR for code-switched speech. (C) 2018 Elsevier Ltd. All rights reserved.
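The core idea described in the abstract can be illustrated with a minimal sketch: starting from a code-switched bigram observed in the sparse training data, pair its English half with the nearest neighbours of its Bantu half in a monolingual embedding space to propose new, unseen bigrams. The embedding values, word list, and similarity threshold below are purely illustrative assumptions, not the paper's actual data or hyperparameters.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy monolingual isiZulu embeddings (hypothetical 3-d vectors for illustration;
# the paper trains embeddings on out-of-domain monolingual text).
embeddings = {
    "umfundi": np.array([0.90, 0.10, 0.20]),  # "student"
    "uthisha": np.array([0.85, 0.15, 0.25]),  # "teacher"
    "imoto":   np.array([0.10, 0.90, 0.30]),  # "car"
}

def synthesise_bigrams(observed_bigrams, embeddings, k=1, threshold=0.9):
    """For each observed code-switched bigram (english_word, bantu_word),
    propose new bigrams by pairing english_word with the k most similar
    words to bantu_word in the monolingual embedding space."""
    synthetic = set()
    for eng, bantu in observed_bigrams:
        if bantu not in embeddings:
            continue
        neighbours = sorted(
            ((other, cosine(embeddings[bantu], vec))
             for other, vec in embeddings.items() if other != bantu),
            key=lambda pair: -pair[1])
        for other, sim in neighbours[:k]:
            if sim >= threshold:
                synthetic.add((eng, other))
    return synthetic

# A code-switched bigram seen only once in the training transcriptions.
observed = {("the", "umfundi")}
new_bigrams = synthesise_bigrams(observed, embeddings)
# "uthisha" lies close to "umfundi" in this toy space, so the unseen
# bigram ("the", "uthisha") is synthesised; "imoto" is too distant.
```

The synthesised bigrams would then be added to the language model training counts, which is how the paper reduces perplexity across switch boundaries without requiring any additional code-switched training data.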
机译:代码转换是一种现象,在对话过程中,使用多种语言的人会自发地在一种以上的语言之间交替,并在多种语言的社会中广泛存在。当前最先进的自动语音识别(ASR)系统已针对单语种语音进行了优化,但是当使用多种语言呈现时,性能会严重下降。我们解决了语音ASR,其中包含英语和四种南非班图语之间的切换。之前尚未针对这些语言的代码转换语音进行过可比的研究,因此没有直接适用的基准。我们使用了一个新的独特语料库,该语料库包含从南非肥皂剧中提取的14.3小时的自发语音。此数据中代码转换的不同性质给ASR带来了许多挑战。我们特别关注如何改进语言模型以更好地为英语-isiZulu,英语-isiXhosa,英语-Setswana和英语-塞索托语的双语语言转换建模。语料库转录中的代码转换示例极为稀疏,大多数代码转换的二元组仅发生一次。此外,英语和班图语之间以及班图语本身之间的语言类型差异也带来了进一步的挑战。我们提出了一种使用在文本数据上训练的单词嵌入的新方法,该方法在域外和单语言两种情况下用于合成人工双语代码转换的双字母组,以增强稀疏语言建模训练数据。该技术的特殊优势是不需要任何其他包含代码切换的训练数据。我们表明,所提出的方法能够综合训练集中未发现的有效代码转换二元模型。我们还表明,通过使用这些二元组增加训练集,我们能够显着降低所有语言对的总体困惑度,尤其是在跨语言切换边界的情况下计算得出的困惑度显着降低(5%至31%)。我们证明了所提出的方法能够将测试集中看不见的代码转换双字组类型减少多达20.5%。最后,我们表明,增强语言模型可降低所考虑的四种语言对中三对的单词错误率。具有正交拼写法的语言对的收益要大于具有正交拼写法的语言对的收益。我们得出的结论是,使用在域外单语文本上训练的词嵌入合成的代码转换二元语言对语言模型训练数据进行增强,是提高代码转换语音ASR性能的可行方法。 (C)2018 Elsevier Ltd.保留所有权利。
