Journal: Procedia Computer Science

Building a First Language Model for Code-switch Arabic-English

Abstract

The use of mixed languages in daily conversations, referred to as “code-switching”, has become a common linguistic phenomenon among bilingual and multilingual communities. Code-switching involves the alternating use of distinct languages or “codes” at sentence boundaries or within the same sentence. With the rise of globalization, code-switching has become prevalent in daily conversations, especially among urban youth. This has led to an increasing demand for automatic speech recognition systems that can handle such mixed speech. In this paper, we present the first steps towards building a multilingual language model (LM) for code-switched Arabic-English. One of the main challenges in building a multilingual LM is the need for an explicit mixed-language text corpus. Since code-switching is more common in spoken than in written language, text corpora containing code-switching are usually scarce. Therefore, the first aim of this paper is to introduce a code-switch Arabic-English text corpus collected by automatically downloading relevant documents from the web. The text is then extracted from the documents and processed so that it is usable for NLP tasks. For language modeling, a baseline LM was built from existing monolingual corpora. The baseline LM gave a perplexity of 11841.9 and an Out-of-Vocabulary (OOV) rate of 4.07%. The gathered code-switch Arabic-English corpus, along with the existing monolingual corpora, was then used to construct several LMs. The best LM improved substantially on the baseline LM, with a perplexity of 275.41 and an OOV rate of 0.71%.
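
The two metrics reported above, perplexity and OOV rate, can be illustrated with a minimal sketch. It assumes a unigram model with add-one smoothing and a single unknown-word bucket, which is a stand-in chosen for brevity and not the model described in the paper; the function names `oov_rate` and `unigram_perplexity` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model (illustrative only).

    All unseen words share one extra 'unknown' bucket, so the smoothed
    probabilities over the vocabulary plus that bucket sum to 1.
    """
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # +1 for the unknown-word bucket
    total = len(train_tokens)
    log_prob = 0.0
    for tok in test_tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-log_prob / len(test_tokens))


if __name__ == "__main__":
    # Toy code-switched sentences (hypothetical data, not from the corpus).
    train = "ana going to the جامعة today ana in the مكتبة".split()
    test = "ana going to the مكتبة tomorrow".split()
    print("OOV rate:", oov_rate(train, test))
    print("Perplexity:", unigram_perplexity(train, test))
```

Lower perplexity means the model assigns higher probability to the test text, and a lower OOV rate means fewer test words are missing from the training vocabulary. This is why adding code-switched text to the training data can move both numbers in the direction the abstract reports.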