Conference: International Conference on Recent Advances in Natural Language Processing

Translating Dialectal Arabic as Low Resource Language using Word Embedding


Abstract

A number of machine translation methods have been proposed in recent years to deal with the increasingly important problem of automatic translation between different languages, or between a language and its dialects. These methods have produced promising results when applied to some of the widely studied languages. Existing translation methods are mainly implemented using rule-based and statistical machine translation approaches. Rule-based approaches use translation rules that are either constructed by an expert, which is quite difficult when dealing with dialects, or derived by rule construction algorithms, which require very large parallel datasets. Statistical approaches also require large parallel datasets to build the translation models. However, large parallel datasets do not exist for low-resource languages such as Arabic and its dialects. In this paper we propose an algorithm that attempts to overcome this limitation, and apply it to translate the Egyptian dialect (EGY) to Modern Standard Arabic (MSA). Monolingual corpora were collected for both MSA and EGY, and a relatively small parallel language-pair set was built to train the models. The proposed method uses word embeddings because they can be trained on monolingual data rather than a parallel corpus. Both Continuous Bag of Words (CBOW) and Skip-gram models were used to build the word vectors. The proposed method was validated on four different datasets using four-fold cross-validation.
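The abstract names CBOW and Skip-gram but does not describe how they were configured. As a rough illustration only, the sketch below shows how the two models frame their training examples from the same monolingual token sequence: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context. The function names, the window size, and the placeholder tokens are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return tokens[lo:i] + tokens[i + 1:hi]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram framing: one (center, context_word) example per context word."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW framing: one (context_words, center) example per position."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


if __name__ == "__main__":
    # Placeholder tokens standing in for one sentence of a monolingual corpus.
    sentence = ["w1", "w2", "w3", "w4"]
    for center, ctx in skipgram_pairs(sentence, window=1):
        print("skip-gram:", center, "->", ctx)
    for ctx_words, center in cbow_examples(sentence, window=1):
        print("cbow:", ctx_words, "->", center)
```

Neither framing needs aligned sentence pairs, which is why embeddings of this kind can be built from monolingual EGY and MSA text alone; how the two embedding spaces are then related for translation is not specified in the abstract and is not shown here.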
