Language Resources and Evaluation

Exploring the sawa corpus: collection and deployment of a parallel corpus English-Swahili


Abstract

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the sawa corpus, a two-million-word English-Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We focus in particular on the difficulties of translating English into Swahili, a morphologically more complex Bantu language.
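
The abstract does not say how the part-of-speech projection is carried out, so the following is only a minimal sketch of the general idea of annotation projection across word alignments, not the authors' method. It assumes each Swahili token takes the most frequent tag among the English tokens aligned to it; the function name `project_pos_tags`, the `UNK` fallback tag, and the example sentence pair and alignment are all hypothetical.

```python
from collections import Counter


def project_pos_tags(english_tags, alignment, swahili_length, default_tag="UNK"):
    """Project POS tags from English tokens onto aligned Swahili tokens.

    english_tags   -- list of tags, one per English token
    alignment      -- iterable of (english_index, swahili_index) pairs
    swahili_length -- number of tokens in the Swahili sentence
    default_tag    -- tag given to Swahili tokens with no aligned English token
    """
    # One vote counter per Swahili token.
    votes = [Counter() for _ in range(swahili_length)]
    for en_i, sw_i in alignment:
        votes[sw_i][english_tags[en_i]] += 1

    projected = []
    for counter in votes:
        if counter:
            # Majority vote; ties fall to the tag that was seen first.
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(default_tag)
    return projected


# Hypothetical example: "The child reads a book" / "Mtoto anasoma kitabu".
english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
# Only content words are aligned here; the English articles stay unaligned.
alignment = [(1, 0), (2, 1), (4, 2)]
swahili_tags = project_pos_tags(english_tags, alignment, swahili_length=3)
print(swahili_tags)
```

A single Swahili verb form often corresponds to several English words (subject pronoun, auxiliary, verb), so a simple majority vote can pick the wrong tag for such tokens. This is one concrete way the morphological mismatch mentioned in the abstract shows up in projection.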
