International Conference on Language Resources and Evaluation

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset

Abstract

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native-script Wikipedia text; 2) a romanization lexicon; and 3) full-sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language, collection of attested romanizations for sampled lexicons, and manual romanization of held-out sentences from the native-script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single-word transliteration, full-sentence transliteration, and language modeling of native-script and romanized text.