
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset


Abstract

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.


Bibliographic details

Source
International Conference on Language Resources and Evaluation | 2020 | pp. 2413-2423 | 11 pages
Authors
Brian Roark; Lawrence Wolf-Sonkin; Christo Kirov; Sabrina J. Mielke; Cibu Johny; Isin Demirsahin; Keith Hall
Original format: PDF
Keywords
romanization; transliteration; South Asian languages


