Venue: Workshop on Computational Approaches to Linguistic Code-Switching

Transliteration for Low-Resource Code-Switching Texts: Building an Automatic Cyrillic-to-Latin Converter for Tatar


Abstract

We introduce a Cyrillic-to-Latin transliterator for the Tatar language based on subword-level language identification. Transliteration is challenging for two reasons. First, because modern Tatar texts often contain intra-word code-switching to Russian, a different set of transliteration rules must be applied to each morpheme depending on its language, which necessitates morpheme-level language identification. Second, Tatar is a low-resource language with most of its texts written in Cyrillic, which makes it difficult to prepare a sufficient dataset. Given this situation, we propose a transliteration method based on subword-level language identification. We train a language classifier on monolingual Tatar and Russian texts and apply different transliteration rules according to the identified language. The results demonstrate that our proposed method outperforms other Tatar transliteration tools and suggest that it correctly transcribes Russian loanwords to some extent.
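The pipeline described in the abstract (identify the language of each subword, then apply that language's rules) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the rule tables are tiny toy mappings, the classifier is a character-unigram model with add-one smoothing trained on six sample words, the names RULES, make_identifier, and transliterate are hypothetical, and the input is assumed to be already segmented into subwords.

```python
import math
from collections import Counter

# Toy rule tables (illustrative only, NOT the paper's rule sets).
# The two languages share most mappings but differ on a few letters.
_BASE = {"а": "a", "г": "g", "о": "o", "н": "n", "р": "r", "л": "l"}
RULES = {
    "tt": {**_BASE, "в": "w", "ы": "ı"},  # assumed Tatar mappings
    "ru": {**_BASE, "в": "v", "ы": "y"},  # assumed Russian mappings
}


def make_identifier(samples_by_lang):
    """Build a toy character-unigram language classifier.

    samples_by_lang maps a language code to a list of monolingual words.
    Add-one smoothing keeps unseen characters from yielding log(0).
    """
    counts = {lang: Counter("".join(words))
              for lang, words in samples_by_lang.items()}
    vocab_size = len(set().union(*counts.values()))

    def identify(subword):
        def score(lang):
            total = sum(counts[lang].values())
            return sum(
                math.log((counts[lang][ch] + 1) / (total + vocab_size))
                for ch in subword
            )
        # Pick the language with the highest smoothed log-likelihood.
        return max(counts, key=score)

    return identify


def transliterate(subwords, identify):
    """Transliterate pre-segmented subwords of one word.

    Each subword is converted with the rule table of the language that
    `identify` returns; characters missing from a table pass through.
    """
    pieces = []
    for subword in subwords:
        table = RULES[identify(subword)]
        pieces.append("".join(table.get(ch, ch) for ch in subword))
    return "".join(pieces)


# Tiny hypothetical training data; a real system would use large corpora.
identify = make_identifier({
    "tt": ["китап", "мәктәп", "күл"],
    "ru": ["книга", "школа", "озеро"],
})
```

Passing the classifier as a callable keeps the rule application independent of how language identification is done, so the toy unigram model could be swapped for a trained classifier without touching transliterate. With a hand-written stub that labels "вагон" as Russian and "нар" as Tatar, the word "вагоннар" comes out as "vagonnar", while a Tatar-labelled "авыл" comes out as "awıl".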
