Workshop on Computational Approaches to Linguistic Code-Switching

Normalization and Back-Transliteration for Code-Switched Data

Abstract

Code-switching is an omnipresent phenomenon in multilingual communities all around the world, but it remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Roman script, which prevents the use of monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research in this area.
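
To make the terms "normalize" and "back-transliterate" concrete, here is a minimal sketch of the general idea. It is not the method proposed in the paper, whose details the abstract does not give. The lexicon entries and the function names `normalize` and `back_transliterate` are hypothetical, chosen only for illustration. The sketch assumes that normalization means lowercasing and collapsing elongated spellings, and that back-transliteration can be approximated by a small lookup table.

```python
import re

# Hypothetical toy lexicon: normalized romanized Hindi -> Devanagari.
# Illustrative entries only; not taken from the paper or its dataset.
ROMAN_TO_DEVANAGARI = {
    "bahut": "बहुत",
    "accha": "अच्छा",
    "tha": "था",
    "nahi": "नहीं",
}


def normalize(token: str) -> str:
    """Lowercase and collapse runs of three or more identical characters."""
    return re.sub(r"(.)\1{2,}", r"\1", token.lower())


def back_transliterate(sentence: str) -> str:
    """Replace romanized Hindi tokens found in the lexicon with Devanagari.

    Tokens that are not in the lexicon (assumed here to be English)
    are left exactly as written.
    """
    out = []
    for token in sentence.split():
        out.append(ROMAN_TO_DEVANAGARI.get(normalize(token), token))
    return " ".join(out)


print(back_transliterate("movie bahuuut accha tha"))
```

A real system would need a learned transliteration model and language identification rather than a fixed table. The lookup only shows where each step sits in the pipeline.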