Published in: ACM Transactions on Asian and Low-Resource Language Information Processing

Transliteration of Arabizi into Arabic Script for Tunisian Dialect


Abstract

The evolution of information and communication technology has markedly influenced communication between correspondents. It has facilitated the transmission of information and engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, a practice also called Arabizi. Moreover, the language used in social media and SMS messaging is characterized by informal and non-standard vocabulary, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Because the Tunisian dialect lacks the basic tools and linguistic resources available for Modern Standard Arabic, we use these written sources as a starting point for building large corpora automatically. In the context of natural language processing, and to benefit from the data of these networks, transliterating from Arabizi to Arabic script is a necessary step because most of the recently available tools for processing the Tunisian dialect expect Arabic script input. Indeed, the transliteration task can help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting Tunisian dialect text written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. We propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect: a rule-based model and a discriminative model that treats transliteration as a sequence classification task based on conditional random fields. In the first model, we use a set of transliteration rules to convert Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed at both the word and character levels. Our models achieve a character error rate of 10.47%.
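
As a rough illustration of the rule-based idea and of the character error rate metric, the following Python sketch may help. It is not the article's method: the rule table `RULES` is a tiny hypothetical sample of widely used Arabizi conventions (for example `3` for ain and `ch` for sheen), the helper names `normalize`, `transliterate`, and `cer` are assumptions made here, and the output is not claimed to follow the Conventional Orthography for Dialectal Arabic. Character error rate is computed as the Levenshtein distance divided by the reference length.

```python
import re

# Hypothetical toy rule table (NOT the rule set from the article).
# Arabic letters are given as escapes; the comment names each letter.
RULES = {
    "ch": "\u0634",  # sheen
    "kh": "\u062E",  # khah
    "gh": "\u063A",  # ghain
    "3": "\u0639",   # ain
    "7": "\u062D",   # hah
    "9": "\u0642",   # qaf
    "b": "\u0628",   # beh
    "r": "\u0631",   # reh
    "a": "\u0627",   # alef
}
MAX_KEY = max(len(key) for key in RULES)


def normalize(word: str) -> str:
    """Lowercase and collapse runs of 3+ identical characters (emphasis)."""
    return re.sub(r"(.)\1{2,}", r"\1", word.lower())


def transliterate(word: str) -> str:
    """Greedy longest-match rewrite; unknown characters pass through."""
    word = normalize(word)
    out = []
    i = 0
    while i < len(word):
        for size in range(MAX_KEY, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched at this position: keep the character as-is.
            out.append(word[i])
            i += 1
    return "".join(out)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```

In this sketch, multi-character rules such as `ch` are tried before single characters, so a digraph is not split into two unrelated letters. A real system would need a far larger rule set plus handling of vowels and context, which is the kind of ambiguity the discriminative model in the abstract is meant to address.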
