Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus

Abstract

This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and non-standard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons. This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic phonologically, morphologically and lexically, and most importantly, they lack standard orthographies. Second, Arabic speakers in social media as well as discussion forums, SMS messaging and online chat often use a non-standard romanization called Arabizi. In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing expect Arabic script input. The corpus described in this paper is expected to support Arabic NLP by providing this resource.

Bibliographic details

  • Source
    2014 | pp. 93-103 | 11 pages
  • Conference location: Doha (QA)
  • Author affiliations

    Linguistic Data Consortium, University of Pennsylvania;

    Linguistic Data Consortium, University of Pennsylvania;

    Linguistic Data Consortium, University of Pennsylvania;

    Linguistic Data Consortium, University of Pennsylvania;

    Linguistic Data Consortium, University of Pennsylvania;

    Linguistic Data Consortium, University of Pennsylvania;

    Linguistic Data Consortium, University of Pennsylvania;

    Computer Science Department, New York University Abu Dhabi;

    Center for Computational Learning Systems, Columbia University;

    Center for Computational Learning Systems, Columbia University;

  • Conference organizer
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords
