Published in: International Conference on Language Resources and Evaluation

Annotating and Learning Morphological Segmentation of Egyptian Colloquial Arabic


Abstract

We present an annotation and morphological segmentation scheme for Egyptian Colloquial Arabic (ECA), with which we annotate user-generated content that deviates significantly from the orthographic and grammatical rules of Modern Standard Arabic (MSA) and thus cannot be processed by the commonly used MSA tools. A per-letter classification scheme, in which each letter is classified as either a segment boundary or not, combined with a memory-based classifier that uses only word-internal context, proves effective and achieves 92% exact-match accuracy at the word level. The well-known MADA system achieves 81%, while the per-letter classification scheme using the ATB achieves 82%. Error analysis shows that the major problem is character ambiguity: ECA orthography overloads characters that would otherwise be more specific in MSA. For example, the distinctions between y (ي) and Y (ى), and among A (ا), < (إ), and > (أ), are collapsed to y (ي) and A (ا) respectively, or the characters are even totally confused and used interchangeably. While normalization helps alleviate orthographic inconsistencies, it aggravates the problem of ambiguity.
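The per-letter scheme described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' setup: the toy training words are invented Buckwalter-style strings, the names `letter_instances`, `segments_to_boundaries`, `MemoryBasedClassifier`, and `segment` are hypothetical, and a plain nearest-neighbour lookup with an overlap distance stands in for the memory-based learner.

```python
from collections import Counter


def segments_to_boundaries(segments):
    """Map ['w', 'ktb', 'hA'] to the letter indices that are followed by a boundary."""
    idx, out = -1, set()
    for seg in segments[:-1]:
        idx += len(seg)
        out.add(idx)
    return out


def letter_instances(word, boundaries, window=2):
    """Yield (features, label) per letter, using only word-internal context.

    Features are the letter plus `window` letters on each side ('_' is padding).
    The label is 1 if a segment boundary follows the letter, otherwise 0.
    """
    padded = "_" * window + word + "_" * window
    for i in range(len(word)):
        yield tuple(padded[i : i + 2 * window + 1]), int(i in boundaries)


class MemoryBasedClassifier:
    """Toy k-nearest-neighbour learner: stores all instances, compares by overlap."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances):
        self.memory = list(instances)
        return self

    def predict(self, feats):
        # Overlap distance: number of feature positions that differ.
        scored = sorted(
            ((sum(a != b for a, b in zip(feats, f)), label) for f, label in self.memory),
            key=lambda pair: pair[0],
        )
        votes = Counter(label for _, label in scored[: self.k])
        return votes.most_common(1)[0][0]


def segment(word, clf, window=2):
    """Split a word wherever the classifier predicts a boundary after a letter."""
    out, current = [], ""
    for i, (feats, _) in enumerate(letter_instances(word, set(), window)):
        current += word[i]
        if i < len(word) - 1 and clf.predict(feats) == 1:
            out.append(current)
            current = ""
    out.append(current)
    return out


# Invented toy training data (assumption): gold segmentations of a few word forms.
gold = [["w", "ktb", "hA"], ["ktb", "hA"], ["w", "ktb"], ["ktb"]]
train = [
    inst
    for segs in gold
    for inst in letter_instances("".join(segs), segments_to_boundaries(segs))
]
clf = MemoryBasedClassifier(k=1).fit(train)

print(segment("wktbhA", clf))
print(segment("wktbh", clf))  # a form not in the toy training set
```

Word-level exact match, the metric quoted in the abstract, would count a word as correct only if every one of its per-letter decisions is correct, which is why a single ambiguous character can cost a whole word.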
