Source: ACM Transactions on Asian Language Information Processing

Automatic Diacritics Restoration for Tunisian Dialect

Abstract

Modern Standard Arabic and the Arabic dialects are usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools, because writing Arabic without diacritics introduces several types of ambiguity. First, a word without diacritics could have many possible meanings depending on its diacritization. Second, the undiacritized surface form of an Arabic word might have as many as 200 readings, depending on the complexity of its morphology [12]. In fact, the agglutinative nature of Arabic might produce ambiguity that can only be resolved using diacritics. Third, without diacritics a word could have many possible parts of speech (POS) instead of one. This is the case for words that have the same spelling and POS tag but a different lexical sense, and for words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this article, we present the first work to investigate the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. We then propose two models: a statistical machine translation (SMT) model, and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRF). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system achieved the better result, with a Word Error Rate (WER) of 21.44% compared with 34.6% for SMT.
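To make the character-level formulation concrete, the following Python sketch shows one common way to turn a diacritized word into (letter, diacritic label) pairs, the kind of input a sequence classifier such as a CRF is trained on, together with a simple word-level error rate. It is an illustration only, not the authors' implementation. The abstract does not specify the label scheme, the feature set, or the exact WER definition. The function names (`split_labels`, `strip_diacritics`, `word_error_rate`), the chosen diacritic set, and the assumption that WER counts words whose diacritized form differs from the reference are all hypothetical.

```python
# Assumed diacritic inventory: Arabic tashkeel U+064B..U+0652
# (fathatan through sukun) plus superscript alef U+0670.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_labels(word):
    """Split a diacritized word into base letters and per-letter labels.

    Each label is the (possibly empty) string of diacritics that
    follows the corresponding letter, e.g. shadda + fatha.
    """
    letters, labels = [], []
    for ch in word:
        if ch in DIACRITICS:
            if letters:              # ignore a stray leading diacritic
                labels[-1] += ch
        else:
            letters.append(ch)
            labels.append("")
    return letters, labels


def strip_diacritics(text):
    """Remove all diacritics, producing the undiacritized input form."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def word_error_rate(reference, hypothesis):
    """Fraction of words whose diacritized form differs from the reference.

    Assumes both strings contain the same number of whitespace-separated
    words, which holds when only diacritics are restored.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must align word by word")
    if not ref_words:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return errors / len(ref_words)
```

Under these assumptions, a classifier would receive the output of `strip_diacritics` as its observation sequence and predict one label per letter. Per-letter features such as the POS tag of the containing word could be attached at that point.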