Diacritic-Based Matching of Arabic Words


Abstract

Words in Arabic consist of letters and short vowel symbols called diacritics inscribed atop regular letters. Changing diacritics may change the syntax and semantics of a word, turning it into another. This results in difficulties when comparing words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to resolve ambiguity originating from this and other challenges. In this article, we introduce three alternative algorithms to compare two words with possibly different diacritics. We propose the Subsume knowledge-based algorithm, the Imply rule-based algorithm, and the Alike machine-learning-based algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows accuracies of 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms in real-life use cases: lemma disambiguation and linking hundreds of Arabic dictionaries.
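To make the matching problem concrete, the following is a minimal sketch of a diacritic-aware comparison. It is not the Subsume, Imply, or Alike algorithm, whose details the abstract does not give. It assumes a simplified rule chosen for illustration: two words are compatible if their base letters are identical and no letter carries different diacritics in the two words. The function names `split_letters`, `strip_diacritics`, and `compatible` are hypothetical.

```python
# Assumption: only the combining marks U+064B..U+0652 (tanwin, fatha, damma,
# kasra, shadda, sukun) are treated as diacritics in this sketch.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the word with all diacritics removed."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def split_letters(word):
    """Split a word into (letter, set of diacritics on that letter) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:  # a diacritic with no preceding letter is ignored
                letter, marks = units[-1]
                units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def compatible(a, b):
    """Illustrative rule: same letters, and no letter is marked differently.

    An unmarked letter is treated as compatible with any marking.
    """
    units_a, units_b = split_letters(a), split_letters(b)
    if len(units_a) != len(units_b):
        return False
    for (letter_a, marks_a), (letter_b, marks_b) in zip(units_a, units_b):
        if letter_a != letter_b:
            return False
        if marks_a and marks_b and marks_a != marks_b:
            return False
    return True


# kaf-ta-ba with three vocalizations, written with escapes for clarity.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # fully diacritized "kataba"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # fully diacritized "kutiba"
bare = "\u0643\u062A\u0628"                      # no diacritics

print(strip_diacritics(kataba) == bare)  # plain string match after stripping
print(compatible(kataba, bare))          # undiacritized form matches
print(compatible(kataba, kutiba))        # conflicting vowels do not match
```

Stripping diacritics before comparison would treat "kataba" and "kutiba" as the same word, which is the ambiguity the abstract describes. The position-wise check above keeps them apart while still accepting partially diacritized forms.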
