Diacritic-Based Matching of Arabic Words

Jarrar Mustafa; Zaraket Fadi; Asia Rami; Amayreh Hamzeh

Abstract

Words in Arabic consist of letters and short vowel symbols called diacritics inscribed atop regular letters. Changing the diacritics may change the syntax and semantics of a word, turning it into another word. This makes it difficult to compare words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to battle ambiguity originating from this and other challenges. In this article, we introduce three alternative algorithms to compare two words with possibly different diacritics. We propose the Subsume knowledge-based algorithm, the Imply rule-based algorithm, and the Alike machine-learning-based algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows accuracies of 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms in a real-life use case: lemma disambiguation and the linking of hundreds of Arabic dictionaries.
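
To illustrate why plain string matching falls short here, the following is a minimal Python sketch. It is not the Subsume, Imply, or Alike algorithm from the article, whose details the abstract does not give. The function names (`strip_diacritics`, `split_letters`, `compatible`) and the compatibility rule itself are assumptions made for illustration: two words are treated as compatible when they share the same letters and never carry different diacritics on the same letter.

```python
import unicodedata

# Arabic short-vowel marks and related signs (fathatan .. sukun), U+064B..U+0652.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the word with all Arabic diacritics removed."""
    return "".join(ch for ch in word if ch not in ARABIC_DIACRITICS)


def split_letters(word):
    """Return a list of (letter, set_of_diacritics) pairs for the word."""
    pairs = []
    for ch in unicodedata.normalize("NFC", word):
        if ch in ARABIC_DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks | {ch})
        else:
            pairs.append((ch, frozenset()))
    return pairs


def compatible(w1, w2):
    """Illustrative rule (an assumption, not the article's method):
    same letters, and no letter carries conflicting diacritics."""
    a, b = split_letters(w1), split_letters(w2)
    if [letter for letter, _ in a] != [letter for letter, _ in b]:
        return False
    return all(not m1 or not m2 or m1 == m2
               for (_, m1), (_, m2) in zip(a, b))


# Example words built from code points: kaf, teh, beh with fatha or damma.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # fully diacritized, "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # diacritized, "books"
bare = "\u0643\u062A\u0628"                      # no diacritics

print(kataba == kutub)                                      # False
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # True
print(compatible(kataba, bare))                             # True
print(compatible(kataba, kutub))                            # False
```

Exact comparison rejects every pair that differs in diacritization, while stripping diacritics conflates distinct words. The per-letter check sits between the two, but it is deliberately simplistic: for instance, it treats a letter marked only with shadda as conflicting with the same letter marked with shadda plus a vowel.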

Bibliographic details

Source: ACM transactions on Asian language information processing, 2019, Issue 2, pp. 10.1-10.21 (21 pages)
Authors: Jarrar Mustafa; Zaraket Fadi; Asia Rami; Amayreh Hamzeh
Affiliations:

Birzeit Univ Comp Sci Dept 1 Almarj St Ramallah 627 West Bank Palestine;

Amer Univ Beirut 1107 Riad El Solh St Beirut 2020 Lebanon

Original format: PDF
Language: English
Keywords: Arabic; diacritics; disambiguation
