Journal: Journal of computer sciences

ARABIC PART OF SPEECH TAGGING USING K-NEAREST NEIGHBOUR AND NAIVE BAYES CLASSIFIERS COMBINATION | Science Publications


Abstract

Part-of-speech (POS) tagging is an important preprocessing step in many natural language processing applications, such as text summarization, question answering and information retrieval. It is the process of assigning every word in a given context its appropriate part of speech. Many POS tagging techniques have been developed and evaluated in the literature. Some POS tagging models are known to perform poorly on Quranic Arabic because of the complexity of the text, which poses several challenges for POS tagging: high ambiguity, data sparseness and a large number of unknown words. With this in mind, the main problem is to determine how existing, efficient methods perform on Arabic and how the Quranic corpus can be used to produce an efficient framework for Arabic POS tagging. We propose an experimental classifier-combination framework for an Arabic POS tagger, built by selecting the two best diverse probabilistic classifiers used in numerous works on non-Arabic languages, namely K-Nearest Neighbour (KNN) and Naive Bayes (NB). Majority voting is used as the combination strategy to exploit the advantages of both classifiers. In addition, an in-depth study was conducted on a large list of features to identify the effective ones and to investigate their role in improving the performance of POS taggers for Quranic Arabic. The study thus aims to integrate different feature sets and tagging algorithms efficiently in order to produce a more accurate POS tagging procedure. The data used in this study is the Arabic Quranic Corpus, an annotated linguistic resource consisting of 77,430 words, with Arabic grammar, syntax and morphology for each word in the Holy Quran. The highest accuracy achieved is 98.32%, which may be a significant improvement over the state of the art for Arabic Quranic text. The most effective features, which yield this accuracy, are a combination of w0 (the current word), p0 (the POS of the current word), p-3 (the POS of the word three positions back), p-2 (the POS of the word two positions back) and p-1 (the POS of the preceding word).
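The combination scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy sentences, the tag set (P, N, V), the overlap distance used by the KNN, the Laplace smoothing in the Naive Bayes model, and the tie-breaking rule are all assumptions made for illustration. The feature p0 is left out because in this sketch it is the label being predicted; the abstract does not say how the paper uses it.

```python
from collections import Counter, defaultdict
import math

def extract_features(words, tags, i):
    """Features for position i: w0, p-1, p-2, p-3 ("<s>" pads the sentence start)."""
    prev = lambda k: tags[i - k] if i - k >= 0 else "<s>"
    return (words[i], prev(1), prev(2), prev(3))

class OverlapKNN:
    """KNN over categorical features; distance = number of mismatching features."""
    def __init__(self, k=3):
        self.k = k
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self
    def predict(self, x):
        scored = sorted((sum(a != b for a, b in zip(x, xi)), idx)
                        for idx, xi in enumerate(self.X))
        votes = Counter(self.y[idx] for _, idx in scored[:self.k])
        return votes.most_common(1)[0][0]

class CategoricalNB:
    """Naive Bayes over categorical features with Laplace smoothing (alpha)."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
    def fit(self, X, y):
        self.class_counts = Counter(y)
        self.feat_counts = defaultdict(Counter)  # (class, position) -> value counts
        self.vocab = defaultdict(set)            # position -> values seen
        for x, c in zip(X, y):
            for j, v in enumerate(x):
                self.feat_counts[(c, j)][v] += 1
                self.vocab[j].add(v)
        return self
    def predict(self, x):
        n = sum(self.class_counts.values())
        scores = {}
        for c, cc in self.class_counts.items():
            s = math.log(cc / n)
            for j, v in enumerate(x):
                num = self.feat_counts[(c, j)][v] + self.alpha
                den = cc + self.alpha * (len(self.vocab[j]) + 1)  # +1 slot for unseen values
                s += math.log(num / den)
            scores[c] = s
        return max(scores, key=scores.get)

def majority_vote(predictions, fallback_index=0):
    """Most frequent label; on a tie, trust the classifier at fallback_index (assumption)."""
    votes = Counter(predictions)
    top, count = votes.most_common(1)[0]
    if list(votes.values()).count(count) > 1:
        return predictions[fallback_index]
    return top

def tag_sentence(words, knn, nb):
    """Greedy left-to-right tagging; earlier predictions supply p-1, p-2, p-3."""
    tags = []
    for i in range(len(words)):
        x = extract_features(words, tags, i)
        tags.append(majority_vote([nb.predict(x), knn.predict(x)]))
    return tags

# Toy training data (invented transliterations, NOT taken from the corpus).
TRAIN = [
    [("fi", "P"), ("al-bayt", "N"), ("rajul", "N")],
    [("kataba", "V"), ("al-rajul", "N"), ("fi", "P"), ("al-kitab", "N")],
    [("qaraa", "V"), ("al-walad", "N"), ("al-kitab", "N")],
]

X, y = [], []
for sent in TRAIN:
    words = [w for w, _ in sent]
    gold = [t for _, t in sent]
    for i in range(len(sent)):
        X.append(extract_features(words, gold, i))  # gold tags as context in training
        y.append(gold[i])

knn = OverlapKNN(k=3).fit(X, y)
nb = CategoricalNB().fit(X, y)
print(tag_sentence(["kataba", "al-walad", "fi", "al-bayt"], knn, nb))
```

With only two voters, a disagreement is always a tie, so the fallback rule decides every contested tag. The sketch defers to Naive Bayes in that case. A real system would need a principled rule here, for example classifier confidence or a third voter.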
