AUTOMATIC MACHINE LEARNING TECHNIQUES (AMLT) FOR ARABIC TEXT CLASSIFICATION BASED ON TERM COLLOCATIONS

FEKRY OLAYAH; WASEEM ALROMIMA

Abstract

As documents in digital format become available ever more rapidly and in ever larger numbers, retrieving information with the highest accuracy and the lowest error rate is becoming more difficult. Text Classification (TC) has become one of the key techniques for controlling and organizing documents based on their content. Keyword extraction is therefore one of the most important natural language processing applications: it extracts information from a document, such as term collocations, which are two or more words that appear together and consistently seem to be associated. In Arabic, keyword extraction faces many problems because of the complexity of Arabic orthography. Moreover, accuracy is affected by the document content and by the classification technique used. The need for automatic text classification arises from the large number of electronic documents on the web. This research proposes an Automatic Machine Learning Techniques (AMLT) approach for classifying Arabic documents by using term collocations. The collocations are mined from Arabic documents, scored using an association measure, and then used as the features for term feature selection. For this study, we used Arabic documents divided into four categories (Economy/Business, Politics, Religion, and Science). The results of our approach were compared with the full-document approach and the summary-document approach using four techniques (SVM, NB, J48, and KNN) to determine which classifier is more accurate for Arabic text based on term collocations. The evaluation results show that the proposed approach outperforms the other approaches in accuracy.
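The collocation-scoring step described above can be illustrated with a minimal sketch. The abstract does not name the association measure, so pointwise mutual information (PMI) over adjacent word pairs is assumed here purely for illustration; the function names `score_collocations` and `top_collocations`, the `min_count` threshold, and the assumption that documents are already tokenized are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def score_collocations(docs, min_count=2):
    """Score adjacent word pairs (bigrams) with PMI.

    docs: list of token lists (tokenization is assumed to be done already).
    PMI is an assumed association measure, not necessarily the paper's.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbour.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


def top_collocations(docs, k=10, min_count=2):
    """Return the k highest-scoring bigrams as candidate features."""
    scores = score_collocations(docs, min_count)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In a pipeline of the kind the abstract outlines, the pairs returned by `top_collocations` would serve as the feature vocabulary passed to a classifier such as SVM, NB, J48, or KNN; how the authors actually build and weight those features is not stated here.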

