Journal: Language Resources and Evaluation

Comparative evaluation of text classification techniques using a large diverse Arabic dataset


Abstract

A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While a lot of effort has been put into Western languages (mostly English), minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selections, weighting methods, and classification algorithms show, on average, the superiority of support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the least accurate was 61% for the Arabic Poems dataset.
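
To make the evaluation setup concrete, here is a minimal sketch of one of the classifiers named in the abstract (Naïve Bayes) together with the accuracy measure. It is not the paper's implementation. The class `NaiveBayes`, the helpers `tokenize` and `accuracy`, the label names, and the tiny corpus are all hypothetical and chosen only for illustration. The sketch uses raw term counts, so it leaves out the feature selection and weighting steps the abstract mentions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenization only. Real Arabic pipelines usually add
    # normalization and stemming, which this sketch omits.
    return text.split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(doc):
                if tok in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    # Fraction of documents whose predicted label matches the true label.
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy corpus, NOT taken from the paper's dataset.
train_docs = ["صلاة صوم زكاة", "حج صلاة دعاء", "قصيدة شعر بيت", "شعر قافية وزن"]
train_labels = ["islamic", "islamic", "poems", "poems"]
test_docs = ["صلاة دعاء", "قافية قصيدة"]
test_labels = ["islamic", "poems"]

model = NaiveBayes().fit(train_docs, train_labels)
print(accuracy(model, test_docs, test_labels))
```

Comparing classifiers as the abstract describes would mean running several models on the same train/test split and reporting this accuracy figure for each dataset.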
