Journal: Engineering Applications of Artificial Intelligence

KerMinSVM for imbalanced datasets with a case study on Arabic comics classification


Abstract

Many studies have classified large text documents using a range of classifiers, from simple distance-based classifiers such as K-Nearest-Neighbor (KNN) to more advanced classifiers such as Support Vector Machines (SVM). Traditional approaches fail on short texts because the limited number of words makes the feature vectors sparse. Another common problem in text classification is class imbalance (CI). CI occurs when one class contains most of the samples while the other class contains only a few. Standard classifiers, when applied to imbalanced data, achieve high accuracy on the majority class and low accuracy on the minority class. Motivated by the need to classify the content of Arabic comics, we propose KerMinSVM, a kernel extension of our previously proposed MinSVM, coupled with a new feature dimensionality reduction scheme based on word root frequency ratios (WRFR). KerMinSVM was tested on multiple imbalanced benchmark datasets, and the results were verified using three measures: accuracy, F-measure, and statistical analysis. WRFR was applied to a manually constructed Arabic comic text dataset to detect strong content in children's comic books. Test results revealed that the proposed framework outperformed most of the methods for imbalanced datasets and short text classification.
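The abstract's point about class imbalance, and its reason for reporting F-measure alongside accuracy, can be illustrated with a small sketch. This is not KerMinSVM or WRFR. The labels, the 95/5 class split, and the two hypothetical prediction vectors below are assumptions made up for illustration only.

```python
# Illustrative only: why accuracy alone can mislead on imbalanced data.
# The data and both "classifiers" are hypothetical, not from the paper.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def f_measure(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 95 majority samples (label 0) followed by 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5

# Hypothetical classifier A: always predicts the majority class.
pred_majority = [0] * 100

# Hypothetical classifier B: 6 false alarms on the majority class,
# but it finds 4 of the 5 minority samples.
pred_balanced = [0] * 89 + [1] * 6 + [1] * 4 + [0] * 1

acc_a, f1_a = accuracy(y_true, pred_majority), f_measure(y_true, pred_majority)
acc_b, f1_b = accuracy(y_true, pred_balanced), f_measure(y_true, pred_balanced)

print(f"A: accuracy={acc_a:.2f}, F-measure={f1_a:.2f}")  # A: accuracy=0.95, F-measure=0.00
print(f"B: accuracy={acc_b:.2f}, F-measure={f1_b:.2f}")  # B: accuracy=0.93, F-measure=0.53
```

In this toy setting, classifier A scores higher on accuracy while never detecting a minority sample, whereas the F-measure ranks classifier B higher. That is the usual argument for reporting both measures on imbalanced benchmarks.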
