Journal of Integrative Bioinformatics

Improving imbalanced scientific text classification using sampling strategies and dictionaries


Abstract

Many real-world applications suffer from imbalanced class distributions, in which one class is represented by very few cases compared to the other classes. Systems for the retrieval and classification of scientific documentation are among those affected. Sampling strategies such as Oversampling and Subsampling are popular ways of tackling class imbalance. In this work, we study their effects on three types of classifiers (Knn, SVM and Naive-Bayes) applied to searches of the PubMed scientific database. A second purpose of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad-hoc subset of the UniProt database named Protein) using the mentioned classifiers and sampling strategies. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the Subsampling balancing technique. These results were compared with those obtained by other authors using the TREC Genomics 2005 public corpus.
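
To make the two sampling strategies named in the abstract concrete, here is a minimal sketch using only the Python standard library. The abstract does not say how examples are chosen, so uniform random selection is an assumption here; the function names, the toy documents, and the labels are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def group_by_class(docs, labels):
    """Collect documents into a dict keyed by class label."""
    groups = {}
    for doc, label in zip(docs, labels):
        groups.setdefault(label, []).append(doc)
    return groups


def random_subsample(docs, labels, seed=0):
    """Subsampling: randomly discard examples so that every class
    shrinks to the size of the smallest class (assumed variant)."""
    rng = random.Random(seed)
    groups = group_by_class(docs, labels)
    target = min(len(items) for items in groups.values())
    balanced = [(doc, label)
                for label, items in groups.items()
                for doc in rng.sample(items, target)]
    rng.shuffle(balanced)
    return balanced


def random_oversample(docs, labels, seed=0):
    """Oversampling: randomly duplicate examples so that every class
    grows to the size of the largest class (assumed variant)."""
    rng = random.Random(seed)
    groups = group_by_class(docs, labels)
    target = max(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((doc, label) for doc in items + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy corpus: 2 relevant documents against 8 irrelevant ones.
docs = [f"doc{i}" for i in range(10)]
labels = ["relevant"] * 2 + ["irrelevant"] * 8

sub_counts = Counter(label for _, label in random_subsample(docs, labels))
over_counts = Counter(label for _, label in random_oversample(docs, labels))
print(sub_counts)
print(over_counts)
```

In this sketch, subsampling leaves 2 documents per class and oversampling leaves 8 per class. A balanced set like either of these would then be passed to the classifier in place of the original training data.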
