Journal: Journal of Integrative Bioinformatics

Improving imbalanced scientific text classification using sampling strategies and dictionaries



Abstract

Many real-world applications suffer from an imbalanced class distribution, in which one class is represented by very few cases compared with the other classes. Systems for the retrieval and classification of scientific documents are among those affected. Sampling strategies such as oversampling and subsampling are popular ways of tackling class imbalance. In this work, we study their effects on three types of classifiers (kNN, SVM and Naive Bayes) when they are applied to searches of the PubMed scientific database. A second purpose of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad hoc subset of the UniProt database named Protein) using the classifiers and sampling strategies mentioned above. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the subsampling balancing technique. These results were compared with those obtained by other authors using the TREC Genomics 2005 public corpus.
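The two balancing strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the authors sample, so this example assumes the simplest variants: random subsampling (discard majority-class documents at random) and random oversampling (duplicate minority-class documents at random). The function names `subsample` and `oversample`, the toy documents, and the labels are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(docs, labels):
    """Collect documents into one list per class label."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return groups


def subsample(docs, labels, seed=0):
    """Randomly drop documents so every class matches the smallest class.

    Assumption: plain random subsampling, not necessarily the paper's variant.
    """
    rng = random.Random(seed)
    groups = _group_by_label(docs, labels)
    target = min(len(group) for group in groups.values())
    pairs = []
    for label, group in groups.items():
        pairs.extend((doc, label) for doc in rng.sample(group, target))
    rng.shuffle(pairs)
    return [d for d, _ in pairs], [l for _, l in pairs]


def oversample(docs, labels, seed=0):
    """Randomly duplicate documents so every class matches the largest class.

    Assumption: plain random oversampling with replacement.
    """
    rng = random.Random(seed)
    groups = _group_by_label(docs, labels)
    target = max(len(group) for group in groups.values())
    pairs = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        pairs.extend((doc, label) for doc in group + extra)
    rng.shuffle(pairs)
    return [d for d, _ in pairs], [l for _, l in pairs]


# Hypothetical toy corpus: 2 relevant documents versus 8 non-relevant ones.
toy_docs = [f"abstract {i}" for i in range(10)]
toy_labels = ["relevant"] * 2 + ["non-relevant"] * 8

sub_docs, sub_labels = subsample(toy_docs, toy_labels)
over_docs, over_labels = oversample(toy_docs, toy_labels)

print(Counter(sub_labels))   # class counts after subsampling
print(Counter(over_labels))  # class counts after oversampling
```

Subsampling shrinks the training set to the size of the minority class, while oversampling grows it to the size of the majority class. Either balanced set could then be passed to a classifier such as kNN, SVM or Naive Bayes.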
