Summary: Many real-world applications suffer from the class imbalance problem, in which one class is represented by far fewer examples than the others. Systems for retrieving and classifying scientific documents are among those affected. Sampling strategies such as oversampling and subsampling are popular ways of tackling class imbalance. In this work, we study their effects on three types of classifiers (kNN, SVM and Naive Bayes) applied to searches over the PubMed scientific database. A second aim of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three dictionaries (BioCreative, NLPBA, and an ad hoc subset of the UniProt database named Protein), using the classifiers and sampling strategies mentioned above. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the subsampling balancing technique. These results were compared with those obtained by other authors on the public TREC Genomics 2005 corpus.
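As an illustration of the two balancing strategies named in the summary, the sketch below applies random subsampling and random oversampling to a labelled collection. The summary does not say how the sampling was actually carried out, so random selection is an assumption here, and the function name `resample` and its parameters are hypothetical.

```python
import random
from collections import Counter


def resample(samples, labels, strategy="subsample", seed=0):
    """Balance classes by random resampling (illustrative sketch only).

    strategy="subsample":  shrink every class to the size of the smallest one.
    strategy="oversample": grow every class to the size of the largest one
                           by duplicating randomly chosen examples.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    sizes = [len(items) for items in by_class.values()]
    if strategy == "subsample":
        target = min(sizes)
    elif strategy == "oversample":
        target = max(sizes)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    balanced = []
    for y, items in by_class.items():
        if len(items) >= target:
            # Draw without replacement (keeps every item when sizes match).
            chosen = rng.sample(items, target)
        else:
            # Keep all originals and add random duplicates up to the target.
            chosen = items + rng.choices(items, k=target - len(items))
        balanced.extend((x, y) for x in chosen)

    rng.shuffle(balanced)
    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys


# Toy data: 8 non-relevant documents and 2 relevant ones.
docs = [f"doc{i}" for i in range(10)]
tags = ["non-relevant"] * 8 + ["relevant"] * 2

sub_docs, sub_tags = resample(docs, tags, strategy="subsample")
over_docs, over_tags = resample(docs, tags, strategy="oversample")
print(Counter(sub_tags), Counter(over_tags))
```

Subsampling discards majority-class documents, while oversampling repeats minority-class ones. In either case the resampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution.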