Improving imbalanced scientific text classification using sampling strategies and dictionaries

Lourdes Borrajo; Rubén Romero; Eva Lorenzo Iglesias; Carmen María Redondo Marey

Abstract

Many real-world applications suffer from imbalanced class distributions, in which one class is represented by very few cases compared with the others. Systems for the retrieval and classification of scientific documentation are among those affected. Sampling strategies such as Oversampling and Subsampling are popular ways of tackling class imbalance. In this work, we study their effects on three types of classifiers (kNN, SVM and Naive Bayes) when they are applied to searches of the PubMed scientific database. A second aim of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad hoc subset of the UniProt database named Protein) using the above classifiers and sampling strategies. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the Subsampling balancing technique. These results were compared with those obtained by other authors on the TREC Genomics 2005 public corpus.
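
The abstract does not say which variants of Oversampling and Subsampling were used, so the following is only a minimal sketch of the simplest random versions, written in Python with the standard library. The function names, the `(text, label)` sample format, and the toy documents are assumptions made for illustration and do not come from the paper.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen examples of the smaller classes
    until every class is as large as the largest one.
    Raises ValueError if samples is empty."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing examples with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_subsample(samples, seed=0):
    """Randomly discard examples of the larger classes
    until every class is as small as the smallest one.
    Raises ValueError if samples is empty."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        # Draw the kept examples without replacement.
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy corpus: 8 irrelevant documents and 2 relevant ones.
toy = [("doc %d" % i, "irrelevant") for i in range(8)]
toy += [("doc %d" % i, "relevant") for i in range(8, 10)]

over = random_oversample(toy)
sub = random_subsample(toy)
print(sorted(Counter(label for _, label in over).items()))
# [('irrelevant', 8), ('relevant', 8)]
print(sorted(Counter(label for _, label in sub).items()))
# [('irrelevant', 2), ('relevant', 2)]
```

In a setup like the one described, resampling of this kind would normally be applied to the training split only, so that the test documents keep their original class distribution.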

Bibliographic details

Source
Journal of Integrative Bioinformatics, 2011, Issue 3
Authors
Lourdes Borrajo; Rubén Romero; Eva Lorenzo Iglesias; Carmen María Redondo Marey
Original format
PDF
Chinese Library Classification
Bioinformatics

Similar literature

1. Improving imbalanced scientific text classification using sampling strategies and dictionaries [J]. L. Borrajo, R. Romero, E. L. Iglesias. Journal of Integrative Bioinformatics, 2011, Issue 3.
2. Sample cutting method for imbalanced text sentiment classification based on BRC [J]. Suge Wang, Deyu Li, Lidong Zhao. Knowledge-Based Systems, 2013, Issue JANa.
3. On strategies for imbalanced text classification using SVM: A comparative study [J]. Aixin Sun, Ee-Peng Lim, Ying Liu. Decision Support Systems, 2009, Issue 1.
4. A bi-directional sampling based on K-means method for imbalance text classification [C]. Jia Song, Xianglin Huang, Sijun Qin. IEEE/ACIS International Conference on Computer and Information Science, 2016.
5. Alleviating class imbalance using data sampling: Examining the effects on classification algorithms [D]. Napolitano, Amri E., 2006.
6. An improved survivability prognosis of breast cancer by using sampling and feature selection technique to solve imbalanced patient classification data [O]. Kung-Jeng Wang, Bunjira Makond, Kung-Min Wang, 2013.
7. Improving imbalanced scientific text classification using sampling strategies and dictionaries [O]. Borrajo Lourdes, Romero Rubén, Lorenzo Iglesias Eva, 2011.