Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation

Emily Seymour; Rohini Damle; Alessandro Sette; Bjoern Peters


Abstract

Background: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention.

Results: Utilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions, with an AUC of 0.899 and 0.854, respectively. Next, using a non-hierarchical and a hierarchical application of SVM classifiers trained on 22,833 curatable abstracts manually classified into three levels of disease-specific categories, we demonstrated that a hierarchical application of SVM classifiers outperformed non-hierarchical SVM classifiers for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved an overall category prediction accuracy of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.

Conclusions: A hierarchical application of SVM algorithms with cost sensitive output weighting enabled high quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema, and the datasets we make available provide large scale real-life benchmark sets for method developers.
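
The combination of top-down (hierarchical) classification with cost sensitive output weighting can be illustrated with a minimal sketch. This is not the authors' implementation: the category names, the cost values, the MANUAL fallback, and the stub scorers (which stand in for trained per-node SVMs with probability-like outputs) are all assumptions made for illustration. At each node of the category tree, the sketch picks the label with the lowest expected misclassification cost instead of the label with the highest score, so a costly cross-branch error is avoided when the scores are ambiguous.

```python
from typing import Callable, Dict, List

Probs = Dict[str, float]
CostMatrix = Dict[str, Dict[str, float]]  # cost[true_label][predicted_label]

MANUAL = "MANUAL"  # hypothetical fallback: hand the abstract to a curator


def min_expected_cost(probs: Probs, cost: CostMatrix) -> str:
    """Pick the prediction whose expected misclassification cost is lowest."""
    candidates = list(next(iter(cost.values())))  # all allowed predictions

    def expected(pred: str) -> float:
        return sum(p * cost[true][pred] for true, p in probs.items())

    return min(candidates, key=expected)


def classify_top_down(
    text: str,
    scorers: Dict[str, Callable[[str], Probs]],
    costs: Dict[str, CostMatrix],
    root: str = "root",
) -> List[str]:
    """Descend the category tree, making a cost sensitive choice at each node."""
    path: List[str] = []
    node = root
    while node in scorers:  # nodes without a scorer are leaves
        choice = min_expected_cost(scorers[node](text), costs[node])
        path.append(choice)
        if choice == MANUAL:
            break
        node = choice
    return path


# Stub scorers standing in for trained classifiers (assumed, illustrative only).
def root_scorer(text: str) -> Probs:
    if "virus" in text.lower():
        return {"Infectious": 0.9, "Allergy": 0.1}
    return {"Infectious": 0.6, "Allergy": 0.4}


def infectious_scorer(text: str) -> Probs:
    return {"Virus": 0.6, "Bacteria": 0.4}


SCORERS = {"root": root_scorer, "Infectious": infectious_scorer}

# Assumed costs: a wrong top-level branch is expensive (5), confusing
# siblings is cheap (1), and deferring to a curator always costs 1.
COSTS = {
    "root": {
        "Infectious": {"Infectious": 0, "Allergy": 5, MANUAL: 1},
        "Allergy": {"Infectious": 5, "Allergy": 0, MANUAL: 1},
    },
    "Infectious": {
        "Virus": {"Virus": 0, "Bacteria": 1, MANUAL: 1},
        "Bacteria": {"Virus": 1, "Bacteria": 0, MANUAL: 1},
    },
}

print(classify_top_down("Influenza virus epitope", SCORERS, COSTS))
print(classify_top_down("Peptide epitope mapping", SCORERS, COSTS))
```

With these assumed numbers, a confident top-level score lets the abstract descend to a leaf category, while an ambiguous 0.6/0.4 split is routed to manual review because guessing the wrong branch would cost more in expectation than deferring.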


Bibliographic details

Source: BMC Bioinformatics, 2011, Issue 1
Authors: Emily Seymour; Rohini Damle; Alessandro Sette; Bjoern Peters
Original format: PDF
Chinese Library Classification: Biological sciences

