BMC Bioinformatics

Cost sensitive hierarchical document classification to triage PubMed abstracts for manual curation

Abstract

Background: The Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently classified by hand into categories corresponding to different disease domains. Over the past four years, we have examined how to improve this approach in order to enhance classification performance and reduce the need for manual intervention.

Results: Utilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that an SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions, with AUCs of 0.899 and 0.854, respectively. Next, we trained SVM classifiers on 22,833 curatable abstracts that had been manually classified into three levels of disease-specific categories, and demonstrated that a hierarchical application of the classifiers outperformed a non-hierarchical one for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved overall category prediction accuracies of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.

Conclusions: A hierarchical application of SVM algorithms with cost-sensitive output weighting enabled high-quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema, and the datasets we make available provide large-scale, real-life benchmark sets for method developers.
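
The abstract does not specify the form of the cost sensitivity functions, so the following is only an illustrative sketch of one common way to make a hierarchical classifier's output cost-sensitive: choosing the category with the lowest expected misclassification cost rather than the highest score. The two-level category paths, the cost rule (errors that diverge higher in the hierarchy cost more), and the assumption that the classifiers emit calibrated per-category probabilities are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

Path = Tuple[str, ...]  # a category path, e.g. (level-1 label, level-2 label)


def misclassification_cost(true_path: Path, predicted_path: Path) -> float:
    """Hypothetical cost: the higher in the hierarchy two paths diverge,
    the more serious (and costly) the error. Identical paths cost 0."""
    depth = len(true_path)
    for level, (true_label, predicted_label) in enumerate(zip(true_path, predicted_path)):
        if true_label != predicted_label:
            return float(depth - level)
    return 0.0


def expected_costs(probabilities: Dict[Path, float]) -> Dict[Path, float]:
    """Expected cost of predicting each candidate path, given per-path
    probabilities (assumed here to come from calibrated classifier scores)."""
    return {
        candidate: sum(
            prob * misclassification_cost(true_path, candidate)
            for true_path, prob in probabilities.items()
        )
        for candidate in probabilities
    }


def cost_sensitive_prediction(probabilities: Dict[Path, float]) -> Path:
    """Pick the path with the lowest expected cost instead of the highest probability."""
    costs = expected_costs(probabilities)
    return min(costs, key=costs.get)


# Made-up scores for a single abstract; the category names are illustrative only.
example = {
    ("infectious", "viral"): 0.35,
    ("infectious", "bacterial"): 0.25,
    ("autoimmune", "diabetes"): 0.40,
}
print("highest probability:", max(example, key=example.get))
print("lowest expected cost:", cost_sensitive_prediction(example))
```

With these made-up scores, the highest-probability rule selects the autoimmune path, while the expected-cost rule selects the viral path: most of the probability mass lies in the infectious branch, so a prediction there is less likely to be wrong at the top level, which is the kind of serious misclassification the cost weighting is meant to avoid.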