Journal: BMC Bioinformatics

Exploiting likely-positive and unlabeled data to improve the identification of protein-protein interaction articles

Abstract

Background: Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be made faster by ranking newly published articles by their relevance to PPI, a task which we approach here by designing a machine-learning-based PPI classifier. All classifiers require labeled data, and the more labeled data available, the more reliable they become. Although many PPI databases with large numbers of labeled articles are available, incorporating these databases into the base training data may actually reduce classification performance, since the supplementary databases may not annotate exactly the same PPI types as the base training data. Our first goal in this paper is to find a method of selecting likely positive data from such supplementary databases. Extracting only likely positive data, however, will bias the classification model unless sufficient negative data is also added. Unfortunately, negative data is very hard to obtain because there are no resources that compile such information. Therefore, our second aim is to select such negative data from unlabeled PubMed data. Thirdly, we explore how to exploit these likely positive and negative data. Lastly, we look at the somewhat unrelated question of which term-weighting scheme is most effective for identifying PPI-related articles.
Results: To evaluate the performance of our PPI text classifier, we conducted experiments based on the BioCreAtIvE-II IAS dataset. Our results show that adding likely-labeled data generally increases AUC by 3 to 6%, indicating better ranking ability. Our experiments also show that our newly proposed term-weighting scheme has the highest AUC among all common weighting schemes. Our final model achieves an F-measure and an AUC that are 2.9% and 5.0% higher, respectively, than those of the top-ranking system in the IAS challenge.
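To make the link between AUC and ranking ability concrete, the following is a minimal sketch, not the evaluation code used in the paper. It computes AUC as the fraction of (positive, negative) article pairs in which the positive article receives the higher classifier score, with ties counted as half. The function name `auc` and the toy scores and labels are assumptions made purely for illustration.

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive article
    is scored above a randomly chosen negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): classifier scores and true PPI-relevance labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
print(round(auc(scores, labels), 4))
```

A curator who reads articles in descending score order benefits directly from a higher AUC, because more relevant articles appear ahead of irrelevant ones.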
Conclusion: Our experiments demonstrate the effectiveness of integrating unlabeled and likely labeled data to augment a PPI text classification system. Our mixed model is suitable for ranking purposes, whereas our hierarchical model is better for filtering. In addition, our results indicate that supervised weighting schemes outperform unsupervised ones. Our newly proposed weighting scheme, TFBRF, which considers documents that do not contain the target word, avoids some of the biases found in traditional weighting schemes. Our experimental results show TFBRF to be more effective than several other top weighting schemes.
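The difference between unsupervised and supervised term weighting can be illustrated with a small sketch. The abstract does not give the TFBRF formula, so it is not reproduced here. As a stand-in supervised scheme, the sketch uses TF-RF (relevance frequency), and it uses TF-IDF as the unsupervised baseline. The function names and all document counts are hypothetical values chosen for illustration.

```python
import math


def tf_idf(tf, n_docs, docs_with_term):
    """Unsupervised weight: ignores class labels entirely."""
    return tf * math.log(n_docs / docs_with_term)


def tf_rf(tf, pos_with_term, neg_with_term):
    """Supervised weight (relevance frequency): favors terms that occur
    more often in positive documents than in negative ones."""
    return tf * math.log2(2 + pos_with_term / max(1, neg_with_term))


# Hypothetical corpus of 1000 abstracts. Both terms occur in 100 documents,
# but they are distributed very differently across the two classes.
n_docs = 1000
interacts = {"pos": 80, "neg": 20}  # mostly in PPI-relevant abstracts
patient = {"pos": 10, "neg": 90}    # mostly in irrelevant abstracts

idf_interacts = tf_idf(1, n_docs, interacts["pos"] + interacts["neg"])
idf_patient = tf_idf(1, n_docs, patient["pos"] + patient["neg"])
rf_interacts = tf_rf(1, interacts["pos"], interacts["neg"])
rf_patient = tf_rf(1, patient["pos"], patient["neg"])

print(idf_interacts == idf_patient)  # TF-IDF cannot tell the terms apart
print(rf_interacts > rf_patient)     # TF-RF ranks the PPI-related term higher
```

Because TF-IDF depends only on how many documents contain a term, it assigns both terms the same weight, while a label-aware scheme separates them.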
