Published in: 《计算机技术与发展》 (Computer Technology and Development)

Protein-Protein Interaction Identification Based on Cross Prediction

Abstract

Distant supervision for protein-protein interaction (PPI) extraction generates large-scale training data by aligning entity pairs in a knowledge base with entity mentions in text, which effectively addresses the shortage of hand-labeled data. However, training data produced this way contains a large amount of noise, that is, sentences that are labeled incorrectly. This paper proposes a cross-prediction method to remove that noise. First, the training data is randomly divided into k groups; one group serves as the prediction set and the remaining k-1 groups serve as the training set. The roles are rotated k times, so that each group is predicted and denoised by a model trained on the other k-1 groups. Next, the denoised groups are recombined into a new training set, and two models are trained, one on the data before denoising and one on the data after. Finally, both models are tested on a manually annotated corpus. The experimental results show that cross prediction effectively identifies noise in the training data and thereby improves PPI extraction performance.
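The rotation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say how a predicted group is denoised, so the sketch assumes an example is dropped when the model's prediction disagrees with its distant-supervision label. The names `cross_prediction_denoise` and `train_nearest_centroid` are hypothetical, and the nearest-centroid classifier is only a toy stand-in for whatever PPI model the paper uses.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]      # (feature vector, distant-supervision label)
Model = Callable[[Sequence[float]], int]   # maps a feature vector to a predicted label


def cross_prediction_denoise(
    data: List[Example],
    train_fn: Callable[[List[Example]], Model],
    k: int = 5,
    seed: int = 0,
) -> List[Example]:
    """Split data into k random groups; predict each group with a model
    trained on the other k-1 groups and return the recombined, filtered data."""
    if not 2 <= k <= len(data):
        raise ValueError("k must be between 2 and len(data)")
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]

    cleaned: List[Example] = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(training)
        # Assumed denoising rule (not stated in the abstract): keep an example
        # only if the model's prediction agrees with its label.
        cleaned.extend(ex for ex in held_out if model(ex[0]) == ex[1])
    return cleaned


def train_nearest_centroid(training: List[Example]) -> Model:
    """Toy stand-in classifier: predicts the label of the closest class centroid."""
    sums, counts = {}, {}
    for features, label in training:
        acc = sums.setdefault(label, [0.0] * len(features))
        for d, value in enumerate(features):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lbl: [s / counts[lbl] for s in acc] for lbl, acc in sums.items()}

    def predict(features: Sequence[float]) -> int:
        return min(
            centroids,
            key=lambda lbl: sum((a - b) ** 2 for a, b in zip(features, centroids[lbl])),
        )

    return predict
```

To reproduce the comparison in the abstract, one would then train one model on `data` and another on the output of `cross_prediction_denoise(data, train_fn, k)`, and evaluate both on the same hand-labeled test corpus.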
