Conference: International Conference on Soft Computing and Intelligent Systems; International Symposium on Advanced Intelligent Systems

Protein named entity classification with probabilistic features derived from GENIA corpus and MEDLINE


Abstract

Biomedical named entity recognition (BNER) is one of the most essential initial tasks of biomedical information retrieval, and later steps such as discovering relations between biomedical entities and identifying molecular pathways depend on it. Although named entity recognition performs well on ordinary text, it remains challenging in the molecular biology domain because of the complex nature of biomedical nomenclature, the variety of spelling forms, and other factors. Even when biomedical entities are found successfully in biological text, classifying them into relevant biomedical classes such as genes, proteins, diseases, and drug names is a further challenge and an open question. This paper presents a new method for classifying biomedical named entities into protein and non-protein classes. Our approach employs Random Forest, a machine learning algorithm, with a new combination of features: orthographic, keyword, and morphological features, together with a probabilistic feature called Proteinhood and a Protein-Score feature based on the MEDLINE abstracts indexed in PubMed. The latter two features are the main contributions of the paper. A series of experiments compares the proposed approach with other state-of-the-art approaches. In the experiments on the GENIA corpus, our protein named entity classifier achieves its highest values of 93.8% precision, 83.8% recall, and 88.5% F-measure for protein named entity identification. The study also shows the effect of the new Proteinhood and Protein-Score features and of adjusting the parameters of the Random Forest algorithm.
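To make the surface-level feature families and the reported scores more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the keyword list, the suffix list, and the function names `extract_features` and `f_measure` are illustrative assumptions, since the abstract does not enumerate the actual features. The Proteinhood and Protein-Score features are left out because the abstract does not define how they are computed. The F-measure helper assumes the usual balanced harmonic mean of precision and recall, which is consistent with the reported figures.

```python
# Hypothetical keyword list; the abstract does not say which keywords are used.
PROTEIN_KEYWORDS = {"protein", "receptor", "kinase", "factor", "enzyme"}

# Hypothetical suffixes standing in for the morphological features.
PROTEIN_SUFFIXES = ("ase", "in", "or")


def extract_features(entity: str) -> dict:
    """Return simple orthographic, keyword, and morphological features."""
    tokens = entity.lower().split()
    last_token = tokens[-1] if tokens else ""
    return {
        # Orthographic: character-level shape of the entity mention.
        "has_digit": any(ch.isdigit() for ch in entity),
        "has_uppercase": any(ch.isupper() for ch in entity),
        "has_hyphen": "-" in entity,
        # Keyword: does any token match the (assumed) keyword list?
        "has_keyword": any(tok in PROTEIN_KEYWORDS for tok in tokens),
        # Morphological: does the final token carry an (assumed) suffix?
        "has_protein_suffix": last_token.endswith(PROTEIN_SUFFIXES),
    }


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(extract_features("protein kinase C"))
    # Check the reported scores against each other.
    print(round(100 * f_measure(0.938, 0.838), 1))
```

Feature dictionaries of this kind would still need to be converted to numeric vectors before being passed to a Random Forest implementation; that step, and the classifier's parameter settings, are outside what the abstract specifies.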
