BMC Bioinformatics

Building a protein name dictionary from full text: a machine learning term extraction approach


Abstract

Background: The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants that are used to refer to biological entities in the literature.

Results: We present an approach to recognize named entities in full text. The approach collects high frequency terms in an article, and uses support vector machines (SVM) to identify biological entity names. It is also computationally efficient and robust to noise commonly found in full text material. We use the method to create a protein name dictionary from a set of 80,528 full text articles. Only 8.3% of the names in this dictionary match SwissProt description lines. We assess the quality of the dictionary by studying its protein name recognition performance in full text.

Conclusion: This dictionary term lookup method compares favourably to other published methods, supporting the significance of our direct extraction approach. The method is strong in recognizing name variants not found in SwissProt.
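The first step described in the abstract, collecting high frequency terms in an article, can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization rule, the maximum term length (`max_len`), the frequency cutoff (`min_count`), and the function name `candidate_terms` are all assumptions made for illustration, and the SVM classification step is not shown because the abstract does not describe its features.

```python
import re
from collections import Counter


def candidate_terms(text, max_len=3, min_count=2):
    """Return word n-grams (1..max_len words) seen at least min_count times.

    Hypothetical helper: the tokenization and thresholds are assumptions,
    not values taken from the paper.
    """
    # Assumed tokenization: lowercase runs of letters, digits and hyphens.
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Keep only terms that recur within the article.
    return {term: c for term, c in counts.items() if c >= min_count}


# Made-up example text, not taken from the corpus used in the paper.
article = (
    "Protein kinase C phosphorylates the substrate. "
    "Activation of protein kinase C requires calcium. "
    "Calcium binds the regulatory domain."
)
terms = candidate_terms(article)
```

In a pipeline like the one the abstract outlines, candidates such as "protein kinase c" would then be passed to a classifier that decides which of them are biological entity names. Common words such as "the" also survive a purely frequency-based filter, which is why a classification step is needed.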
