Journal: BMC Bioinformatics

BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature

Abstract

Background: Text mining tools are becoming essential for automatically processing large quantities of biological literature for knowledge discovery and information curation. Abbreviation recognition is related to named entity recognition (NER) and can be treated as a pair recognition task: finding a term and its corresponding abbreviation in free text. Successfully identifying an abbreviation and its definition is not only a prerequisite for indexing terms in text databases to retrieve articles of related interest, but also a building block for improving existing gene mention tagging and gene normalization tools.

Results: Our approach to abbreviation recognition (AR) is based on machine learning and exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the result achieved by the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On our annotated corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems.

Conclusion: By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide offline AR software in the download section at http://bioagent.iis.sinica.edu.tw/BIOADI/ .
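The reported scores can be checked against each other with a few lines of Python. This is a minimal sketch that assumes the F-score in the abstract is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function and variable names here are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is F1; the weighting is not stated there.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, written as fractions.
ab3p_f = f_score(0.9586, 0.8464)       # AB3P corpus
annotated_f = f_score(0.9352, 0.7995)  # 1200-abstract annotated corpus

print(f"AB3P corpus:      {ab3p_f:.2%}")       # 89.90%
print(f"Annotated corpus: {annotated_f:.2%}")  # 86.20%
```

Under that assumption, both computed values agree with the F-scores quoted in the abstract to two decimal places.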