BMC Bioinformatics

Machine learning with naturally labeled data for identifying abbreviation definitions

Abstract

Background: The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches to the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data.

Methods: In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the full form of an abbreviation. This approach does not require manually labeled training data.

Results: We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on the Ab3P corpus and an F-measure of 87.13% on the BIOADI corpus, both of which are superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
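The construction of naturally labeled data described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how potential pairs are extracted, so the parenthesis pattern, the `max_words` window, and the names `candidate_pairs` and `build_training_set` are all assumptions made for illustration. Classifier training and feature extraction are left out.

```python
import random
import re

# Assumed candidate pattern: a short form in parentheses, e.g. "... long form (SF)".
# The abstract does not specify the extraction rule; this regex is hypothetical.
PAREN = re.compile(r"\(([^()]{2,10})\)")


def candidate_pairs(sentence, max_words=8):
    """Return (short_form, preceding_text) candidates found in one sentence.

    The preceding text is the window of up to `max_words` words before the
    parenthesis; it is only a *potential* definition, not a verified one.
    """
    pairs = []
    for match in PAREN.finditer(sentence):
        short = match.group(1).strip()
        words = sentence[:match.start()].split()
        if short and words:
            pairs.append((short, " ".join(words[-max_words:])))
    return pairs


def build_training_set(sentences, seed=0):
    """Build (short_form, text, label) examples without manual annotation.

    Label 1: pairs that occur together in text (naturally labeled positives).
    Label 0: a short form mixed with the text of a randomly chosen other pair.
    """
    rng = random.Random(seed)
    positives = [pair for s in sentences for pair in candidate_pairs(s)]
    negatives = []
    for short, _ in positives:
        # Assumption: "unrelated" is approximated by a different short form.
        others = [text for other, text in positives if other != short]
        if others:
            negatives.append((short, rng.choice(others)))
    return ([(s, t, 1) for s, t in positives]
            + [(s, t, 0) for s, t in negatives])


if __name__ == "__main__":
    corpus = [
        "We used the hidden Markov model (HMM) here.",
        "The polymerase chain reaction (PCR) was run.",
    ]
    for example in build_training_set(corpus):
        print(example)
```

A classifier trained on such examples would learn feature weights that separate co-occurring pairs from mixed ones. According to the abstract, those weights are then used to pick out the full form within the candidate text.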
