Journal of Cheminformatics

A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature


Abstract

Background: Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity recognition systems, the Spanish National Cancer Research Center (CNIO) and The University of Navarra organized a challenge on Chemical and Drug Named Entity Recognition (CHEMDNER). The CHEMDNER challenge contains two individual subtasks: 1) Chemical Entity Mention recognition (CEM); and 2) Chemical Document Indexing (CDI). Our study proposes machine learning-based systems for the CEM task.

Methods: The 2013 CHEMDNER challenge organizers provided 10,000 UTF8-encoded PubMed abstracts, manually annotated according to a predefined annotation guideline: a training set of 3,500 abstracts, a development set of 3,500 abstracts and a test set of 3,000 abstracts. We developed two machine learning-based systems for the CEM task on this data set, one based on conditional random fields (CRF) and one based on structured support vector machines (SSVM). We also investigated the effects of three types of word representation (WR) features, generated by Brown clustering, random indexing and skip-gram, on both systems. The performance of our systems was evaluated on the test set using scripts provided by the CHEMDNER challenge organizers. The primary evaluation measures were micro-averaged precision, recall, and F-measure.

Results: Our best system was among the top-ranked systems, with an official micro F-measure of 85.05%. Fixing a bug caused by inconsistent features marginally improved the performance of the system, to a micro F-measure of 85.20%.

Conclusions: The SSVM-based CEM systems outperformed the CRF-based CEM systems when using the same features. Each type of WR feature was beneficial to the CEM task. Both the CRF-based and SSVM-based systems using all three types of WR features performed better than the systems using only one type of WR feature.
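The micro-averaged measures named in the abstract pool true positives, false positives and false negatives over all abstracts before computing precision, recall and F-measure. The sketch below illustrates that pooling only; it is not the official CHEMDNER evaluation script. It assumes a predicted mention counts as correct only when its character span exactly matches a gold mention, and the function name `micro_prf`, the document ids and the offsets are hypothetical.

```python
from typing import Dict, Set, Tuple

# A mention is a (start_offset, end_offset) pair within one abstract.
# Exact span matching is an assumption made for this illustration.
Mention = Tuple[int, int]


def micro_prf(
    gold: Dict[str, Set[Mention]],
    pred: Dict[str, Set[Mention]],
) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F-measure)."""
    tp = fp = fn = 0
    # Pool the counts over every document seen in either set.
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted and annotated
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: two abstracts, three gold mentions in total.
gold = {"doc1": {(0, 7), (15, 22)}, "doc2": {(3, 9)}}
pred = {"doc1": {(0, 7)}, "doc2": {(3, 9), (20, 25)}}
precision, recall, f_measure = micro_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

Because the counts are pooled before dividing, abstracts with many mentions weigh more heavily than abstracts with few, which is what distinguishes micro averaging from macro averaging over per-document scores.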
