Journal of Biomedical Semantics

Detecting concept mentions in biomedical text using hidden Markov model: multiple concept types at once or one at a time?

Abstract

Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or for one or a few selected types at a time (one-type-at-a-time or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance.

Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (the i2b2/VA corpus) and a biology literature corpus annotated with five concept types (the JNLPBA corpus). Ten-fold cross-validation tests were conducted, and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores for the former strategies were higher than those for the latter by 0.9 to 2.6% on the i2b2/VA corpus and by 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy.

Conclusions: The current results suggest that detection of concept phrases could be improved by tackling multiple concept types simultaneously. This also suggests that multiple concept types should be annotated when developing a new corpus for machine learning models. Further investigation is needed to gain insight into the underlying mechanism by which good performance is achieved when multiple concept types are considered.
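To make the difference between the two training strategies concrete, the sketch below shows how a single annotated sentence maps to the tag sequences a sequence tagger such as an HMM would be trained on under each strategy. It is an illustration only: the abstract does not specify the tag encoding, so a BIO scheme over the three i2b2/VA concept types (problem, test, treatment) and an invented example sentence are assumed.

```python
# A minimal sketch, not from the paper: the abstract does not state the exact
# tag encoding, so a BIO scheme over the three i2b2/VA concept types
# (problem, test, treatment) is assumed here for illustration.

# Toy annotated sentence: (token, concept type or None)
sentence = [
    ("chest", "problem"), ("pain", "problem"),
    ("was", None), ("evaluated", None), ("with", None),
    ("an", None), ("EKG", "test"),
]

def bio_tags(tokens, keep_types):
    """Encode tokens as BIO tags, keeping only the given concept types."""
    tags, prev = [], None
    for _, ctype in tokens:
        if ctype in keep_types:
            tags.append(("I-" if ctype == prev else "B-") + ctype)
        else:
            tags.append("O")
        prev = ctype
    return tags

# All-types-at-once: a single model learns one tag sequence covering
# every concept type (label space: B/I for each type, plus O).
print(bio_tags(sentence, {"problem", "test", "treatment"}))
# ['B-problem', 'I-problem', 'O', 'O', 'O', 'O', 'B-test']

# One-type-at-a-time: a separate model per concept type, each trained
# on a reduced tag sequence in which the other types are collapsed to O.
for ctype in ("problem", "test", "treatment"):
    print(ctype, bio_tags(sentence, {ctype}))
# problem   ['B-problem', 'I-problem', 'O', 'O', 'O', 'O', 'O']
# test      ['O', 'O', 'O', 'O', 'O', 'O', 'B-test']
# treatment ['O', 'O', 'O', 'O', 'O', 'O', 'O']
```

Under the one-type-at-a-time encoding, a model never sees the competing concept types as distinct labels, which is consistent with the abstract's observation that handling all types at once reduces type confusion.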
