Source: Journal of biomedical informatics

Comparison of character-level and part of speech features for name recognition in biomedical texts.

Abstract

The immense volume of data now available from experiments in molecular biology has led to an explosion in reported results, most of which are available only in unstructured text format. For this reason there has been great interest in text mining to aid in fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular, there has been intensive investigation into the named entity (NE) task as a core technology in all of these tasks, driven by the availability of high-volume training sets such as the GENIA v3.02 corpus. Despite such large training sets, accuracy for biology NE has proven to be consistently far below the high levels of performance in the news domain, where F scores above 90 are commonly reported and can be considered near to human performance. We argue that it is crucial to apply more rigorous analysis of the factors that contribute to the model's performance, in order to discover where the underlying limitations are and what our future research direction should be. Our investigation in this paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and compares how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines, using a high-quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best performing features are orthographic features, with an F score of 72.6. Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, at an F score of 68.6, this is still significantly below the orthographic features. In combination, these two feature types appear to interfere with each other and degrade performance slightly, to an F score of 72.3.
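
To make the notion of character-level orthographic features concrete, here is a minimal Python sketch of the kind of per-token indicators such a feature set might contain. The abstract does not list the feature classes actually used in the paper, so the function names (`orthographic_features`, `token_shape`), the chosen indicators, and the small Greek-letter list are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative only: the abstract does not specify the paper's feature set.
# GREEK is a hypothetical, deliberately incomplete word list.
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}


def orthographic_features(token: str) -> dict:
    """Map a token to simple character-level (orthographic) indicators."""
    return {
        # First letter uppercase, the rest lowercase, e.g. "Interleukin"
        "init_cap": token[:1].isupper() and token[1:].islower(),
        # Purely alphabetic and fully uppercase, e.g. "DNA"
        "all_caps": token.isalpha() and token.isupper(),
        # Contains at least one digit, e.g. "p53"
        "has_digit": any(c.isdigit() for c in token),
        # Contains a hyphen, e.g. "IL-2"
        "has_hyphen": "-" in token,
        # Lowercase letters plus an uppercase letter after position 0
        "mixed_case": any(c.islower() for c in token)
        and any(c.isupper() for c in token[1:]),
        # Spelled-out Greek letter, e.g. "kappa"
        "greek": token.lower() in GREEK,
    }


def token_shape(token: str) -> str:
    """Collapse characters to classes (A upper, a lower, 0 digit,
    anything else kept as-is) and squeeze repeated classes."""
    classes = []
    for c in token:
        if c.isupper():
            k = "A"
        elif c.islower():
            k = "a"
        elif c.isdigit():
            k = "0"
        else:
            k = c
        if not classes or classes[-1] != k:
            classes.append(k)
    return "".join(classes)


if __name__ == "__main__":
    for tok in ["IL-2", "NF-kappaB", "p53", "Interleukin"]:
        print(tok, token_shape(tok), orthographic_features(tok))
```

In a setup like the one the abstract describes, indicators of this kind would be encoded as input features for each token, alone or alongside POS tags, before training the classifier. How the paper actually encodes and combines them is not stated here.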
