Journal: Biodiversity Data Journal

COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature

Abstract

Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. To alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is an F-score of approximately 81.86%. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.

Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.
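
The abstract reports inter-annotator agreement as an F-score. As a rough illustration of how such a figure can be computed, the sketch below treats one annotator's entities as the reference and scores the other annotator against them. The span format, the category labels and the exact-match criterion (same document, same offsets, same category) are assumptions made for this example; the abstract does not state which matching criterion the authors used.

```python
from typing import Set, Tuple

# Hypothetical span format: (document id, start offset, end offset, category).
Span = Tuple[str, int, int, str]


def agreement_f_score(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """F-score of annotator B's spans, with annotator A's spans as reference.

    Assumes exact matching: two spans agree only if document, offsets and
    category are all identical.
    """
    if not annotator_a or not annotator_b:
        return 0.0
    matched = len(annotator_a & annotator_b)
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: the annotators disagree on the boundary of the habitat entity.
a = {
    ("doc1", 0, 12, "Taxon"),
    ("doc1", 20, 28, "Habitat"),
    ("doc1", 40, 46, "GeographicalLocation"),
}
b = {
    ("doc1", 0, 12, "Taxon"),
    ("doc1", 20, 30, "Habitat"),
    ("doc1", 40, 46, "GeographicalLocation"),
}
score = agreement_f_score(a, b)
```

Under exact matching the score is symmetric: swapping the two annotators swaps precision and recall, which leaves their harmonic mean unchanged. A relaxed criterion, such as counting overlapping spans as agreement, would give a higher score on the toy data above.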
