COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature

Abstract

Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.

Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving the biodiversity.
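As an illustration of how an inter-annotator agreement figure such as the F-score quoted in the abstract can be computed, the sketch below treats one annotator's entities as the reference and scores the other annotator against them. The abstract does not state the matching criterion, so exact matching on document, span offsets and category is an assumption here, and the `Entity` type, the `agreement_f_score` function and the sample annotations are all hypothetical.

```python
from typing import NamedTuple, Set


class Entity(NamedTuple):
    """One annotated entity (hypothetical representation, not the corpus format)."""
    doc_id: str
    start: int      # character offset where the span begins
    end: int        # character offset where the span ends
    category: str   # e.g. "Taxon", "Habitat"


def agreement_f_score(annotator_a: Set[Entity], annotator_b: Set[Entity]) -> float:
    """F-score of annotator B measured against annotator A.

    Assumption: two entities agree only if document, span and category
    are all identical (exact matching).
    """
    if not annotator_a or not annotator_b:
        return 0.0
    matched = len(annotator_a & annotator_b)
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations for a single document.
a = {
    Entity("d1", 0, 13, "Taxon"),
    Entity("d1", 30, 38, "GeographicalLocation"),
    Entity("d1", 45, 58, "Habitat"),
}
b = {
    Entity("d1", 0, 13, "Taxon"),
    Entity("d1", 30, 38, "GeographicalLocation"),
    Entity("d1", 45, 52, "Habitat"),           # span boundary differs from A
    Entity("d1", 60, 64, "TemporalExpression"),
}
score = agreement_f_score(a, b)
```

Under exact matching the score is symmetric: swapping the two annotators swaps precision and recall but leaves their harmonic mean unchanged. A disagreement on span boundaries, as with the habitat entity in the sample data, counts as a miss for both annotators.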

Bibliographic record

Journal: Biodiversity Data Journal

Authors: Nhung T.H. Nguyen; Roselyn S. Gabud; Sophia Ananiadou

Year: 2019

Volume: 7

Article number: e29626

DOI: https://doi.org/10.3897/BDJ.7.e29626

Total pages: 23

Original format: PDF

Chinese Library Classification: Biology

Keywords: Biodiversity, text mining, named entity recognition, species occurrence, gold standard

Date indexed: 2022-08-17 14:57:38
