New directions in biomedical text annotation: definitions, guidelines and corpus construction

W John Wilbur; Andrey Rzhetsky; Hagit Shatkay

Abstract

Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.

Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.

Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.
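
As an illustration of how an inter-annotator agreement figure of this kind can be computed, the following is a minimal sketch of mean pairwise percentage agreement on a single dimension. The abstract does not state which agreement measure the authors used, so the measure itself is an assumption, as are the function name `pairwise_agreement`, the annotator names, and the toy labels, none of which come from the paper.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Mean pairwise percentage agreement for one annotation dimension.

    `annotations` maps an annotator name to that annotator's list of labels,
    one label per sentence, with all lists in the same sentence order.
    This is an illustrative measure, not necessarily the one used in the paper.
    """
    annotators = sorted(annotations)
    if len(annotators) < 2:
        raise ValueError("need at least two annotators")
    n_items = len(annotations[annotators[0]])
    if n_items == 0:
        raise ValueError("need at least one annotated sentence")
    if any(len(annotations[a]) != n_items for a in annotators):
        raise ValueError("all annotators must label the same sentences")

    scores = []
    for a, b in combinations(annotators, 2):
        # Fraction of sentences on which this pair chose the same label.
        matches = sum(x == y for x, y in zip(annotations[a], annotations[b]))
        scores.append(matches / n_items)
    return sum(scores) / len(scores)


# Hypothetical toy data: three annotators, four sentences, two made-up labels.
toy = {
    "ann1": ["P", "P", "N", "P"],
    "ann2": ["P", "N", "N", "P"],
    "ann3": ["P", "P", "N", "N"],
}
result = pairwise_agreement(toy)
```

Raw percentage agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa could be substituted in the same loop structure.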


Bibliographic record

Source: BMC Bioinformatics, 2006, Issue 1
Authors: W John Wilbur; Andrey Rzhetsky; Hagit Shatkay
Original format: PDF
Classification (Chinese Library Classification): Biological sciences
