BMC Bioinformatics

New directions in biomedical text annotation: definitions, guidelines and corpus construction

Abstract

Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.

Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.

Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.
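
As an illustration of how an inter-annotator agreement figure like the one reported above might be computed, the following is a minimal Python sketch of mean pairwise percent agreement for a single dimension. The abstract does not state which agreement statistic the authors used, so this measure is an assumption. The annotator names, the label values ("P"/"N" for polarity, integers for certainty) and the helper `build` are hypothetical and chosen only for the toy data.

```python
from itertools import combinations


def pairwise_agreement(annotations, dimension):
    """Mean fraction of sentences on which two annotators give the same label.

    annotations: dict mapping annotator name -> list of per-sentence dicts,
    each dict mapping a dimension name to a label. All lists are assumed to
    have the same length and the same sentence order.
    """
    scores = []
    # Compare every unordered pair of annotators.
    for a, b in combinations(sorted(annotations), 2):
        labels_a = [sentence[dimension] for sentence in annotations[a]]
        labels_b = [sentence[dimension] for sentence in annotations[b]]
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        scores.append(matches / len(labels_a))
    return sum(scores) / len(scores)


def build(polarity, certainty):
    """Hypothetical helper: turn two label columns into per-sentence dicts."""
    return [{"polarity": p, "certainty": c} for p, c in zip(polarity, certainty)]


# Toy data: three made-up annotators labelling the same four sentences.
toy = {
    "ann1": build("PPNP", [3, 2, 3, 0]),
    "ann2": build("PNNP", [3, 2, 1, 0]),
    "ann3": build("PPNP", [3, 3, 3, 0]),
}

for dim in ("polarity", "certainty"):
    print(dim, round(pairwise_agreement(toy, dim), 3))
```

Raw percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa is a common alternative when label distributions are skewed.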