Journal of Medical Internet Research

Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing


Abstract

Background: A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field, and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of crowdsourced biomedical NLP corpora was never exceptional when compared to traditionally-developed gold standards. Previously reported results on a medical named entity annotation task showed a 0.68 F-measure agreement between crowdsourced and traditionally-developed corpora.

Objective: Building upon previous work from general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain, with special emphasis on achieving high agreement between crowdsourced and traditionally-developed corpora.

Methods: To build the gold standard for evaluating the crowdsourcing workers' performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd's work and tested for statistical significance (P<.001, chi-square test) to detect differences between the crowdsourced and traditionally-developed annotations.

Results: The agreement between the crowd's annotations and the traditionally-generated corpora was high for (1) annotations (F-measure 0.87 for medication names; 0.73 for medication types) and (2) correction of previous annotations (0.90 for medication names; 0.76 for medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowdsourced and traditionally-generated corpora. Our results showed a 27.9% improvement over previously reported results on the medication named entity annotation task.

Conclusions: This study offers three contributions. First, we showed that crowdsourcing is a feasible, inexpensive, fast, and practical approach to collecting high-quality annotations for clinical text (when protected health information is excluded). We believe that well-designed user interfaces and a rigorous quality control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code that is necessary to utilize CrowdFlower's quality control and crowdsourcing interfaces for named entity annotations. Finally, to spur future research, we will release the CTA annotations generated by the traditional and crowdsourced approaches.
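The Methods and Results describe two concrete computations: aggregating redundant crowd judgments by simple voting, and scoring the aggregated annotations against the traditionally-developed gold standard with sensitivity, precision, and F-measure. The sketch below illustrates both steps on hypothetical span-level data; it is not the authors' released JavaScript/CML infrastructure, and the data structures and toy values are assumptions.

```python
# Minimal sketch, assuming each worker's judgment is a set of entity spans
# (start, end, label). Not the released CrowdFlower/CML code.
from collections import Counter

def majority_vote(judgments):
    """Aggregate one unit's redundant crowd judgments by simple voting:
    keep a span if more than half of the workers marked it."""
    votes = Counter(span for worker in judgments for span in worker)
    quorum = len(judgments) / 2
    return {span for span, n in votes.items() if n > quorum}

def evaluate(crowd, gold):
    """Span-level sensitivity (recall), precision, and F-measure."""
    tp = len(crowd & gold)   # spans found by both crowd and gold standard
    fp = len(crowd - gold)   # crowd spans absent from the gold standard
    fn = len(gold - crowd)   # gold spans the crowd missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = (2 * precision * sensitivity / (precision + sensitivity)
         if precision + sensitivity else 0.0)
    return sensitivity, precision, f

# Toy example: three workers annotate medication names in one CTA sentence.
workers = [
    frozenset({(10, 17, "MED"), (42, 49, "MED")}),
    frozenset({(10, 17, "MED")}),
    frozenset({(10, 17, "MED"), (42, 49, "MED")}),
]
gold = {(10, 17, "MED"), (42, 49, "MED")}

crowd = majority_vote(workers)
print(evaluate(crowd, gold))  # -> (1.0, 1.0, 1.0) on this toy example
```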
