Journal: Database

A crowdsourcing workflow for extracting chemical-induced disease relations from free text

Abstract

Relations between chemicals and diseases are among the most frequently queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, so alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text, developed as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make a binary judgment about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained a 0.505 F-score (0.475 precision, 0.540 recall), with a maximum theoretical recall of 0.751 due to errors in named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract) and the task took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available at https://github.com/SuLab/crowd_cid_relex. Database URL: https://github.com/SuLab/crowd_cid_relex
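The vote-aggregation and evaluation scheme described in the abstract (a relation is predicted true when at least 4 of 5 workers vote for it; predictions are scored by precision, recall, and F-score against a gold standard) can be sketched as follows. The chemical-disease pairs and vote counts below are illustrative only, not data from the study:

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (chemical, disease)

def aggregate_votes(votes: Dict[Pair, List[bool]], threshold: int = 4) -> Dict[Pair, bool]:
    """Predict a relation as true when it receives at least `threshold`
    positive votes (the workflow uses 4 of 5 workers)."""
    return {pair: sum(v) >= threshold for pair, v in votes.items()}

def precision_recall_f1(predicted: set, gold: set) -> Tuple[float, float, float]:
    """Standard set-based precision, recall, and F-score."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example with hypothetical worker judgments.
votes = {
    ("aspirin", "ulcer"): [True, True, True, True, False],    # 4 votes -> predicted true
    ("lithium", "fever"): [True, True, False, False, False],  # 2 votes -> predicted false
}
preds = aggregate_votes(votes)
predicted_true = {pair for pair, is_true in preds.items() if is_true}
gold = {("aspirin", "ulcer")}
p, r, f = precision_recall_f1(predicted_true, gold)
```

The 4-of-5 threshold trades recall for precision: a simple majority (3 of 5) would accept more relations but admit more worker noise.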
