Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Abstract

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at .
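
The abstract does not state how the harmonized annotations or the inter-annotator agreement scores were computed. As a minimal sketch only, the Python below assumes one plausible scheme: annotations are exact-match `(start, end, label)` spans, harmonization keeps spans marked by a majority of groups, and agreement is the pairwise F-score between groups. The span representation, the `min_votes` threshold, the function names, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical representation: (start offset, end offset, entity label).
Annotation = tuple[int, int, str]


def harmonize(groups: dict[str, set[Annotation]], min_votes: int = 2) -> set[Annotation]:
    """Keep annotations marked identically by at least `min_votes` groups (assumed rule)."""
    votes = Counter(ann for anns in groups.values() for ann in anns)
    return {ann for ann, count in votes.items() if count >= min_votes}


def f_score(reference: set[Annotation], candidate: set[Annotation]) -> float:
    """Exact-match F-score of `candidate` against `reference`."""
    if not reference and not candidate:
        return 1.0
    overlap = len(reference & candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def pairwise_agreement(groups: dict[str, set[Annotation]]) -> dict[tuple[str, str], float]:
    """F-score for every unordered pair of annotator groups."""
    return {
        (g1, g2): f_score(groups[g1], groups[g2])
        for g1, g2 in combinations(sorted(groups), 2)
    }


# Invented toy data for one patent annotated by three groups.
groups = {
    "A": {(0, 7, "chemical"), (20, 28, "disease"), (40, 45, "target")},
    "B": {(0, 7, "chemical"), (20, 28, "disease")},
    "C": {(0, 7, "chemical"), (40, 45, "target"), (50, 60, "chemical")},
}

harmonized = harmonize(groups)
agreement = pairwise_agreement(groups)
```

With this toy data, `harmonized` holds the three spans that at least two groups agree on, and `agreement` maps each pair of groups to its F-score. A real evaluation might also credit partial span overlaps, which this exact-match sketch ignores.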