Conference: Data Integration in the Life Sciences

The Cinderella of Biological Data Integration: Addressing Some of the Challenges of Entity and Relationship Mining from Patent Sources

Abstract

Most of the global corpus of medicinal chemistry data is published only in patents. However, extracting these data from patent documents and subsequently integrating them with literature and database sources pose unique challenges. This work presents an investigation of an extensive full-text patent resource, including automated name-to-chemical structure conversion, licensed by AstraZeneca via a consortium arrangement with IBM. Our initial focus was identifying protein targets in patent titles linked to extracted bioactive compounds. We benchmarked target recognition strategies against target-assay-compound relationships manually curated from patents by GVKBIO. By analysing word frequencies and protein names, we assessed the false-negative problem of targets not specified in titles and the false-positive problem of non-target proteins appearing in titles. We also examined the time signals for selected target and non-target names by year of patent publication. Our results exemplify problems and some solutions for extracting data from this source.
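
The benchmarking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the synonym dictionary, the patent identifiers, the titles and the curated target sets are all invented for illustration, and simple dictionary matching stands in for whatever recognition strategies the paper actually evaluated. It shows how a target missing from a title becomes a false negative and how a non-target protein in a title becomes a false positive.

```python
import re
from collections import Counter

# Hypothetical protein-name dictionary: surface form -> canonical name.
# These entries are illustrative assumptions, not data from the paper.
TARGET_SYNONYMS = {
    "cyclooxygenase-2": "COX-2",
    "cox-2": "COX-2",
    "renin": "Renin",
    "albumin": "Albumin",  # a protein that may appear without being the target
}


def find_targets(title):
    """Return canonical names whose synonyms occur in the title."""
    lowered = title.lower()
    found = set()
    for synonym, canonical in TARGET_SYNONYMS.items():
        # Reject matches embedded in a longer alphanumeric token.
        pattern = r"(?<![a-z0-9])" + re.escape(synonym) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            found.add(canonical)
    return found


def benchmark(titles, curated):
    """Compare title-derived targets with curated targets per patent."""
    counts = Counter()
    for patent_id, title in titles.items():
        predicted = find_targets(title)
        expected = curated.get(patent_id, set())
        counts["tp"] += len(predicted & expected)  # found and curated
        counts["fp"] += len(predicted - expected)  # non-target protein in title
        counts["fn"] += len(expected - predicted)  # target absent from title
    return counts


# Invented toy records (hypothetical identifiers and titles).
titles = {
    "P1": "Pyrazole derivatives as COX-2 inhibitors",
    "P2": "Novel heterocyclic compounds and their use",
    "P3": "Albumin-binding renin inhibitors",
}
curated = {"P1": {"COX-2"}, "P2": {"Renin"}, "P3": {"Renin"}}

counts = benchmark(titles, curated)
print(dict(counts))
```

In this toy set, the second title names no protein at all, so its curated target counts as a false negative, while the third title mentions a second protein that the curated record does not list, which counts as a false positive.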