Published in Database: The Journal of Biological Databases and Curation

PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database


Abstract

This study proposes a text similarity model to support biocuration of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to identify the specific referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans newly published literature and proposes articles relevant to existing CDD records. Our approach to these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as from articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature and gave the pairs to curators for review. Through this exercise, we found that we could map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset reviewed by curators, we successfully provided references for 86% of the curator statements. In addition, we suggested new articles for curator review, and curators accepted 50% of them for addition to the reference list. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitute the CDD text similarity corpus.
This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high) doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
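
The abstract does not specify the similarity function used, so the following is only a minimal sketch of the general idea of pairing a curator-written statement with its top-matching candidate sentences. It assumes a plain bag-of-words cosine similarity, which is a stand-in and not the authors' model; the function names (`tokenize`, `cosine_similarity`, `top_matches`) and the example sentences are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence matches nothing
    return dot / (norm_a * norm_b)


def top_matches(statement, candidates, k=10):
    """Return up to k (score, sentence) pairs, highest score first."""
    query = Counter(tokenize(statement))
    scored = [
        (cosine_similarity(query, Counter(tokenize(sentence))), sentence)
        for sentence in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Made-up sentences, not taken from CDD or from any article.
    statement = "The domain binds ATP and catalyzes phosphate transfer."
    candidates = [
        "This kinase domain binds ATP and catalyzes transfer of a phosphate group.",
        "Samples were stored at 4 degrees before analysis.",
        "ATP binding was measured by calorimetry.",
    ]
    for score, sentence in top_matches(statement, candidates, k=10):
        print(f"{score:.2f}  {sentence}")
```

In a curation setting, the ranked list would be handed to a human reviewer, as in the evaluation described above; a raw word-overlap score like this one is a baseline at best and would likely miss paraphrases that a trained model could capture.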
