PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database

Abstract

This study proposes a text similarity model to support biocuration of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to identify the specific referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans newly published literature and proposes articles relevant to existing CDD records. Our approach to these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as from articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature and gave the pairs to curators for review. Through this exercise, we found that we could map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset reviewed by curators, we provided references for 86% of the curator statements. In addition, we suggested new articles for curator review, and curators accepted 50% of them for addition to the reference list. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitutes the CDD text similarity corpus. This corpus contains 5159 sentence pairs, each judged for similarity on a scale from 1 (low) to 5 (high) and doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.
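
The abstract does not describe the internals of the similarity model, so the following is only a minimal sketch of the general workflow it outlines: rank candidate sentences against a curator-written statement, keep the top 10, and measure how well two sets of similarity scores correlate. It uses plain TF-IDF cosine similarity as a stand-in, which is an assumption and not the authors' model. The function names (`top_matches`, `pearson`, and the helpers) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(statement, candidates, k=10):
    """Return the k candidates most similar to the statement, best first.

    Stand-in scoring only: TF-IDF cosine, not the model from the paper.
    """
    docs = [tokenize(statement)] + [tokenize(c) for c in candidates]
    vecs = tfidf_vectors(docs)
    scored = [(cosine(vecs[0], v), c) for v, c in zip(vecs[1:], candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Returns 0.0 when either list has zero variance (correlation undefined).
    """
    if len(x) != len(y) or not x:
        raise ValueError("score lists must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0
```

In this sketch, `top_matches` would be called once per curator statement with the sentences from the referenced articles as `candidates`, and `pearson` would be applied to two annotators' 1-to-5 scores for the same sentence pairs.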

Bibliographic details

Journal: Database: The Journal of Biological Databases and Curation
Authors: Rezarta Islamaj; W John Wilbur; Natalie Xie; Noreen R Gonzales; Narmada Thanki; Roxanne Yamashita; Chanjuan Zheng; Aron Marchler-Bauer; Zhiyong Lu
Year (volume), issue: 2019 (2019), -1
Year: 2019
Pages: baz064
Total pages: 13
Original format: PDF
Chinese Library Classification: Biology
Date indexed: 2022-08-17 12:16:35

Similar literature

1. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts [J]. Bethany R. Harris, Chih-Hsuan Wei, Donghui Li. Database. 2012, issue 40
2. Manual Curation in the Conserved Domain Database [J]. Gonzales N. R., Chitsaz F., Derbyshire M. K. Protein Science: A Publication of the Protein Society. 2016, issue Suppla1
3. CDD: a curated Entrez database of conserved domain alignments [J]. Aron Marchler-Bauer, John B. Anderson, Carol DeWeese-Scott. Nucleic Acids Research. 2003, issue 1
4. Text Mining Technologies for Database Curation [C]. Fabio Rinaldi. International Conference on Knowledge Discovery and Information Retrieval. 2014
5. Using Text Mining to Accelerate Automatic Curation of Biomedical Databases [D]. Jain, Suvir. 2015
6. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts [O]. Chih-Hsuan Wei, Bethany R. Harris, Donghui Li. 2012
7. CDD: a curated Entrez database of conserved domain alignments [O]. Marchler-Bauer, Aron, Anderson, John B., DeWeese-Scott, Carol. 2003
