Journal: Bioinformatics

Markov model recognition and classification of DNA/protein sequences within large text databases.

Abstract

MOTIVATION: Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM).

RESULTS: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter) was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also find a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.

AVAILABILITY: MM routine and datasets are available upon request.
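The abstract does not specify the model order, smoothing scheme, or training data, so the following is only a minimal sketch of how a character-level n-gram Markov model might separate sequence-like tokens from ordinary words. Everything in it is an assumption made for illustration: the class `NGramMarkovModel`, the function `classify`, the trigram order, add-one smoothing over a 26-letter alphabet, and the toy training lists. It is not the authors' MM routine.

```python
import math
from collections import defaultdict


class NGramMarkovModel:
    """Hypothetical character-level n-gram model with add-one smoothing."""

    def __init__(self, n=3, alphabet_size=26):
        self.n = n                          # n-gram length (assumed: trigrams)
        self.alphabet_size = alphabet_size  # assumed alphabet: letters A-Z
        self.context_counts = defaultdict(int)
        self.ngram_counts = defaultdict(int)

    def _ngrams(self, token):
        # Pad the start so the first characters also have a full context.
        padded = "^" * (self.n - 1) + token.upper()
        for i in range(len(token)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def train(self, tokens):
        for token in tokens:
            for context, symbol in self._ngrams(token):
                self.context_counts[context] += 1
                self.ngram_counts[(context, symbol)] += 1

    def log_likelihood(self, token):
        # Average per-character log-probability, so that tokens of
        # different lengths can be compared on the same scale.
        total = 0.0
        for context, symbol in self._ngrams(token):
            num = self.ngram_counts.get((context, symbol), 0) + 1
            den = self.context_counts.get(context, 0) + self.alphabet_size
            total += math.log(num / den)
        return total / max(len(token), 1)


def classify(token, models):
    """Return the label of the model that scores the token highest."""
    return max(models, key=lambda label: models[label].log_likelihood(token))


# Toy training data, invented for this example only.
dna_examples = ["ATGCGTACGTTAGC", "GGATCCGATTACA",
                "TTGACAGCTAGCTCAGT", "ACGTACGTGGCCAATT"]
text_examples = ["sequence", "binding", "primer", "database",
                 "abstract", "journal", "protein", "analysis"]

models = {"dna": NGramMarkovModel(), "text": NGramMarkovModel()}
models["dna"].train(dna_examples)
models["text"].train(text_examples)

for token in ["GATTACAGGCT", "proteins"]:
    print(token, classify(token, models))
```

With so little training data the sketch only separates tokens whose n-grams overlap the examples. The limited four-letter nucleic acid alphabet makes such overlap likely, which is consistent with the abstract's observation that nucleic acids are easier to identify than 1-letter peptide strings, whose larger alphabet resembles ordinary text.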