BMC Bioinformatics

Approaching the taxonomic affiliation of unidentified sequences in public databases – an example from the mycorrhizal fungi


Abstract

Background: During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for comparison in public databases such as GenBank increases exponentially, only a minuscule fraction of all organisms have been sequenced, leaving taxon sampling a momentous problem for sequence-based taxonomic identification. When querying GenBank with a set of unidentified sequences, a considerable proportion typically lack fully identified matches, forming an ever-mounting pile of sequences that the researcher will have to monitor manually in the hope that new, clarifying sequences have been submitted by other researchers. To alleviate these concerns, a project was initiated to automatically monitor select unidentified sequences in GenBank for taxonomic progress through repeated local BLAST searches. Mycorrhizal fungi, a field where species identification is often prohibitively complex, and the much-used ITS locus were chosen as the test bed.

Results: A Perl script package called emerencia is presented. On a regular basis, it downloads select sequences from GenBank, separates the identified sequences from those insufficiently identified, and performs BLAST searches between these two datasets, storing all results in an SQL database. On the accompanying web service, http://emerencia.math.chalmers.se, users can monitor the taxonomic progress of insufficiently identified sequences over time, either through active searches or by signing up for e-mail notification upon disclosure of better matches. Other search categories, such as listing all insufficiently identified sequences (and their present best fully identified matches) by publication, are also available.

Discussion: The ever-increasing use of DNA sequences for identification purposes largely falls back on the assumption that public sequence databases contain a thorough sampling of taxonomically well-annotated sequences. Taxonomy, held by some to be an old-fashioned trade, has accordingly never been more important. emerencia does not automate the taxonomic process, but it does allow researchers to focus their efforts elsewhere than countless manual BLAST runs and arduous sieving of BLAST hit lists. The emerencia system is available on an open source basis for local installation with any organism and gene group as targets.
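
The workflow described under Results (split sequences into identified and insufficiently identified sets, then track the best identified match over time in an SQL database) can be illustrated with a small Python sketch. This is not emerencia's code, which is written in Perl. The name-matching rule, the `best_hit` table layout, and all function names are assumptions made for illustration, and the BLAST step itself is left out: the bit scores are assumed to come from a separately run local BLAST search.

```python
import re
import sqlite3

# Hypothetical markers of an incomplete species-level name.
# emerencia's actual classification rules may differ.
_INSUFFICIENT = re.compile(
    r"\b(uncultured|unidentified|environmental|sp\.|cf\.|aff\.)",
    re.IGNORECASE,
)


def is_fully_identified(organism: str) -> bool:
    """Treat a name as identified if it has two or more words and no marker."""
    words = organism.split()
    return len(words) >= 2 and not _INSUFFICIENT.search(organism)


def partition(records):
    """Split {accession: organism name} into (identified, insufficient) dicts."""
    identified, insufficient = {}, {}
    for accession, organism in records.items():
        target = identified if is_fully_identified(organism) else insufficient
        target[accession] = organism
    return identified, insufficient


def open_db(path=":memory:"):
    """Open an SQLite database holding one row per recorded best hit."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS best_hit ("
        "query_acc TEXT, hit_acc TEXT, hit_name TEXT, "
        "bit_score REAL, run_date TEXT)"
    )
    return conn


def record_best_hit(conn, query_acc, hit_acc, hit_name, bit_score, run_date):
    """Store a best hit; return True if it beats every earlier stored score."""
    previous = conn.execute(
        "SELECT MAX(bit_score) FROM best_hit WHERE query_acc = ?",
        (query_acc,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO best_hit VALUES (?, ?, ?, ?, ?)",
        (query_acc, hit_acc, hit_name, bit_score, run_date),
    )
    conn.commit()
    return previous is None or bit_score > previous
```

In a setup like this, a `True` return from `record_best_hit` would be the natural trigger for the kind of e-mail notification the abstract mentions, since it signals that a better fully identified match has appeared since the previous run.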
