Journal: Bioinformatics

Automated download and clean-up of family-specific databases for kmer-based virus identification

Abstract

Here, we present an automated pipeline for Download Of NCBI Entries (DONE) and continuous updating of a local sequence database based on user-specified queries. The database can be built from either protein or nucleotide sequences, and can contain either all entries or complete genomes only. The pipeline can automatically clean the database by removing entries that match a database of user-specified sequence contaminants. The default contamination entries include sequences from NCBI's UniVec database of plasmids, marker genes and sequencing adapters, an E. coli genome, rRNA sequences, vectors and satellite sequences. Furthermore, duplicates are removed, and the database is automatically screened for sequences from green fluorescent protein, luciferase and antibiotic resistance genes, which might be present in some GenBank viral entries and could lead to false positives in virus identification. For use of the database, we also present an approach for dealing with possible human contamination. We show the applicability of DONE by downloading a virus database comprising 37 virus families. We observed an average increase of 16 776 new entries downloaded per month for the 37 families. In addition, we demonstrate the utility of a custom database compared to a standard reference database for classifying both simulated and real sequence data.
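
The clean-up step described in the abstract (removing duplicates and dropping entries that match known contaminants) can be illustrated with a minimal Python sketch. This is not DONE's actual implementation: the abstract does not say how matches are detected, so the sketch assumes exact shared k-mers as a stand-in, and the names `kmers`, `clean_database`, the example sequences and the value of `k` are all hypothetical.

```python
from typing import Dict, Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def clean_database(entries: Dict[str, str],
                   contaminants: Iterable[str],
                   k: int = 8) -> Dict[str, str]:
    """Drop exact duplicate sequences and entries sharing a k-mer with a contaminant.

    Assumption: a single shared k-mer counts as a "match"; a real pipeline
    would more likely use an alignment tool with score thresholds.
    """
    contaminant_kmers: Set[str] = set()
    for contaminant in contaminants:
        contaminant_kmers |= kmers(contaminant, k)

    seen: Set[str] = set()
    cleaned: Dict[str, str] = {}
    for entry_id, seq in entries.items():
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier entry
        seen.add(key)
        if kmers(key, k) & contaminant_kmers:
            continue  # shares at least one k-mer with a contaminant
        cleaned[entry_id] = seq
    return cleaned


# Hypothetical toy data: v2 duplicates v1, v4 contains the contaminant sequence.
entries = {
    "v1": "ATGGCGTACGTTAGC",
    "v2": "atggcgtacgttagc",
    "v3": "TTTTGGGGCCCCAAAATT",
    "v4": "ACGTACGGATCCGGAATTCAC",
}
contaminants = ["GGATCCGGAATTC"]
kept = clean_database(entries, contaminants, k=8)
print(sorted(kept))
```

With this toy input, only `v1` and `v3` survive. Requiring just one shared k-mer makes the filter deliberately aggressive, and a smaller `k` would make it more so, since chance matches become more likely.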
