Source: U.S. National Institutes of Health literature, Bioinformatics
SEED: efficient clustering of next-generation sequences


Abstract

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem, both for studying the population sizes of DNA/RNA molecules and for reducing redundancy in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters whose members can differ from their virtual center by up to three mismatches and three overhanging residues. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h, with time and memory use that scale linearly. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies of preprocessed data contained longer contigs than those of non-preprocessed data, as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

Availability: The SEED software can be downloaded for free.

Supplementary information: available at Bioinformatics online.
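The hash-table clustering idea described in the abstract can be illustrated with a minimal sketch. This is not the SEED implementation: the abstract does not give the block spaced seed design, so the sketch substitutes a plain pigeonhole scheme (a read is cut into `max_mismatches + 1` blocks, so two reads with at most that many mismatches must share at least one identical block). It also assumes equal-length reads, ignores overhanging residues, and uses the first unassigned read as the cluster center. The names `hamming`, `blocks`, and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def blocks(seq, n_blocks):
    """Yield (block_index, block_text) for n_blocks contiguous blocks of seq."""
    size = len(seq) // n_blocks
    for i in range(n_blocks):
        # The last block absorbs any remainder.
        end = (i + 1) * size if i < n_blocks - 1 else len(seq)
        yield i, seq[i * size:end]


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering; returns {center_read_index: [member_read_indices]}.

    Assumption: a simplified pigeonhole seed stands in for the paper's
    block spaced seeds, and overhangs are not handled.
    """
    n_blocks = max_mismatches + 1

    # Hash table: (block index, block text) -> indices of reads with that block.
    index = defaultdict(list)
    for rid, seq in enumerate(reads):
        for key in blocks(seq, n_blocks):
            index[key].append(rid)

    assigned = [None] * len(reads)
    clusters = {}
    for rid, seq in enumerate(reads):
        if assigned[rid] is not None:
            continue
        # An unassigned read becomes the center of a new cluster.
        assigned[rid] = rid
        members = [rid]
        # Candidates are reads sharing at least one identical block;
        # each candidate is verified with a full mismatch count.
        for key in blocks(seq, n_blocks):
            for other in index[key]:
                if (assigned[other] is None
                        and len(reads[other]) == len(seq)
                        and hamming(seq, reads[other]) <= max_mismatches):
                    assigned[other] = rid
                    members.append(other)
        clusters[rid] = members
    return clusters
```

Because candidates come only from shared hash buckets, each center is compared against a small subset of reads rather than the whole dataset, which is the general reason seed-based indexing scales better than all-against-all comparison.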
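For readers unfamiliar with the N50 statistic cited in the results, the following is a minimal sketch of its usual definition: the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` is hypothetical, and the abstract does not state which tool the authors used to compute it.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")
```

A larger N50 means that half of the assembled bases lie in longer contigs, which is why it is commonly used as a contiguity measure.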
