Conference: International Conference on Bioinformatics and Computational Biology

A framework for a general-purpose sequence compression pipeline: a centroid based compression


Abstract

DNA sequence data are accumulating at an overwhelming pace, outpacing the growth of disk storage capacity and creating enormous challenges for data storage, processing, and analysis. Taking advantage of the fact that two human genomes differ by less than 0.1%, we and other groups previously proposed a reference-based compression algorithm to compress genomic data. However, reference-based sequence compression only works when a reference genome is available. Many large-scale sequencing projects, such as metagenomics projects, do not have any reference genome readily available. Therefore, we need a compression method that can be applied in these cases. This project addresses the problem by introducing a centroid-based compression algorithm. The algorithm takes large-scale next generation sequencing data as input and clusters similar sequences into groups. Within each group a "centroid" sequence is identified, and the differences between each sequence and its respective centroid sequence are encoded. Results show that the method is advantageous when the dataset contains many redundant sequences, in particular with the high coverage of next generation sequencing data and metagenomics data. The framework developed here is for a general-purpose compression pipeline that can, in theory, be applied to many cases.
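The abstract does not give implementation details, so the following is only a minimal sketch of the general idea (cluster similar reads, keep one centroid per cluster, store each read as its differences from that centroid). It rests on several simplifying assumptions that are not from the paper: reads in a cluster have equal length, differences are substitutions only, clustering is greedy, and the "centroid" is simply the first read that opens a cluster. The names `diff_against`, `compress`, `decompress`, and `max_diffs` are hypothetical.

```python
from typing import List, Tuple

# One difference: (position in the centroid, base found in the read).
Diff = List[Tuple[int, str]]


def diff_against(centroid: str, read: str) -> Diff:
    """List substitution-only differences (assumes equal-length sequences)."""
    return [(i, b) for i, (a, b) in enumerate(zip(centroid, read)) if a != b]


def compress(reads: List[str], max_diffs: int = 2):
    """Greedy clustering sketch (an assumption, not the paper's method).

    Each read joins the first existing centroid it differs from by at most
    max_diffs substitutions; otherwise it becomes a new centroid.
    Returns (centroids, encoded), where encoded[i] = (centroid index, diffs).
    """
    centroids: List[str] = []
    encoded: List[Tuple[int, Diff]] = []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if len(centroid) != len(read):
                continue  # this sketch only compares equal-length sequences
            diffs = diff_against(centroid, read)
            if len(diffs) <= max_diffs:
                encoded.append((idx, diffs))
                break
        else:
            # No centroid was close enough: the read starts a new cluster.
            centroids.append(read)
            encoded.append((len(centroids) - 1, []))
    return centroids, encoded


def decompress(centroids: List[str], encoded) -> List[str]:
    """Rebuild every read by applying its stored diffs to its centroid."""
    reads = []
    for idx, diffs in encoded:
        bases = list(centroids[idx])
        for pos, base in diffs:
            bases[pos] = base
        reads.append("".join(bases))
    return reads


reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "TTTTGGGG", "TTTTGGGC"]
centroids, encoded = compress(reads)
restored = decompress(centroids, encoded)
```

In this toy input, five reads collapse to two centroids plus short difference lists, which illustrates why redundancy (for example, high coverage) favors the approach. A realistic pipeline would also need to handle insertions, deletions, variable read lengths, and quality scores, none of which this sketch attempts.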
