
Fast lossless compression via cascading Bloom filters



Abstract

Background: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.

Results: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.

Conclusions: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
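The encode/decode idea in the Results paragraph can be illustrated with a small Python sketch. This is not the BARCODE implementation; it is a minimal illustration under several assumptions that the abstract does not state: all reads have the same length, only the set of distinct reads is recovered (order and duplicate counts are ignored), reads that do not occur verbatim in the reference are stored as plain text, and the last level of the cascade is stored as an explicit list. The names `BloomFilter`, `windows`, `encode`, and `decode`, and the parameters `levels`, `bits_per_item`, and `num_hashes`, are hypothetical.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over ASCII strings, backed by a bytearray of bits."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def windows(reference, read_len):
    """All distinct read-length substrings of the reference."""
    return {reference[i:i + read_len]
            for i in range(len(reference) - read_len + 1)}


def encode(reads, reference, read_len, levels=3, bits_per_item=8, num_hashes=3):
    """Hash reads into a cascade of Bloom filters (no alignment step)."""
    candidates = windows(reference, read_len)
    unique = set(reads)
    # Assumption: reads absent from the reference are kept verbatim.
    leftovers = sorted(unique - candidates)

    filters = []
    to_store = unique & candidates      # level 1 stores the true reads
    to_test = candidates - to_store     # windows that are not reads
    for _ in range(levels):
        bf = BloomFilter(max(8, bits_per_item * len(to_store)), num_hashes)
        for s in to_store:
            bf.add(s)
        filters.append(bf)
        # The next level stores this level's false positives.
        false_pos = {s for s in to_test if s in bf}
        to_store, to_test = false_pos, to_store
    # The set that would feed the next filter is stored explicitly instead.
    return filters, sorted(to_store), leftovers


def decode(filters, explicit, leftovers, reference, read_len):
    """Recover the distinct reads by querying every reference window."""
    explicit = set(explicit)
    reads = set(leftovers)
    for w in windows(reference, read_len):
        for i, bf in enumerate(filters):
            if w not in bf:
                # Rejected by filter i (0-based): a read only if i is odd.
                is_read = (i % 2 == 1)
                break
        else:
            # Accepted by every filter: the explicit list flips the verdict.
            is_read = (len(filters) % 2 == 1) != (w in explicit)
        if is_read:
            reads.add(w)
    return reads
```

In this sketch the first filter alone would wrongly report some reference windows as reads; each later filter stores the false positives of the one before it, and the short explicit list resolves whatever ambiguity remains, so the set of distinct reads is reconstructed exactly. Calling `encode(reads, reference, read_len)` and passing its three return values to `decode` together with the same `reference` and `read_len` returns `set(reads)`.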
