Journal: Bioinformatics

SCALCE: boosting sequence compression algorithms using locally consistent encoding



Abstract

Motivation: High-throughput sequencing (HTS) platforms generate unprecedented amounts of data, which poses challenges for computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The large investment required for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed with general-purpose algorithms such as gzip. These algorithms are not designed for data generated by HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, namely its limited alphabet size and the high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis, provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a ‘boosting’ scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.

Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE-reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves on that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, using the reordering that SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to a 3.34-fold improvement in the compression rate and a 1.26-fold improvement in running time.
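The boosting idea in the abstract, reordering reads so that similar reads sit next to each other before a general-purpose compressor sees them, can be illustrated with a small sketch. This is not SCALCE's implementation: SCALCE derives its grouping from Locally Consistent Parsing, whereas the sketch below assumes a much simpler stand-in key (the lexicographically smallest k-mer of each read) and uses synthetic, error-free reads. The names `core_key`, `reorder_reads`, `gzip_size` and `demo`, and all parameter values, are hypothetical.

```python
import gzip
import random


def core_key(read: str, k: int = 8) -> str:
    """Assumed stand-in for a 'core': the smallest k-mer in the read.

    Requires len(read) >= k. SCALCE itself uses Locally Consistent
    Parsing to pick cores; this is only a simplified substitute.
    """
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder_reads(reads, k: int = 8):
    """Place reads that share a key next to each other."""
    return sorted(reads, key=lambda r: (core_key(r, k), r))


def gzip_size(reads) -> int:
    """Size in bytes of the gzip-compressed, newline-joined reads."""
    return len(gzip.compress("\n".join(reads).encode("ascii")))


def demo(seed: int = 0, genome_len: int = 200_000,
         n_reads: int = 20_000, read_len: int = 50):
    """Sample error-free reads from a random genome and compare sizes."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = (rng.randrange(genome_len - read_len + 1) for _ in range(n_reads))
    reads = [genome[s:s + read_len] for s in starts]
    return gzip_size(reads), gzip_size(reorder_reads(reads))


if __name__ == "__main__":
    original, reordered = demo()
    print(f"gzip, original order:  {original} bytes")
    print(f"gzip, reordered reads: {reordered} bytes")
```

Overlapping reads tend to share their smallest k-mer, so sorting on that key brings them within gzip's limited match window. The ratio this toy produces depends on the synthetic data and should not be compared with the factors reported in the abstract.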
