SCALCE: boosting sequence compression algorithms using locally consistent encoding

Hach Faraz; Numanagic Ibrahim; Alkan Can; Sahinalp S. Cenk

Abstract

Motivation: High-throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of the Sequence Read Archive hosted at the National Center for Biotechnology Information (NCBI), which holds most of the sequence data generated worldwide. Currently, most HTS data are compressed through general-purpose algorithms such as gzip. These algorithms are not designed for compressing data generated by the HTS platforms; for example, they do not take advantage of the specific nature of genomic sequence data, that is, limited alphabet size and high similarity among reads. Fast and efficient compression algorithms designed specifically for HTS data should be able to address some of the issues in data management, storage and communication. Such algorithms would also help with analysis, provided they offer additional capabilities such as random access to any read and indexing for efficient sequence similarity search. Here we present SCALCE, a 'boosting' scheme based on the Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.

Results: Our tests indicate that SCALCE can improve the compression rate achieved through gzip by a factor of 4.19 when the goal is to compress the reads alone. In fact, on SCALCE-reordered reads, gzip running time can improve by a factor of 15.06 on a standard PC with a single core and 6 GB memory. Interestingly, even the running time of SCALCE + gzip improves on that of gzip alone by a factor of 2.09. When compared with the recently published BEETL, which aims to sort the (inverted) reads in lexicographic order for improving bzip2, SCALCE + gzip provides up to 2.01 times better compression while improving the running time by a factor of 5.17. SCALCE also provides the option to compress the quality scores as well as the read names, in addition to the reads themselves. This is achieved by compressing the quality scores through order-3 Arithmetic Coding (AC) and the read names through gzip, using the reordering that SCALCE provides on the reads. This way, in comparison with gzip compression of the unordered FASTQ files (including reads, read names and quality scores), SCALCE (together with gzip and arithmetic encoding) can provide up to a 3.34-fold improvement in the compression rate and a 1.26-fold improvement in running time.
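
The reordering idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: in place of the cores produced by Locally Consistent Parsing, it uses the lexicographically smallest k-mer of each read as a hypothetical stand-in, and it runs on simulated reads from a random genome. All function names and parameters (`min_kmer`, `reorder_reads`, `simulate_reads`, `k=12`) are assumptions made for illustration only. The point is simply that placing overlapping reads next to each other lets gzip find long matches inside its limited window.

```python
import gzip
import random


def min_kmer(read: str, k: int = 12) -> str:
    """Smallest k-mer of a read; a hypothetical stand-in for an LCP core."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def reorder_reads(reads, k: int = 12):
    """Sort reads so that reads sharing the same smallest k-mer become adjacent."""
    return sorted(reads, key=lambda r: (min_kmer(r, k), r))


def gzip_size(reads) -> int:
    """Size in bytes of the gzip-compressed, newline-joined reads."""
    return len(gzip.compress("\n".join(reads).encode("ascii")))


def simulate_reads(genome_len=100_000, read_len=100, n_reads=10_000, seed=0):
    """Sample fixed-length, error-free reads from a random genome (toy data)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


reads = simulate_reads()
reordered = reorder_reads(reads)

original_size = gzip_size(reads)
reordered_size = gzip_size(reordered)
print(f"gzip, original order:  {original_size} bytes")
print(f"gzip, reordered reads: {reordered_size} bytes")
```

The sizes printed depend on the simulated data and say nothing about the factors reported in the abstract, which were measured on real HTS datasets with the actual SCALCE tool.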

Bibliographic details

Source: Bioinformatics, 2012, issue 23 (7 pages)
Authors: Hach Faraz; Numanagic Ibrahim; Alkan Can; Sahinalp S. Cenk
Original format: PDF
Language: English
Chinese Library Classification: Bioengineering (biotechnology)

