
smallWig: parallel compression of RNA-seq WIG files


Abstract

Contributions: We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms.

Motivation: The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates.

Results: We tested different variants of the smallWig compression algorithm on a number of integer- and real- (floating point) valued RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3–17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55% represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2–3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2–5.2 times.

Availability and implementation: The smallWig software can be downloaded from: , , .

Contact: 

Supplementary information: are available at Bioinformatics online.
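The block-based random query mode and the per-chromosome parallelism described in the abstract can be illustrated with a minimal Python sketch. This is not the smallWig implementation: zlib stands in for the paper's entropy coders, the names `BLOCK_SIZE`, `compress_blocks`, `query` and `compress_genome` are hypothetical, and the tracks are assumed to be lists of non-negative integers below 2^32 (real-valued tracks are not handled).

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024  # values per block; hypothetical tuning parameter


def compress_blocks(values, block_size=BLOCK_SIZE):
    """Compress a track as independent blocks (zlib is a stand-in coder).

    Each entry is (compressed_bytes, block_sum). Keeping a per-block sum
    lets simple summary statistics be answered without decompression.
    """
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        raw = struct.pack(f"<{len(chunk)}I", *chunk)  # 32-bit unsigned ints
        blocks.append((zlib.compress(raw), sum(chunk)))
    return blocks


def query(blocks, start, end, block_size=BLOCK_SIZE):
    """Return values[start:end], decompressing only overlapping blocks."""
    out = []
    if start >= end or not blocks:
        return out
    first = start // block_size
    last = min((end - 1) // block_size, len(blocks) - 1)
    for b in range(first, last + 1):
        raw = zlib.decompress(blocks[b][0])
        chunk = struct.unpack(f"<{len(raw) // 4}I", raw)
        lo = max(start - b * block_size, 0)
        hi = min(end - b * block_size, len(chunk))
        out.extend(chunk[lo:hi])
    return out


def compress_genome(tracks, workers=4):
    """Compress each chromosome's track independently, in parallel."""
    names = list(tracks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(compress_blocks, (tracks[n] for n in names))
        return dict(zip(names, results))
```

Because every block is compressed on its own, a range query touches only the blocks it overlaps, at the cost of a somewhat worse compression rate than coding the whole track at once. That trade-off is the kind of overhead the abstract reports for the random query mode.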
