Journal: GigaScience

Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

Abstract

Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available.

Findings: We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results.

Conclusion: We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.
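To make the measures named in the abstract concrete, the sketch below times compression and decompression and computes a compression ratio for three general-purpose codecs from the Python standard library. It is not the authors' benchmark harness. The helper names (`make_fasta`, `benchmark`, `CODECS`), the synthetic random-DNA input, and the choice of codecs are assumptions made only for illustration, and memory use is not measured.

```python
import bz2
import gzip
import lzma
import random
import time


def make_fasta(n_bases=200_000, line_width=60, seed=0):
    """Build a synthetic single-record FASTA file as bytes.

    The sequence is uniformly random DNA, so it is harder to compress
    than real genomes; it is a stand-in for illustration only.
    """
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n_bases))
    lines = [seq[i:i + line_width] for i in range(0, n_bases, line_width)]
    return (">synthetic_sequence\n" + "\n".join(lines) + "\n").encode("ascii")


# Hypothetical codec table: name -> (compress function, decompress function).
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data, codecs):
    """Return one result dict per codec, checking that the round trip is lossless."""
    results = []
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name}: decompressed data differs from input")
        results.append({
            "name": name,
            "original_bytes": len(data),
            "compressed_bytes": len(packed),
            "ratio": len(data) / len(packed),  # higher means stronger compression
            "compress_seconds": t1 - t0,
            "decompress_seconds": t2 - t1,
        })
    return results


if __name__ == "__main__":
    for row in benchmark(make_fasta(), CODECS):
        print(f"{row['name']:5s} ratio={row['ratio']:.2f} "
              f"compress={row['compress_seconds']:.3f}s "
              f"decompress={row['decompress_seconds']:.3f}s")
```

Timings from a single run like this are noisy. A more careful comparison would repeat each measurement, use real sequence data, and include the specialized sequence compressors that the benchmark covers.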
