Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run-lengths of the fingerprints. After reordering the fingerprint components in order of decreasing frequency, the indices are monotone increasing and the run-lengths are quasi-monotone increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new, efficient, lossless compression algorithms for monotone integer sequences: Monotone Value (MOV) Coding and Monotone Length (MOL) Coding. In contrast with lossy systems that use 1,024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run-lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, the uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark datasets of drug-like molecules.
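The idea of coding run-lengths with an integer entropy code can be illustrated with a minimal Python sketch. It applies a plain Elias gamma code to the gaps between sorted on-bit indices of a sparse binary fingerprint, decodes the bit string back to the same indices, and computes the Tanimoto similarity on the decoded indices. This is not the MOV or MOL coding described in the abstract, which additionally exploits monotonicity after frequency reordering; the function names and the toy fingerprints are assumptions made for illustration only.

```python
from typing import List, Sequence


def elias_gamma_encode(n: int) -> str:
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]  # binary digits of n, leading 1 included
    return "0" * (len(binary) - 1) + binary


def encode_fingerprint(on_bits: Sequence[int]) -> str:
    """Encode sorted, distinct on-bit indices as gamma-coded gaps."""
    pieces = []
    previous = -1  # so the first gap is index + 1, which is always >= 1
    for index in on_bits:
        pieces.append(elias_gamma_encode(index - previous))
        previous = index
    return "".join(pieces)


def decode_fingerprint(code: str) -> List[int]:
    """Invert encode_fingerprint: recover the on-bit indices exactly."""
    on_bits = []
    previous = -1
    pos = 0
    while pos < len(code):
        zeros = 0
        while code[pos] == "0":  # unary prefix gives the value's bit length - 1
            zeros += 1
            pos += 1
        gap = int(code[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        previous += gap
        on_bits.append(previous)
    return on_bits


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity of two binary fingerprints given as on-bit indices."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    # Returning 0.0 for two empty fingerprints is a convention chosen here.
    return len(set_a & set_b) / union if union else 0.0


# Hypothetical toy fingerprints (on-bit positions), not data from the paper.
fp_a = [0, 3, 4, 10]
fp_b = [3, 4, 7]
code_a = encode_fingerprint(fp_a)
code_b = encode_fingerprint(fp_b)
similarity = tanimoto(decode_fingerprint(code_a), decode_fingerprint(code_b))
```

Because decoding recovers the original indices exactly, the similarity computed after the round trip equals the similarity of the uncompressed fingerprints, which is the property the abstract relies on for retrieval.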

Bibliographic details

Journal: Journal of Chemical Information and Modeling
Authors: Pierre Baldi; Ryan W. Benz; Daniel S. Hirschberg; S. Joshua Swamidass
Year, volume (issue): 2007, 47(6)
Pages: 2098–2109
Total pages: 31
Original format: PDF
