Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, in which each component of the vector records the presence, or the number of occurrences, of a particular combinatorial feature in the molecular graph, such as a labeled path or a labeled tree. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run-lengths of the fingerprints. After the fingerprint components are reordered by decreasing frequency, the indices are monotone increasing and the run-lengths are quasi-monotone increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new, efficient, lossless compression algorithms for monotone integer sequences: Monotone Value (MOV) Coding and Monotone Length (MOL) Coding. In contrast with lossy systems that use 1,024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run-lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, the uncompressed similarity (e.g. Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark datasets of drug-like molecules.
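
To make the run-length idea concrete, the following is a minimal Python sketch of plain Elias gamma coding applied to the gaps between successive set bits of a binary fingerprint, followed by an exact Tanimoto computation on the decoded indices. It is not the MOV or MOL coding described in the abstract, whose details are not given here. The function names, the definition of a run-length as the gap between consecutive set indices, and the use of `'0'`/`'1'` strings instead of packed bits are all assumptions made for readability.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a '0'/'1' string.

    The code is (len(binary) - 1) zeros followed by the binary form of n.
    """
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_fingerprint(indices: list[int]) -> str:
    """Encode sorted, distinct set-bit indices as gamma-coded gaps.

    Assumption: a 'run-length' is the gap between consecutive set
    indices, with the first gap measured from position -1 so that
    every gap is at least 1.
    """
    chunks = []
    previous = -1
    for index in indices:
        chunks.append(elias_gamma_encode(index - previous))
        previous = index
    return "".join(chunks)


def decode_fingerprint(bits: str) -> list[int]:
    """Invert encode_fingerprint and recover the set-bit indices."""
    indices = []
    position = 0
    previous = -1
    while position < len(bits):
        # Count the leading zeros that give the length of the value.
        zeros = 0
        while bits[position] == "0":
            zeros += 1
            position += 1
        gap = int(bits[position:position + zeros + 1], 2)
        position += zeros + 1
        previous += gap
        indices.append(previous)
    return indices


def tanimoto(a: list[int], b: list[int]) -> float:
    """Exact Tanimoto similarity of two binary fingerprints given as index lists."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    # Convention assumed here: two empty fingerprints are fully similar.
    return len(set_a & set_b) / union if union else 1.0


# Two hypothetical fingerprints, given by the positions of their set bits.
code_x = encode_fingerprint([0, 3, 4, 10])
code_y = encode_fingerprint([3, 4, 7])
similarity = tanimoto(decode_fingerprint(code_x), decode_fingerprint(code_y))
```

Because decoding recovers the original indices exactly, the similarity computed here equals the similarity of the uncompressed fingerprints, which is the property the abstract contrasts with lossy modulo folding. Small gaps receive short codewords, so a code of this kind benefits when frequent components are placed first.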
