Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time

Swamidass SJ; Baldi P

Abstract

Chemical fingerprints are used to represent chemical molecules by recording the presence or absence, or by counting the number of occurrences, of particular features or substructures, such as labeled paths in the 2D graph of bonds, of the corresponding molecule. These fingerprint vectors are used to search large databases of small molecules, currently containing millions of entries, using various similarity measures, such as the Tanimoto or Tversky measures and their variants. Here, we derive simple bounds on these similarity measures and show how these bounds can be used to considerably reduce the subset of molecules that need to be searched. We consider both the case of single-molecule and multiple-molecule queries, as well as queries based on fixed similarity thresholds or aimed at retrieving the top K hits. We study the speedup as a function of query size and distribution, fingerprint length, similarity threshold, and database size |D| and derive analytical formulas that are in excellent agreement with empirical values. The theoretical considerations and experiments show that this approach can provide linear speedups of one or more orders of magnitude in the case of searches with a fixed threshold, and achieve sublinear speedups in the range of O(|D|^0.6) for the top K hits in current large databases. This pruning approach yields subsecond search times across the 5 million compounds in the ChemDB database, without any loss of accuracy.
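
The pruning idea in the abstract can be illustrated with a minimal sketch. It assumes binary fingerprints stored as non-negative Python integers and uses one bound that follows directly from the definition of the Tanimoto measure: for fingerprints with a and b bits set, the intersection has at most min(a, b) bits and the union at least max(a, b) bits, so the similarity is at most min(a, b) / max(a, b). The abstract does not spell out which bounds the paper derives, and the function names, the binning-by-bit-count layout, and the convention for two empty fingerprints are all assumptions made for illustration, not the authors' implementation.

```python
import heapq
from collections import defaultdict


def popcount(x: int) -> int:
    """Number of bits set in a non-negative integer fingerprint."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |A and B| / |A or B|."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return popcount(a & b) / union


def count_bound(na: int, nb: int) -> float:
    """Upper bound on the Tanimoto similarity using only the two bit counts."""
    if na == 0 and nb == 0:
        return 1.0
    return min(na, nb) / max(na, nb)


def build_bins(fingerprints):
    """Group database indices by bit count so whole bins can be skipped."""
    bins = defaultdict(list)
    for idx, fp in enumerate(fingerprints):
        bins[popcount(fp)].append(idx)
    return bins


def threshold_search(query, fingerprints, bins, t):
    """Return every (index, similarity) with similarity >= t, exactly."""
    nq = popcount(query)
    hits = []
    for nb, members in bins.items():
        if count_bound(nq, nb) < t:
            continue  # no fingerprint in this bin can reach the threshold
        for idx in members:
            s = tanimoto(query, fingerprints[idx])
            if s >= t:
                hits.append((idx, s))
    return sorted(hits)


def top_k_search(query, fingerprints, bins, k):
    """Return the k best (similarity, index) pairs, best first."""
    if k <= 0:
        return []
    nq = popcount(query)
    # Visit the most promising bins first so the running cutoff rises quickly.
    order = sorted(bins, key=lambda nb: count_bound(nq, nb), reverse=True)
    heap = []  # min-heap; heap[0] holds the current k-th best similarity
    for nb in order:
        if len(heap) == k and count_bound(nq, nb) <= heap[0][0]:
            break  # this bin and all later ones cannot improve the result
        for idx in bins[nb]:
            s = tanimoto(query, fingerprints[idx])
            if len(heap) < k:
                heapq.heappush(heap, (s, idx))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, idx))
    return sorted(heap, reverse=True)
```

Because a bin is skipped only when its bound rules out every member, both searches return the same similarity values as a full scan; only the number of fingerprints actually compared changes. When several fingerprints tie at the K-th similarity, which of the tied indices is kept may differ from a brute-force ranking.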

Bibliographic details

Source: Journal of chemical information and modeling, 2007, Issue 2, 16 pages
Authors: Swamidass SJ; Baldi P
Original format: PDF
Language: English
Chinese Library Classification: Chemistry
Keywords: SMALL MOLECULES; SIMILARITY; DATABASE; STRINGS; KERNELS
