Journal of Chemical Information and Modeling
Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time

Abstract

Chemical fingerprints are used to represent chemical molecules by recording the presence or absence, or by counting the number of occurrences, of particular features or substructures, such as labeled paths in the 2D graph of bonds, of the corresponding molecule. These fingerprint vectors are used to search large databases of small molecules, currently containing millions of entries, using various similarity measures, such as the Tanimoto or Tversky measures and their variants. Here, we derive simple bounds on these similarity measures and show how these bounds can be used to considerably reduce the subset of molecules that need to be searched. We consider both the case of single-molecule and multiple-molecule queries, as well as queries based on fixed similarity thresholds or aimed at retrieving the top K hits. We study the speedup as a function of query size and distribution, fingerprint length, similarity threshold, and database size |D| and derive analytical formulas that are in excellent agreement with empirical values. The theoretical considerations and experiments show that this approach can provide linear speedups of one or more orders of magnitude in the case of searches with a fixed threshold, and achieve sublinear speedups in the range of O(|D|^0.6) for the top K hits in current large databases. This pruning approach yields subsecond search times across the 5 million compounds in the ChemDB database, without any loss of accuracy.
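
The abstract does not spell out the bounds, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses one simple bound on the Tanimoto measure for binary fingerprints: if A and B have a and b bits set, the intersection has at most min(a, b) bits and the union at least max(a, b), so T(A, B) <= min(a, b) / max(a, b). Whether this matches the paper's exact formulation is an assumption. Fingerprints are modeled as Python integers used as bit sets, and the helper names (`popcount`, `build_index`, `threshold_search`) are hypothetical.

```python
from collections import defaultdict


def popcount(fp):
    """Number of bits set in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def build_index(fingerprints):
    """Group fingerprints into buckets keyed by their bit count."""
    index = defaultdict(list)
    for fp in fingerprints:
        index[popcount(fp)].append(fp)
    return index


def threshold_search(query, index, t):
    """Return (hits, compared) for Tanimoto similarity >= t (assumes t > 0).

    hits is a list of (fingerprint, similarity) pairs; compared counts how
    many fingerprints were actually scored after pruning.
    """
    a = popcount(query)
    hits = []
    compared = 0
    for b, bucket in index.items():
        # Bit-count bound: T <= min(a, b) / max(a, b). If even this upper
        # bound is below t, nothing in the bucket can be a hit.
        if min(a, b) < t * max(a, b):
            continue
        for fp in bucket:
            compared += 1
            inter = popcount(query & fp)
            union = a + b - inter
            if union and inter >= t * union:
                hits.append((fp, inter / union))
    return hits, compared


# Toy database (illustrative values only, not real molecular fingerprints).
database = [0b1111, 0b1110, 0b0001, 0b11111111, 0b1100]
index = build_index(database)
hits, compared = threshold_search(0b1111, index, 0.7)
print(hits, compared)
```

In this toy run only the buckets with 3 and 4 bits survive the bound, so two of the five fingerprints are scored, and the result is the same as an exhaustive scan. The pruning is exact because the bound is an upper bound on the true similarity, which is consistent with the abstract's claim of no loss of accuracy.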