Journal of Cheminformatics

Consistency of systematic chemical identifiers within and between small-molecule databases

Abstract

Background: Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure–property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although these are interchangeable for many cheminformatics purposes, no studies have examined the inconsistency among these structure identifiers that arises from differing approaches to data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.

Results: The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).

Conclusions: We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence, especially when merging data and when systematic identifiers are used as a key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers from their MOL representation, after applying well-defined and documented chemistry standardisation rules to all compounds, can dramatically increase internal consistency.
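The kind of comparison reported in the Results can be illustrated with a small sketch. It is not the authors' pipeline, which the abstract does not describe in detail. It assumes that both identifiers of each record have already been converted to InChI strings by some toolkit, and it approximates "disregarding stereochemistry" by dropping the InChI stereo layers (`/b`, `/t`, `/m`, `/s`). The function names `strip_stereo` and `consistency` are hypothetical.

```python
import re

# InChI layers that encode stereochemistry:
# /b (double-bond), /t and /m (tetrahedral), /s (stereo type).
STEREO_LAYER = re.compile(r"/[btms][^/]*")


def strip_stereo(inchi: str) -> str:
    """Return the InChI string with its stereo layers removed."""
    return STEREO_LAYER.sub("", inchi)


def consistency(pairs, ignore_stereo=False):
    """Percentage of (identifier_a, identifier_b) pairs that match exactly.

    `pairs` holds two InChI strings per record, e.g. one derived from the
    MOL file and one derived from the stored SMILES (an assumption here).
    """
    if not pairs:
        raise ValueError("no pairs to compare")
    normalise = strip_stereo if ignore_stereo else (lambda s: s)
    matches = sum(normalise(a) == normalise(b) for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Illustrative records: L- vs D-alanine (differ only in stereo) and ethanol.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

records = [(L_ALA, D_ALA), (ETHANOL, ETHANOL)]
print(consistency(records))                      # stereo-sensitive
print(consistency(records, ignore_stereo=True))  # stereo-insensitive
```

In this toy data the two alanine enantiomers disagree only in their stereo layers, so the match rate rises once those layers are ignored, which mirrors the direction of the effect reported above. A real study would also need structure standardisation before identifiers are generated.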
