Visualization of very large high-dimensional data sets as minimum spanning trees

Daniel Probst; Jean-Louis Reymond

Abstract

The chemical sciences are producing an unprecedented number of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrarily high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most widely used chemistry data sets, including molecule databases such as ChEMBL, FDB17, the Natural Products Atlas, and DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.
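
The idea named in the title, representing a data set as a minimum spanning tree, can be illustrated with a small sketch. This is not the TMAP implementation, and the abstract does not describe its pipeline in detail. The sketch assumes a brute-force k-nearest-neighbor graph and Kruskal's algorithm as stand-ins, and the function names and toy points are hypothetical. It omits the final step of laying the tree out in two dimensions.

```python
import math


def knn_edges(points, k):
    """Return undirected k-nearest-neighbor edges as sorted (distance, i, j) tuples."""
    edges = set()
    for i, p in enumerate(points):
        # Brute-force neighbor search (assumption: fine for a toy example only).
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            edges.add((d, min(i, j), max(i, j)))
    return sorted(edges)


def minimum_spanning_forest(n, edges):
    """Kruskal's algorithm with union-find; returns the chosen (i, j) edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:  # edges arrive sorted by increasing distance
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Hypothetical toy data: two well-separated clusters in three dimensions.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
]
edges = knn_edges(points, k=3)
tree = minimum_spanning_forest(len(points), edges)
print(tree)
```

In this toy case the tree keeps the short edges inside each cluster and joins the two clusters with a single long edge, which is the kind of local and global structure a tree-based view is meant to expose. If the neighbor graph were disconnected, the same routine would return a spanning forest rather than a single tree.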

