Methods are disclosed for clustering biological samples and other objects using a grand canonical ensemble. A biological sample is characterized by data attributes from varying sources (e.g., NGS, other types of high-dimensional cytometric data, observed disease state) and of varying data types (e.g., Boolean, continuous, or coded sets), organized as vectors (as many as 10⁹) having as many as 10⁶, 10⁹, or more components. The biological samples or observational data are modeled as particles of a grand canonical ensemble, which can be variably distributed among partitions. A pseudo-energy is defined as a measure of inverse similarity between the particles. Minimizing the pseudo-energy of the grand canonical ensemble corresponds to grouping maximally similar particles in each partition, thereby determining clusters of the biological samples. The sample clusters can be used for feature discovery, gene and pathway identification, development of cell-based therapeutics, or other purposes. Variations and additional applications are disclosed.
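
The idea of minimizing a pseudo-energy over partitions can be illustrated with a minimal sketch. It is not the disclosed method: the abstract does not specify the pseudo-energy or the minimization procedure. The sketch assumes squared Euclidean distance as the inverse-similarity measure, a plain greedy descent as the minimizer, and a hypothetical `occupancy_cost` per non-empty partition (loosely in the spirit of a chemical-potential term), added only so that placing every sample in its own partition is not trivially optimal. All function and parameter names are illustrative.

```python
from itertools import combinations


def pair_energy(a, b):
    """Assumed pseudo-energy of a pair: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_energy(particles, assignment, n_partitions, occupancy_cost):
    """Sum of within-partition pair energies plus a cost per occupied partition."""
    energy = 0.0
    for p in range(n_partitions):
        members = [i for i, a in enumerate(assignment) if a == p]
        energy += sum(
            pair_energy(particles[i], particles[j])
            for i, j in combinations(members, 2)
        )
        if members:
            energy += occupancy_cost  # hypothetical term, not from the source
    return energy


def cluster(particles, n_partitions, occupancy_cost=1.0, max_sweeps=100):
    """Greedy descent: move each particle to the partition that lowers the energy most."""
    assignment = [0] * len(particles)  # every particle starts in partition 0
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(particles)):
            current = assignment[i]
            best_p = current
            best_e = total_energy(particles, assignment, n_partitions, occupancy_cost)
            for p in range(n_partitions):
                if p == current:
                    continue
                assignment[i] = p  # trial move
                e = total_energy(particles, assignment, n_partitions, occupancy_cost)
                if e < best_e - 1e-12:  # accept only strict improvements
                    best_p, best_e = p, e
            assignment[i] = best_p  # keep the best move (or stay put)
            changed = changed or best_p != current
        if not changed:  # a full sweep with no move: local minimum reached
            break
    return assignment


# Toy data: two tight groups of three "samples" with two attributes each.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = cluster(samples, n_partitions=3)
```

On this toy input the descent leaves one of the three available partitions empty, which illustrates how the number of particles per partition is allowed to vary. The sketch recomputes the full energy for each trial move, which is only practical for tiny inputs; data at the scale described above would need incremental energy updates and a stochastic search rather than this exhaustive sweep.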