This paper describes treeClust, an R package that produces dissimilarities useful for clustering. These dissimilarities arise from a set of classification or regression trees, one for each variable in the data, with that variable acting in turn as the response and all others as predictors. This use of trees produces dissimilarities that are insensitive to scaling, benefit from automatic variable selection, and appear to perform well. The software allows a number of options to be set, affecting the set of objects returned in the call; the user can also specify a clustering algorithm and, optionally, return only the clustering vector. The package can also generate a numeric data set whose inter-point distances relate to the treeClust ones; such a numeric data set can be much smaller than the vector of inter-point dissimilarities, a useful feature for large data sets.