A document partitioning (flat clustering) method clusters documents with high accuracy and accurately estimates the number of clusters in the document corpus (i.e., it provides a model selection capability). To cluster the given document corpus accurately, a richer feature set is employed to represent each document, and the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm is used to conduct an initial document clustering. From this initial result, a set of discriminative features is identified for each cluster, and the initially obtained document clusters are refined by voting on the cluster label for each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. Furthermore, a model selection capability is achieved by introducing randomness in the cluster initialization stage, and then finding a value C for the number of clusters N such that running the document clustering process a fixed number of times with N = C yields sufficiently similar results.
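The self-refinement loop described above (identify discriminative features, vote on cluster labels, repeat until convergence) can be illustrated with a minimal Python sketch. It is not the method as claimed: the abstract does not say how discriminative features are scored, so the sketch assumes a simple dominance ratio (`min_ratio`, a hypothetical parameter) and treats documents as plain lists of feature strings. The initial labels stand in for the output of the GMM+EM step, which is not implemented here. The function names `discriminative_features`, `vote`, and `refine` are illustrative only.

```python
from collections import Counter


def discriminative_features(docs, labels, min_ratio=0.8):
    """Map each feature to the one cluster that dominates its occurrences.

    Assumption: a feature is 'discriminative' for a cluster when at least
    `min_ratio` of the documents containing it carry that cluster's label.
    """
    per_feature = {}  # feature -> Counter of cluster labels
    for feats, label in zip(docs, labels):
        for f in set(feats):
            per_feature.setdefault(f, Counter())[label] += 1
    owner = {}
    for f, counts in per_feature.items():
        label, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= min_ratio:
            owner[f] = label
    return owner


def vote(docs, labels, owner):
    """Relabel each document by majority vote of its discriminative features."""
    new_labels = []
    for feats, old in zip(docs, labels):
        votes = Counter(owner[f] for f in set(feats) if f in owner)
        if not votes:
            new_labels.append(old)  # no discriminative feature: keep label
            continue
        top = max(votes.values())
        winners = [lab for lab, v in votes.items() if v == top]
        # On a tie, prefer the current label; otherwise the smallest label.
        new_labels.append(old if old in winners else min(winners))
    return new_labels


def refine(docs, labels, max_iter=20, min_ratio=0.8):
    """Alternate feature identification and voting until labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        owner = discriminative_features(docs, labels, min_ratio)
        new_labels = vote(docs, labels, owner)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy corpus: the third document is initially assigned to the wrong cluster.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["match", "team", "goal"],
    ["stock", "market", "bank"],
    ["bank", "stock", "rate"],
]
initial = [0, 0, 1, 1, 1]
refined = refine(docs, initial, min_ratio=0.6)
print(refined)
```

In this toy corpus the features "goal" and "team" occur mostly in cluster 0, so they outvote the third document's initial assignment and move it there. A second pass changes nothing, which is the convergence condition.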