Classic information retrieval (IR) systems rely on ranking algorithms to serve users ordered lists of documents in response to search queries. Sometimes, however, users do not have very specific information needs or cannot accurately articulate their information needs in queries. Cluster-based IR systems, such as those based on the Scatter/Gather paradigm, have been used to help users clarify their information needs and promote learning through interactive document clustering and summarization. These systems have the potential to help users browse large document collections and explore topics. However, their effectiveness is often constrained by poor clustering quality, ambiguous cluster labels, and inefficiency in processing large-scale data sets.

In interactive clustering, term distributions vary across clusters and subsets of a collection. Classic TF*IDF (term frequency * inverse document frequency) term weighting, especially IDF, which counts document frequency over the entire (global) collection, does not account for the shifted term distributions within a (local) subset and often fails to identify the most informative terms in that subset. To improve clustering quality and produce meaningful labels, we propose two novel term weighting schemes, namely TF*ICDF and DF*LIG. TF*ICDF, or Term Frequency * Inverse within-Cluster Document Frequency, integrates local subset information into term weighting. It outperforms TF*IDF in several aspects of clustering and labeling across various configurations.

In addition, we propose Least Information Gain (LIG), based on the least information theory. Like Information Gain (IG), which is based on KL divergence, LIG measures the amount of information required for a change in a probability distribution. Based on LIG, we develop the DF*LIG method for cluster labeling.
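To illustrate the idea behind TF*ICDF, the following is a minimal sketch in which the inverse document frequency is computed within a cluster rather than over the global collection. The log-smoothed form `tf(t, d) * log(1 + |C| / df_C(t))` is an assumption for illustration; the exact formulation in the dissertation may differ.

```python
import math
from collections import Counter

def tf_icdf(doc, cluster):
    """Sketch of TF*ICDF weighting.

    Term frequency in a document is scaled by an inverse document
    frequency computed *within* the document's cluster, so terms common
    across the whole cluster are down-weighted locally.

    doc: list of terms; cluster: list of documents (each a list of terms).
    The log-smoothed inverse form below is an illustrative assumption.
    """
    tf = Counter(doc)
    n = len(cluster)
    # Within-cluster document frequency: number of cluster documents
    # containing each term at least once.
    df = Counter(t for d in cluster for t in set(d))
    return {t: tf[t] * math.log(1 + n / df[t]) for t in tf}

cluster = [["spark", "cluster", "label"],
           ["spark", "affinity"],
           ["spark", "label", "term"]]
weights = tf_icdf(cluster[0], cluster)
# "cluster" (unique to this document) outweighs "spark",
# which appears in every document of the cluster.
```

A term appearing in every document of the cluster carries little discriminative information for that cluster, so its weight is the smallest even when its raw term frequency matches that of rarer terms.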
With DF*LIG, terms that carry more information about the contents of clusters are chosen as labels, resulting in better coverage, overlap, and precision than DF*IG. When combined with TF*ICDF for term weighting and clustering, DF*LIG produces more representative, distinctive, and accurate labels than when it is combined with TF*IDF.

To improve clustering efficiency and support data-intensive processing, we develop distributed versions of the TF*ICDF and DF*LIG algorithms, as well as a parallel clustering algorithm named Pruned Affinity Propagation (PAP), in the Spark framework. The proposed algorithms process large-scale data sets efficiently by taking advantage of the computational capabilities of individual processors and nodes. The distributed TF*ICDF and DF*LIG methods scale very well: their efficiency improves significantly as the number of processors increases. Compared with the original affinity propagation algorithm, PAP achieves much higher efficiency while maintaining strong effectiveness. Results also show that the execution time of PAP drops greatly as processors are added and remains competitive for large numbers of documents, indicating its scalability.

With the support of these effective and scalable methods for text clustering and cluster labeling, a cluster-based IR system can be greatly improved in its ability to dynamically identify key features, produce meaningful clusters, and generate representative terms as labels. With the ability to accommodate large-scale data sets, such a system can help users discover important patterns in the data and help them learn and explore in a dynamic, complex information space.
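The distributed computation of within-cluster document frequency (the core statistic behind distributed TF*ICDF) can be illustrated with the map-reduce pattern a Spark implementation would typically follow: a `flatMap` emitting `((cluster_id, term), 1)` pairs followed by a `reduceByKey`. The pure-Python sketch below mimics that pipeline locally and is not the dissertation's actual implementation.

```python
from collections import defaultdict

def map_phase(record):
    """Emit ((cluster_id, term), 1) for each distinct term in a document.
    In Spark this would be an RDD.flatMap over (cluster_id, terms) records."""
    cid, terms = record
    return [((cid, t), 1) for t in set(terms)]

def reduce_phase(pairs):
    """Sum counts per (cluster_id, term) key, as reduceByKey would,
    yielding the within-cluster document frequency of each term."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical toy partition of documents into clusters 0 and 1.
records = [
    (0, ["spark", "cluster"]),
    (0, ["spark", "label"]),
    (1, ["affinity", "propagation"]),
]
mapped = [pair for record in records for pair in map_phase(record)]
within_cluster_df = reduce_phase(mapped)
# within_cluster_df[(0, "spark")] == 2
```

Because the map phase is independent per document and the reduce phase is an associative sum, both parallelize naturally across Spark partitions, which is consistent with the near-linear speedups the abstract reports for the distributed methods.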