Descriptive document clustering aims to automatically discover groups of semantically related documents and to assign a meaningful label to characterise the content of each cluster. In this paper, we present a descriptive clustering approach that employs a distributed representation model, namely the paragraph vector model, to capture semantic similarities between documents and phrases. The proposed method uses a joint representation of phrases and documents (i.e., a co-embedding) to automatically select a descriptive phrase that best represents each document cluster. We evaluate our method by comparing its performance to an existing state-of-the-art descriptive clustering method that also uses co-embedding but relies on a bag-of-words representation. Results obtained on benchmark datasets demonstrate that the paragraph vector-based method obtains superior performance over the existing approach in both identifying clusters and assigning appropriate descriptive labels to them.
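As an illustration of how a co-embedding can be used to label clusters, the following minimal sketch picks, for each cluster, the phrase whose vector is closest to the cluster centroid. It rests on several assumptions that are not taken from the paper: the vectors are hand-written toy values rather than the output of a trained paragraph vector model, the cluster assignments are given rather than computed, the phrase names are hypothetical, and cosine similarity to the centroid is only one plausible selection criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def label_clusters(doc_vectors, assignments, phrase_vectors):
    """Return {cluster_id: phrase} using nearest-phrase-to-centroid.

    Assumption: documents and phrases live in the same vector space
    (a co-embedding), so their vectors are directly comparable.
    """
    clusters = {}
    for vec, cluster_id in zip(doc_vectors, assignments):
        clusters.setdefault(cluster_id, []).append(vec)

    labels = {}
    for cluster_id, members in clusters.items():
        center = centroid(members)
        labels[cluster_id] = max(
            phrase_vectors,
            key=lambda phrase: cosine(phrase_vectors[phrase], center),
        )
    return labels


# Toy 2-dimensional co-embedding (hypothetical values for illustration).
doc_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
assignments = [0, 0, 1, 1]  # assumed output of a prior clustering step
phrase_vectors = {
    "neural networks": [1.0, 0.0],
    "protein folding": [0.0, 1.0],
    "general methods": [0.7, 0.7],
}

labels = label_clusters(doc_vectors, assignments, phrase_vectors)
print(labels)
```

In a realistic setting, the document and phrase vectors would be inferred by the same trained embedding model, and the assignments would come from a clustering algorithm run on the document vectors.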