To cluster corpora of ordinary documents and of digital books effectively, we propose clustering algorithms based on the LDA (Latent Dirichlet Allocation) model and on the TC_LDA model, respectively. TC_LDA, an extension of LDA, is designed for digital book corpora and jointly models topics from both Texts (the body text of a book) and Contents (its table of contents). Unlike traditional clustering methods, topic-model-based methods place documents in the same group if they share one or more common topics. Empirical evaluation shows that our approach based on topic analysis substantially improves clustering results compared with related methods.
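The idea of grouping documents by shared topics can be illustrated with a minimal sketch. It assumes that per-document topic distributions have already been inferred by a topic model such as LDA, and it assigns each document to the cluster of its highest-probability topic. The abstract does not state the exact assignment rule, so the dominant-topic rule, the function name `cluster_by_dominant_topic`, and the sample distributions are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict


def cluster_by_dominant_topic(doc_topic):
    """Group document ids by the index of their highest-probability topic.

    doc_topic maps a document id to its topic distribution (a list of
    probabilities, one per topic), assumed to come from a fitted topic model.
    """
    clusters = defaultdict(list)
    for doc_id, theta in doc_topic.items():
        # Index of the topic with the largest probability for this document.
        dominant = max(range(len(theta)), key=lambda k: theta[k])
        clusters[dominant].append(doc_id)
    return dict(clusters)


# Hypothetical document-topic distributions (illustrative values only).
doc_topic = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.10, 0.70, 0.20],
    "doc_c": [0.65, 0.30, 0.05],
    "doc_d": [0.05, 0.25, 0.70],
}

clusters = cluster_by_dominant_topic(doc_topic)
for topic, docs in sorted(clusters.items()):
    print(topic, docs)
```

Here `doc_a` and `doc_c` end up in the same cluster because topic 0 dominates both. A softer rule, such as placing a document in every cluster whose topic probability exceeds a threshold, would be closer to "one or more common topics" but is equally an assumption.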