Conference: International Conference on Large-Scale Knowledge Resources

Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering

Abstract

In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as dimensionality reduction methods and investigate their effectiveness in document clustering on real-world document sets. To cluster documents, we use a method based on multinomial mixtures, which is known as an efficient framework for text mining. Clustering results are evaluated by F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each Web article as the ground truth. Our experiment shows that dimensionality reduction via LDA and pLSI yields document clusters of almost the same quality as those obtained from the original feature vectors. Therefore, we can reduce the vector dimension without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment shows no meaningful difference between LDA and pLSI. This result suggests that LDA does not replace pLSI, at least for dimensionality reduction in document clustering.
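
The abstract states only that the F-measure is the harmonic mean of precision and recall; it does not say how precision and recall are computed over clusters. As a minimal sketch, the example below assumes a pair-counting definition (a pair of documents counts as a true positive when it shares both a cluster and a category), which may differ from the paper's actual evaluation. The function name `pairwise_f_measure` and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(clusters, categories):
    """Pair-counting F-measure of a clustering against ground-truth categories.

    Assumption: precision and recall are taken over document pairs.
    The abstract does not specify the exact definition used in the paper.
    """
    if len(clusters) != len(categories):
        raise ValueError("clusters and categories must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_category = categories[i] == categories[j]
        if same_cluster and same_category:
            tp += 1  # pair grouped together, and correctly so
        elif same_cluster:
            fp += 1  # pair grouped together, but categories differ
        elif same_category:
            fn += 1  # pair split apart, but categories match

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy example: four documents, two ground-truth categories.
score = pairwise_f_measure([0, 0, 0, 1], ["a", "a", "b", "b"])
print(round(score, 3))
```

Under this assumed definition, the same function can score clusterings built from the original feature vectors and from reduced vectors, so the two can be compared on equal terms.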
