
Short text clustering based on Pitman-Yor process mixture model


Abstract

To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require a maximum possible cluster number to be set before the real number of clusters is inferred. However, it is difficult to choose a proper value because the true number of clusters in short texts is not known beforehand. Moreover, with a Dirichlet process as the prior, the cluster distribution in DMM decays exponentially as the number of clusters increases. We therefore propose a novel model based on the Pitman-Yor process to capture the power-law phenomenon of the cluster distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
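
The choice between an active cluster and a new cluster can be illustrated with the standard Chinese-restaurant form of the Pitman-Yor process prior. The sketch below is a minimal illustration under stated assumptions, not the paper's sampler: it covers only the prior term, whereas the collapsed Gibbs sampler described in the abstract would also multiply in a likelihood for the words of each text. The names `pyp_prior`, `sample_partition`, `discount`, and `concentration` are hypothetical, and the code assumes 0 <= discount < 1 and concentration > 0. With n texts already assigned to K clusters of sizes n_k, an existing cluster k has prior probability (n_k - discount) / (n + concentration), and a new cluster has probability (concentration + discount * K) / (n + concentration).

```python
import random


def pyp_prior(counts, discount, concentration):
    """Prior probabilities for joining each existing cluster or a new one.

    counts[k] is the number of texts currently in cluster k.
    The last entry of the returned list is the new-cluster probability.
    Assumes 0 <= discount < 1 and concentration > 0.
    """
    n = sum(counts)
    k = len(counts)
    denom = n + concentration
    probs = [(c - discount) / denom for c in counts]
    probs.append((concentration + discount * k) / denom)
    return probs


def sample_partition(num_texts, discount, concentration, seed=0):
    """Draw cluster sizes by seating texts one at a time under the prior."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_texts):
        probs = pyp_prior(counts, discount, concentration)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        if choice == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[choice] += 1   # join an active cluster
    return counts


if __name__ == "__main__":
    print(pyp_prior([3, 1], discount=0.5, concentration=1.0))
    print(sorted(sample_partition(200, 0.5, 1.0), reverse=True))
```

Setting `discount` to 0 recovers the Dirichlet process prior; a positive discount makes new clusters more likely as the number of clusters grows, which is what produces power-law cluster sizes.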