Conference proceedings: Advances in computer science and information technology

Topic Detection by Topic Model Induced Distance Using Biased Initiation



Abstract

Clustering is widely used in topic detection tasks. However, distances based on the vector space (VS) model, such as cosine-like distance, yield low precision and recall when the corpus contains many related topics. In this paper, we propose a new distance measure: the Topic Model (TM) induced distance. Assuming that the word distribution differs from topic to topic, the documents can be treated as a sample from a mixture of k topic models, which can be estimated using expectation maximization (EM). We propose a biased initiation method for topic decomposition using EM, which produces a converged matrix from which the TM induced distance is derived. Collections of web news are clustered into classes using this TM distance. A series of experiments is described on a corpus containing 5033 web news articles from 30 topics. K-means clustering is run on test sets with different numbers of topics, and the clustering results obtained with the TM induced distance are compared with those obtained with the traditional cosine-like distance. The experimental results show that the proposed topic decomposition method using biased initiation is more effective than topic decomposition initialized with random values. The TM induced distance generates more topical groups than the VS model based cosine-like distance. In web news collections containing related topics, the TM induced distance achieves better precision and recall.
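The abstract does not define the biased initiation or the TM induced distance, so the following is only a minimal sketch of the general idea under stated assumptions: documents are modeled as a mixture of k unigram topic models fitted by EM, the "biased" start is assumed to seed each topic from one chosen document (the hypothetical `biased_init`), and the distance is assumed to be the Euclidean distance between rows of the converged document-topic posterior matrix (the hypothetical `tm_distance`). None of these names or choices come from the paper itself.

```python
import numpy as np

def biased_init(counts, seed_docs, smooth=1e-2):
    """Assumed biased start: seed each topic's word distribution from one document."""
    word_probs = np.asarray(counts, dtype=float)[seed_docs] + smooth
    return word_probs / word_probs.sum(axis=1, keepdims=True)

def em_mixture_of_unigrams(counts, init_word_probs, n_iter=50, smooth=1e-2):
    """EM for a mixture of k unigram topic models.

    counts: (n_docs, n_words) term-count matrix.
    init_word_probs: (k, n_words) strictly positive starting word distributions.
    Returns the (n_docs, k) matrix of topic posteriors per document.
    """
    counts = np.asarray(counts, dtype=float)
    word_probs = np.asarray(init_word_probs, dtype=float)
    n_docs, k = counts.shape[0], word_probs.shape[0]
    priors = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log P(topic) + sum over words of count * log P(word | topic).
        log_joint = np.log(priors) + counts @ np.log(word_probs).T
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        posteriors = np.exp(log_joint)
        posteriors /= posteriors.sum(axis=1, keepdims=True)
        # M-step: smoothed re-estimates of topic priors and word distributions.
        priors = (posteriors.sum(axis=0) + smooth) / (n_docs + k * smooth)
        word_probs = posteriors.T @ counts + smooth
        word_probs /= word_probs.sum(axis=1, keepdims=True)
    return posteriors

def tm_distance(posteriors):
    """Assumed distance: pairwise Euclidean distance between posterior rows."""
    diff = posteriors[:, None, :] - posteriors[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy corpus: documents 0-1 share one vocabulary, documents 2-3 another.
counts = np.array([
    [5, 4, 0, 0],
    [4, 6, 1, 0],
    [0, 1, 5, 4],
    [0, 0, 6, 5],
])
init = biased_init(counts, seed_docs=[0, 2])
post = em_mixture_of_unigrams(counts, init)
dist = tm_distance(post)
```

In this sketch, the rows of `post` (or the matrix `dist`) would be what a k-means step clusters in place of raw term vectors; the paper's actual construction may differ.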

Bibliographic details

  • Source
  • Conference location: Miyazaki (JP)
  • Author affiliations

    Harbin Institute of Technology, Harbin, People's Republic of China, Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China;

    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China;

    Harbin Institute of Technology, Harbin, People's Republic of China, Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China;

    Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen Graduate School, Shenzhen, People's Republic of China;

  • Conference organizer
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification: Information processing;
  • Keywords

    topic detection; topic model; clustering; distance measure;

