Conference: 7th International Conference on Natural Language Processing and Knowledge Engineering

Dirichlet Process Mixture Models based topic identification for short text streams


Abstract

Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work is mostly based on regular news text, and the problem of scaling to short stories remains largely open. In addition, prior work performs topic identification on already separated stories, assuming story segmentation as a prerequisite, although segmentation is itself a challenging and critical task in TDT research. In this paper, we propose a Dirichlet Process Mixture Model (DPMM) based topic identification method that handles topic segmentation, topic detection, and tracking in a unified model and achieves reasonable results on short stories. We first present the DPMM and its application to the topic identification task. We then discuss two solutions designed specifically to address the sparseness problem associated with short stories. The first concerns the algorithm flow: the processing unit for topic identification is first converted from a single short text to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. We then extend the DPMM to identify topics in spontaneous text streams by handling topic segmentation, topic detection, and tracking simultaneously. An attractive advantage of the DPMM is that the number of mixture components need not be fixed in advance, and no prior knowledge about the number or content of topics is required. Compared with other existing methods, it is therefore more suitable for streaming topic identification. Our empirical results on the TDT3 evaluation data show that the DPMM is effective for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM.
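To make concrete the property that the number of mixture components need not be fixed in advance, here is a minimal sketch of a generic Dirichlet process mixture applied to a stream of tokenized short texts. It is not the paper's inference procedure or its extended model. It assumes a Chinese restaurant process prior, a Dirichlet-multinomial word model per topic, and a greedy single-pass assignment. The names `assign_stream`, `log_predictive`, `alpha`, and `beta` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def log_predictive(doc, counts, total, vocab_size, beta):
    """Log probability of the words in `doc` under a Dirichlet-multinomial
    topic with word counts `counts` (sum `total`) and smoothing `beta`."""
    logp = 0.0
    seen = Counter()  # words of this document scored so far
    for i, word in enumerate(doc):
        numerator = counts.get(word, 0) + seen[word] + beta
        denominator = total + i + vocab_size * beta
        logp += math.log(numerator / denominator)
        seen[word] += 1
    return logp


def assign_stream(docs, vocab_size, alpha=1.0, beta=0.1):
    """Greedy single-pass assignment (an assumption of this sketch):
    each document joins the highest-scoring existing topic, or opens a
    new one when the new-topic score is higher."""
    topics = []  # each topic: {"n": documents, "counts": Counter, "total": words}
    labels = []
    for doc in docs:
        # CRP prior: existing topic k weighted by n_k, new topic by alpha.
        scores = [
            math.log(t["n"])
            + log_predictive(doc, t["counts"], t["total"], vocab_size, beta)
            for t in topics
        ]
        scores.append(
            math.log(alpha) + log_predictive(doc, {}, 0, vocab_size, beta)
        )
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(topics):  # the new-topic option won
            topics.append({"n": 0, "counts": Counter(), "total": 0})
        topics[k]["n"] += 1
        topics[k]["counts"].update(doc)
        topics[k]["total"] += len(doc)
        labels.append(k)
    return labels


# Toy stream with an 8-word vocabulary; the data is made up for illustration.
stream = [
    ["stock", "market", "fall"],
    ["market", "stock", "rise"],
    ["football", "match", "goal"],
    ["goal", "football", "win"],
]
labels = assign_stream(stream, vocab_size=8)
```

On this toy stream the first two texts share one topic and the last two open and join a second, without the topic count being specified. A full treatment would typically sample assignments (for example with Gibbs sampling) rather than take the greedy maximum.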
