Conference: International Conference on Natural Language Processing and Knowledge Engineering

Dirichlet Process Mixture Models based topic identification for short text streams

Abstract

Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work has mostly relied on regular news text, and the problem of scaling to short stories remains largely open. In addition, prior work performs topic identification on pre-segmented stories, treating story segmentation as a prerequisite, although segmentation is itself a challenging and critical task in TDT research. In this paper, we propose a topic identification method based on the Dirichlet Process Mixture Model (DPMM), which handles topic segmentation, topic detection, and topic tracking in a unified model and achieves reasonable results on short stories. We first present DPMM and its application to the topic identification task. We then discuss two solutions designed specifically to address the sparseness problem associated with short stories. The first concerns the algorithm flow: the processing unit for topic identification is first changed from a single short text to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. Next, we extend DPMM to identify topics in spontaneous text streams by managing topic segmentation, topic detection, and topic tracking simultaneously. An attractive advantage of DPMM is that the number of mixture components need not be fixed in advance, and no prior knowledge about the number or content of topics is required. Compared with other existing methods, it is therefore better suited to streaming topic identification. Our empirical results on the TDT3 evaluation data show that DPMM is effective for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM.
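
To make concrete the idea that the number of topics need not be fixed in advance, the following is a minimal sketch, not the authors' algorithm. It assumes a Chinese-restaurant-process prior with concentration `alpha`, a symmetric Dirichlet prior `beta` over a vocabulary of assumed size `vocab_size`, and a greedy single-pass assignment in place of proper posterior sampling. The class name `StreamingDPMM` and all parameter values are hypothetical. The session grouping and word-dependency extension described in the abstract are not modeled here.

```python
import math
from collections import Counter


class StreamingDPMM:
    """Greedy single-pass Dirichlet process mixture over bag-of-words texts.

    Illustrative only: each text is assigned to the highest-scoring topic
    (existing or new) instead of being sampled from the posterior.
    """

    def __init__(self, alpha=1.0, beta=0.1, vocab_size=1000):
        self.alpha = alpha            # concentration: tendency to open a new topic
        self.beta = beta              # symmetric Dirichlet prior on topic word distributions
        self.vocab_size = vocab_size  # assumed vocabulary size V
        self.doc_counts = []          # number of texts assigned to each topic
        self.word_counts = []         # per-topic Counter of word occurrences
        self.totals = []              # per-topic total number of word tokens

    def _log_predictive(self, doc, counts, total):
        # Dirichlet-multinomial log-probability of the text's words
        # given the word counts already assigned to a topic.
        logp = 0.0
        for word, c in doc.items():
            for j in range(c):
                logp += math.log(counts.get(word, 0) + self.beta + j)
        for i in range(sum(doc.values())):
            logp -= math.log(total + self.vocab_size * self.beta + i)
        return logp

    def assign(self, tokens):
        doc = Counter(tokens)
        # Start with the score for opening a new topic (prior weight alpha).
        best_k = len(self.doc_counts)
        best_score = math.log(self.alpha) + self._log_predictive(doc, {}, 0)
        # Compare against every existing topic (prior weight = its size).
        for k, n_k in enumerate(self.doc_counts):
            score = math.log(n_k) + self._log_predictive(
                doc, self.word_counts[k], self.totals[k]
            )
            if score > best_score:
                best_k, best_score = k, score
        if best_k == len(self.doc_counts):
            # No existing topic won, so create a new one on the fly.
            self.doc_counts.append(0)
            self.word_counts.append(Counter())
            self.totals.append(0)
        self.doc_counts[best_k] += 1
        self.word_counts[best_k].update(doc)
        self.totals[best_k] += sum(doc.values())
        return best_k


# Toy stream of short texts (made-up tokens, not TDT3 data).
stream = [
    ["earthquake", "rescue", "japan"],
    ["election", "vote", "senate"],
    ["earthquake", "japan", "tsunami"],
    ["vote", "election", "ballot"],
]
model = StreamingDPMM(alpha=1.0, beta=0.1, vocab_size=1000)
labels = [model.assign(tokens) for tokens in stream]
print(labels)  # [0, 1, 0, 1]
```

In this toy run the second text opens a new topic because it shares no words with the first, while the third and fourth texts join the topics whose vocabulary they overlap. A larger `alpha` makes new topics more likely. A sampler-based implementation would revisit earlier assignments, which this greedy pass never does.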
