Journal: Data technologies and applications

SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings

Abstract

Topic models have been widely applied to discover important information in large amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, most of which come from the Web. These models also have a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model, the Sense Unit based Phrase Topic Model (SenU-PTM), that addresses both the sparsity and the readability problems.

Design/methodology/approach: SenU-PTM is a novel phrase-based short-text topic model built on a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings to generate phrases from the original corpus. The second phase introduces the new concept of a sense unit, a set of semantically similar tokens, and models topics over sense units using the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on these two phases.

Findings: Experimental results on two real-world, publicly available datasets show the effectiveness of SenU-PTM in terms of topical quality and document characterization. The results indicate that modeling topics on sense units can address the sparsity of short texts and improve the readability of topics at the same time.

Originality/value: The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.
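
The two-phase idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a hypothetical rule for each phase: adjacent tokens are merged into a phrase when the cosine similarity of their vectors reaches a threshold, and tokens are greedily grouped into sense units around a seed token using the same threshold. The toy vectors, the threshold of 0.8, and the function names `merge_phrases` and `build_sense_units` are all illustrative assumptions. For brevity the grouping runs on word vectors only, whereas the abstract builds sense units from the token vectors produced in the first phase.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_phrases(tokens, vectors, threshold=0.8):
    """Phase 1 (assumed rule): join adjacent tokens whose vectors are similar."""
    out = []
    i = 0
    while i < len(tokens):
        has_next = i + 1 < len(tokens)
        if has_next and cosine(vectors[tokens[i]], vectors[tokens[i + 1]]) >= threshold:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both tokens were consumed by the phrase
        else:
            out.append(tokens[i])
            i += 1
    return out


def build_sense_units(vectors, threshold=0.8):
    """Phase 2 (assumed rule): greedily group tokens around a seed token."""
    units = []  # each entry is (seed_vector, [member tokens])
    for token in sorted(vectors):  # sorted for a deterministic result
        for seed, members in units:
            if cosine(seed, vectors[token]) >= threshold:
                members.append(token)
                break
        else:
            # No existing unit is close enough, so this token seeds a new one.
            units.append((vectors[token], [token]))
    return [members for _, members in units]


# Hand-made 3-dimensional vectors, for illustration only.
TOY_VECTORS = {
    "machine": [0.9, 0.1, 0.0],
    "learning": [0.8, 0.3, 0.0],
    "soccer": [0.0, 0.1, 0.9],
    "football": [0.1, 0.0, 0.95],
    "cheap": [0.0, 1.0, 0.1],
}

phrases = merge_phrases(["cheap", "machine", "learning", "soccer"], TOY_VECTORS)
sense_units = build_sense_units(TOY_VECTORS)
print(phrases)
print(sense_units)
```

In this sketch a topic model would then count occurrences of sense units rather than individual words, which is one way to see why grouping similar tokens can reduce sparsity in short texts.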