Topic models for tagged text.

Abstract

Our world is experiencing dramatic and continuing growth in digital textual information. This growth raises challenges in analyzing, understanding, organizing, and summarizing large bodies of text. A large portion of this text carries meta-data, such as user-annotated tags, which provides useful information and can help improve text mining results. This thesis therefore focuses on handling tagged text using topic modeling techniques.

We start from the Latent Dirichlet Allocation (LDA) model and introduce a Trivial Tag-Latent Dirichlet Allocation (TriTag-LDA) model, which directly connects the tags to the topics via an improved two-layer LDA model. Specifically, the bottom layer is the standard LDA, while the upper layer is a constrained LDA whose topics come from the bottom layer. We then propose a new topic model, Tag-Latent Dirichlet Allocation (Tag-LDA), which integrates tags into the generative process more naturally. In Tag-LDA, a document is viewed as a mixture of tags rather than topics, and topics are generated from multinomial distributions under tags. TriTag-LDA and Tag-LDA bridge the user-generated tags and the latent topics. In both models, a tag is described as a mixture of shared topics. This representation enables the analysis of relationships between tags. We provide quantitative and qualitative comparisons between our models and related work, and show that Tag-LDA is superior under the perplexity criterion. We also apply Tag-LDA to explain hashtags on Twitter and discover their relationships.

We then develop two extensions of Tag-LDA: Tag-Dirichlet Processes (Tag-LDP) and Tag-Dirichlet Allocation with concepts (ConceptTag-LDA). Tag-LDP uses the Dirichlet process so that the number of topics can be determined automatically from the data. Our experiments demonstrate that Tag-LDP can infer the number of topics from the data and that the quality of its topics is as good as that of Tag-LDA. ConceptTag-LDA provides a mechanism for incorporating users' prior knowledge when learning the topics. Users' knowledge, represented as pre-defined concepts, is modeled through a Dirichlet Tree prior, which replaces the original Dirichlet prior in Tag-LDA. Our experiments study the influence of the concepts on the topics and demonstrate that the input concepts can steer the topics toward users' prior knowledge.

Finally, we present the dynamic Twitter topic model (DTTM), a specialized temporal topic model tailored to the short messages of social media. On social media such as Twitter, people's discussions are constantly evolving, with many discussions centering on events. A major event usually involves twists and turns, reflected by multiple sub-events as it develops over time. This temporal development is in turn reflected in people's discussions on Twitter. In DTTM, we assume that an event can be modeled by a mainstream topic plus several facets, and that each tweet is a mixture of two topics: the mainstream topic and one facet topic. To capture the temporal dynamics of the discussions, DTTM models the temporal evolution of both the mainstream topic and the facet topics. To demonstrate the effectiveness of DTTM in modeling these dynamics, we conduct two case studies on Twitter data and show that our model summarizes the discussions better than existing topic models.
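
The generative story the abstract gives for Tag-LDA (a document is a mixture of its tags, each tag is a mixture of shared topics, and each topic emits words) can be illustrated with a small sampling sketch. This is only one reading of the abstract, not the thesis's actual formulation: the symmetric Dirichlet priors, the parameter `alpha`, the function name `sample_tagged_document`, and all array names and sizes below are assumptions introduced for illustration.

```python
import numpy as np


def sample_tagged_document(doc_tags, tag_topic, topic_word, n_words, rng, alpha=1.0):
    """Sample one document as: tag mixture -> tag -> topic -> word.

    doc_tags   : indices of the tags attached to this document
    tag_topic  : (n_tags, n_topics) rows are per-tag topic distributions
    topic_word : (n_topics, vocab_size) rows are per-topic word distributions
    """
    doc_tags = np.asarray(doc_tags)
    # Assumption: the document's mixture over its own tags has a
    # symmetric Dirichlet prior with concentration alpha.
    tag_weights = rng.dirichlet(np.full(len(doc_tags), alpha))

    words, tags, topics = [], [], []
    for _ in range(n_words):
        tag = rng.choice(doc_tags, p=tag_weights)                    # pick a tag
        topic = rng.choice(tag_topic.shape[1], p=tag_topic[tag])     # topic under that tag
        word = rng.choice(topic_word.shape[1], p=topic_word[topic])  # word under that topic
        tags.append(tag)
        topics.append(topic)
        words.append(word)
    return np.array(words), np.array(tags), np.array(topics)


# Toy sizes, chosen arbitrarily for the illustration.
rng = np.random.default_rng(0)
n_tags, n_topics, vocab_size = 4, 3, 20
tag_topic = rng.dirichlet(np.ones(n_topics), size=n_tags)
topic_word = rng.dirichlet(np.ones(vocab_size), size=n_topics)

# A document annotated with tags 0 and 2.
words, tags, topics = sample_tagged_document([0, 2], tag_topic, topic_word, 50, rng)
```

Because every tag draws from the same pool of topics, the rows of `tag_topic` are directly comparable, which is the property the abstract relies on when it analyzes relationships between tags.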

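Bibliographic details

  • Author: Ma, Zhiqiang
  • Author affiliation: The University of North Carolina at Charlotte
  • Degree-granting institution: The University of North Carolina at Charlotte
  • Discipline: Computer science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 133 p.
  • Total pages: 133
  • Original format: PDF
  • Language of text: English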
