Investigation of the Quality of Topic Models for Noisy Data Sources

Abstract

Latent Dirichlet Allocation (LDA) has become the most stable and widely used topic model for deriving topics from document collections, with varying degrees of success across input domains. It is therefore essential to evaluate LDA against the quality of its input, since noise and uncertainty in the content degrade the topic model. The major contribution of this investigation is a critical evaluation of LDA with respect to the quality of the input sources and human perception. The empirical study shows the relationship between the quality of the input and the accuracy of the output generated by LDA. Perplexity and coherence were evaluated on three data sets (RCV1, a conference data set, and tweets) whose contents differ in complexity and uncertainty. Topics defined by humans were compared with the topics generated by LDA. The results demonstrate a strong relationship between the quality of the input and the quality of the generated topics: formally written content produced highly relevant topics, while noisy and messy content led to meaningless topics. A considerable gap was observed between human-defined topics and LDA-generated topics. Finally, a concept-based topic modeling technique is proposed to improve topic quality by capturing the meaning of the content and eliminating irrelevant and meaningless topics.
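The abstract names perplexity and coherence as its evaluation measures but does not say how they were computed. The sketch below shows one common way each could be calculated. The UMass coherence variant, the function names `perplexity` and `umass_coherence`, and the toy documents are all illustrative assumptions, not details taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token log-likelihood.

    `token_log_probs` holds natural-log probabilities that a fitted topic
    model assigns to each held-out token (hypothetical input). Lower is better.
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic (an assumed choice of measure).

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words, where D
    counts documents containing all the given words. Values closer to zero
    indicate words that co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            co_occurrences = doc_freq(top_words[i], top_words[j])
            score += math.log((co_occurrences + 1) / denom)
    return score


# Toy corpus (made up for illustration): three formal documents, one noisy.
docs = [
    ["market", "stock", "trade"],
    ["market", "stock"],
    ["lol", "omg"],
    ["market", "trade"],
]

clean_topic = umass_coherence(["market", "stock"], docs)
noisy_topic = umass_coherence(["market", "lol"], docs)
uniform_ppl = perplexity([math.log(0.25)] * 10)
```

In this toy setting the topic whose words co-occur scores higher than the one that mixes a formal word with a noise token, which is the kind of contrast between formal and noisy input that the abstract reports.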