Venue: IEEE International Conference on Big Data

Search for K: Assessing Five Topic-Modeling Approaches to 120,000 Canadian Articles

Abstract

Topic modeling has been an important field in natural language processing (NLP) and has recently witnessed great methodological advances. Yet the development of topic modeling is still, if not increasingly, challenged by two critical issues. First, despite intense efforts toward nonparametric/post-training methods, the search for the optimal number of topics K remains a fundamental question in topic modeling and warrants input from domain experts. Second, with the development of more sophisticated models, topic modeling is now, ironically, treated as a black box, and it has become increasingly difficult to tell how research findings are informed by data, model specifications, or inference algorithms. Based on about 120,000 newspaper articles retrieved from three major Canadian newspapers (Globe and Mail, Toronto Star, and National Post) since 1977, we employ five methods with different model specifications and inference algorithms (Latent Semantic Analysis, Latent Dirichlet Allocation, Principal Component Analysis, Factor Analysis, Nonnegative Matrix Factorization) to identify discussion topics. The optimal topics are then assessed using three measures: coherence statistics, held-out likelihood (loss), and graph-based dimensionality selection. Mixed findings from this research complement advances in topic modeling and provide insights into the choice of optimal topics in social science research.
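
To make the "search for K" concrete, the following is a minimal sketch of one of the five listed methods, Nonnegative Matrix Factorization, fitted for several candidate values of K and compared by reconstruction loss. It is not the authors' pipeline. The toy document-term matrix `V`, the helper names, the iteration count, and the use of in-sample squared error (rather than the held-out loss the abstract mentions) are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

def frobenius_loss(v, w, h):
    """Squared Frobenius reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: V is approximated by W H with W, H >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start keeps every later update nonnegative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical 6-document x 6-term count matrix built from two disjoint
# "topics" (terms 0-2 and terms 3-5); real corpora are far larger and noisier.
V = [
    [2, 1, 1, 0, 0, 0],
    [4, 2, 2, 0, 0, 0],
    [2, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2, 1],
]

# Fit one model per candidate K and record its reconstruction loss.
losses = {}
for k in range(1, 5):
    W, H = nmf(V, k)
    losses[k] = frobenius_loss(V, W, H)
    print(f"K={k}  loss={losses[k]:.4f}")
```

In-sample loss can only guide the choice loosely, since adding topics rarely makes the fit worse. That is one reason the abstract pairs a loss measure with coherence statistics and graph-based dimensionality selection, and calls for input from domain experts.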