European Conference on IR Research

Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

Abstract

A high degree of topical diversity is often considered an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models to measure the diversity of documents is suboptimal because of two problems: generality and impurity. General topics include only common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.
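To make the three elements (words, topics, documents) concrete, the following is a minimal sketch of one common way to score a document's topical diversity from a topic model: Rao's quadratic entropy over the document's topic proportions. The abstract does not state which diversity measure is used, so this choice is an assumption, as are the cosine distance between topics and the names `cosine_distance` and `rao_diversity`. It does not implement the re-estimation approach itself.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two topic-word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an empty topic is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rao_diversity(doc_topics, topic_words):
    """Rao's quadratic entropy for one document (illustrative only).

    doc_topics:  list of P(topic | document), one entry per topic.
    topic_words: list of topic-word vectors P(word | topic), one per topic.
    """
    k = len(doc_topics)
    total = 0.0
    for i in range(k):
        for j in range(k):
            distance = cosine_distance(topic_words[i], topic_words[j])
            total += doc_topics[i] * doc_topics[j] * distance
    return total


# Two toy topics over a two-word vocabulary with no shared words.
topics = [[1.0, 0.0], [0.0, 1.0]]
focused = rao_diversity([1.0, 0.0], topics)  # document about one topic
mixed = rao_diversity([0.5, 0.5], topics)    # document split across both
print(focused, mixed)
```

Under this measure, a general topic that is assigned to most documents would raise every document's score in a similar way, and an impure topic would distort the distances between topics, which is consistent with the motivation given above for re-estimating the model before measuring diversity.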