Information Processing & Management

Vocabulary size and its effect on topic representation


Abstract

This study investigates how the computational overhead of topic model training may be reduced by selectively removing terms from the vocabulary of the text corpora being modeled. We compare the impact of removing singly occurring terms; the top 0.5%, 1%, and 5% most frequently occurring terms; and both the top 0.5% most frequent terms and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100), using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
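The vocabulary-pruning strategies and the Jensen-Shannon similarity measure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tokenized-document input format, and the base-2 logarithm in the divergence are all choices made here for clarity.

```python
import math
from collections import Counter

def prune_vocabulary(docs, top_frac=0.005, drop_singletons=True):
    """Remove the top `top_frac` most frequent terms and (optionally)
    singly occurring terms from tokenized documents -- one of the
    pruning configurations compared in the study (top 0.5% + singletons)."""
    counts = Counter(term for doc in docs for term in doc)
    ranked = [term for term, _ in counts.most_common()]
    drop = set(ranked[:int(len(ranked) * top_frac)])  # most frequent terms
    if drop_singletons:
        drop |= {term for term, c in counts.items() if c == 1}
    return [[term for term in doc if term not in drop] for doc in docs]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete topic-word
    distributions; 0 for identical distributions, 1 for disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Usage: singletons "banana" and "cherry" are removed; "apple" survives.
docs = [["apple", "apple", "banana"], ["apple", "cherry"]]
print(prune_vocabulary(docs))  # [['apple', 'apple'], ['apple']]
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0 (maximally dissimilar topics)
```

Pruning before training shrinks the term-document input (and hence training cost); the divergence is then computed pairwise over the trained topics' word distributions to assess how distinct the resulting topics are.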
