Sampled Weighted Min-Hashing for Large-Scale Topic Mining

Published in: Mexican Conference on Pattern Recognition

Abstract

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probability distribution over a vocabulary, SWMH topics are ordered subsets of that vocabulary. Interestingly, the topics mined by SWMH reflect themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics, both qualitatively and quantitatively, on the NIPS (1.7K documents), 20 Newsgroups (20K), Reuters (800K) and Wikipedia (4M) corpora. Additionally, we compare the quality of SWMH topics with that of Online LDA topics for document representation in classification.
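The partition-and-agglomerate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' SWMH implementation: it uses plain (unweighted, unsampled) min-hashing over the set of documents each term occurs in, and a greedy merge based on the overlap coefficient. The function names, the parameters `n_partitions`, `tuple_size`, `threshold` and `min_size`, and the whitespace tokenizer are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def term_occurrences(docs):
    """Map each term to the set of ids of the documents that contain it."""
    occ = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc.lower().split():  # naive tokenizer (assumption)
            occ[term].add(doc_id)
    return dict(occ)


def minhash_partition(occ, n_docs, tuple_size, rng):
    """Build one random partition of the vocabulary.

    Terms whose occurrence sets agree on all `tuple_size` min-hash
    values end up in the same cell.
    """
    perms = []
    for _ in range(tuple_size):
        perm = list(range(n_docs))
        rng.shuffle(perm)  # random permutation of document ids
        perms.append(perm)
    cells = defaultdict(set)
    for term, doc_ids in occ.items():
        key = tuple(min(perm[d] for d in doc_ids) for perm in perms)
        cells[key].add(term)
    return list(cells.values())


def merge_overlapping(cells, threshold):
    """Greedily merge cells whose overlap coefficient reaches `threshold`.

    The overlap coefficient and the greedy order are illustrative
    choices, not taken from the paper.
    """
    merged = []
    for cell in cells:
        cell = set(cell)
        if not cell:
            continue
        keep = []
        for other in merged:
            overlap = len(cell & other) / min(len(cell), len(other))
            if overlap >= threshold:
                cell |= other  # absorb the overlapping cell
            else:
                keep.append(other)
        keep.append(cell)
        merged = keep
    return merged


def mine_topics(docs, n_partitions=20, tuple_size=2, threshold=0.7,
                min_size=2, seed=0):
    """Collect cells from several random partitions and merge them."""
    rng = random.Random(seed)
    occ = term_occurrences(docs)
    cells = []
    for _ in range(n_partitions):
        partition = minhash_partition(occ, len(docs), tuple_size, rng)
        cells.extend(c for c in partition if len(c) >= min_size)
    return merge_overlapping(cells, threshold)
```

Calling `mine_topics(docs)` on a list of document strings returns a list of term sets. Terms that occur in similar sets of documents tend to collide in the same cell across many partitions, so their cells overlap heavily and are merged into one candidate topic. Ordering terms within a topic, as the abstract describes, is left out of this sketch.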
