Workshop on Vector Space Modeling for Natural Language Processing

Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words


Abstract

We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs) whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) need aggregation of short messages to avoid data sparsity in short documents, our framework works on large amounts of raw short texts (billions of words). In contrast with other topic modeling frameworks that use word co-occurrence statistics, our framework uses a vector space model that overcomes the issue of sparse word co-occurrence patterns. We demonstrate that our framework outperforms LDA on short texts through both subjective and objective evaluation. We also show the utility of our framework in learning topics and classifying short texts on Twitter data for English, Spanish, French, Portuguese and Russian.
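The core idea in the abstract, soft-clustering word embeddings with a GMM whose components act as latent topics, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy vectors stand in for pretrained embeddings (e.g. word2vec), and averaging word posteriors to get a short text's topic distribution is one simple aggregation choice assumed here, not necessarily the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for pretrained word embeddings: rows are words,
# columns are embedding dimensions. Two loose clusters by construction.
vocab = ["apple", "banana", "fruit", "goal", "match", "team"]
vectors = np.vstack([
    rng.normal(loc=(0.0 if i < 3 else 1.0), scale=0.1, size=8)
    for i in range(len(vocab))
])

# Fit a GMM whose K components play the role of latent topics.
K = 2
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(vectors)

# Soft cluster assignments: P(topic | word) for every word in the vocabulary.
word_topic = gmm.predict_proba(vectors)          # shape: (len(vocab), K)

# One simple way to get a short text's topic distribution:
# average the posteriors of its words.
doc = ["apple", "banana"]
doc_topics = word_topic[[vocab.index(w) for w in doc]].mean(axis=0)
```

Because the clustering is soft, each word carries a full distribution over topics rather than a hard label, which is what lets very short documents (even single tweets) receive a usable topic mixture without the message aggregation that pLSA/LDA require.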
