Published in: 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015

Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words



Abstract

We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs), whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) need aggregation of short messages to avoid data sparsity in short documents, our framework works on large amounts of raw short texts (billions of words). In contrast with other topic modeling frameworks that use word co-occurrence statistics, our framework uses a vector space model that overcomes the issue of sparse word co-occurrence patterns. We demonstrate that our framework outperforms LDA on short texts through both subjective and objective evaluation. We also show the utility of our framework in learning topics and classifying short texts on Twitter data for English, Spanish, French, Portuguese and Russian.
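The core idea above can be sketched in a few lines: fit a Gaussian mixture over word vectors so that each mixture component plays the role of a latent topic, then read topics off the soft posteriors. The sketch below is illustrative only, not the authors' pipeline; the toy vocabulary and random vectors stand in for real pretrained embeddings, and `scikit-learn`'s `GaussianMixture` stands in for whatever GMM implementation the paper used.

```python
# Illustrative sketch (assumed setup, not the paper's implementation):
# soft-cluster word embeddings with a GMM; each component ~ one latent topic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for pretrained distributed word representations
# (in practice, e.g. 50-dimensional word2vec/GloVe vectors).
vocab = ["goal", "match", "team", "vote", "senate", "policy"]
word_vectors = rng.normal(size=(len(vocab), 50))

n_topics = 2  # number of mixture components = number of latent topics
gmm = GaussianMixture(
    n_components=n_topics, covariance_type="diag", random_state=0
).fit(word_vectors)

# Soft assignment: P(topic | word) for every word in the vocabulary.
word_topic_probs = gmm.predict_proba(word_vectors)

# One simple way to score a short text: average the topic posteriors
# of the words it contains (an assumption of this sketch).
doc = ["team", "match", "goal"]
doc_probs = word_topic_probs[[vocab.index(w) for w in doc]].mean(axis=0)
print(doc_probs)  # a distribution over the n_topics components
```

Note the contrast with pLSA/LDA: nothing here counts word co-occurrences within documents, so the model can be fit directly on embeddings learned from billions of words of raw short text.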
