Published in: 1st Workshop on Vector Space Modeling for Natural Language Processing, 2015

Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words



Abstract

We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs), whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA) need aggregation of short messages to avoid data sparsity in short documents, our framework works on large amounts of raw short texts (billions of words). In contrast with other topic modeling frameworks that use word co-occurrence statistics, our framework uses a vector space model that overcomes the issue of sparse word co-occurrence patterns. We demonstrate that our framework outperforms LDA on short texts through both subjective and objective evaluation. We also show the utility of our framework in learning topics and classifying short texts on Twitter data for English, Spanish, French, Portuguese and Russian.
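The core idea above can be sketched in a few lines: fit a Gaussian mixture over word vectors so that each mixture component plays the role of a latent topic, then read topics off the soft posteriors. The sketch below is illustrative only, not the authors' pipeline; the toy vocabulary and random vectors stand in for real pretrained embeddings, and `scikit-learn`'s `GaussianMixture` stands in for whatever GMM implementation the paper used.

```python
# Illustrative sketch (assumed setup, not the paper's implementation):
# soft-cluster word embeddings with a GMM; each component ~ one latent topic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for pretrained distributed word representations
# (in practice, e.g. 50-dimensional word2vec/GloVe vectors).
vocab = ["goal", "match", "team", "vote", "senate", "policy"]
word_vectors = rng.normal(size=(len(vocab), 50))

n_topics = 2  # number of mixture components = number of latent topics
gmm = GaussianMixture(
    n_components=n_topics, covariance_type="diag", random_state=0
).fit(word_vectors)

# Soft assignment: P(topic | word) for every word in the vocabulary.
word_topic_probs = gmm.predict_proba(word_vectors)

# One simple way to score a short text: average the topic posteriors
# of the words it contains (an assumption of this sketch).
doc = ["team", "match", "goal"]
doc_probs = word_topic_probs[[vocab.index(w) for w in doc]].mean(axis=0)
print(doc_probs)  # a distribution over the n_topics components
```

Note the contrast with pLSA/LDA: nothing here counts word co-occurrences within documents, so the model can be fit directly on embeddings learned from billions of words of raw short text.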
