Sparse Topical Coding with Sparse Groups

Abstract

Learning latent semantic representations from large corpora of short texts is of considerable practical significance in both research and engineering. However, standard topic models are difficult to apply in microblogging environments because microblogs are short, numerous, noisy, and irregular in modality, which prevents topic models from exploiting the full information they contain. In this paper, we propose a novel non-probabilistic topic model called sparse topical coding with sparse groups (STCSG), which can discover sparse latent semantic representations of large short-text corpora. STCSG relaxes the normalization constraint on the inferred representations and uses the sparse group lasso, a sparsity-inducing regularizer, which makes it convenient to directly control the sparsity of document, topic, and word codes. Furthermore, the relaxed non-probabilistic STCSG can be learned efficiently with the alternating direction method of multipliers (ADMM). Our experimental results on a Twitter dataset demonstrate that STCSG performs well in finding meaningful latent representations of short documents and can therefore substantially improve the accuracy and efficiency of document classification.
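The abstract does not give the STCSG objective or its ADMM updates, so the following is only a generic sketch of the proximal operator of a sparse group lasso penalty, lam1 * ||x||_1 + lam2 * sum_g ||x_g||_2, a building block that ADMM-style solvers for such penalties commonly rely on. The function names, the group layout, and the weights `lam1` and `lam2` are illustrative assumptions and are not taken from the paper.

```python
import math


def soft_threshold(x, t):
    """Scalar soft-thresholding: shrink x toward zero by t (the L1 prox)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def prox_sparse_group_lasso(v, groups, lam1, lam2):
    """Prox of lam1*||x||_1 + lam2*sum_g ||x_g||_2 for non-overlapping groups.

    v      : list of floats (the point at which the prox is evaluated)
    groups : list of index lists partitioning range(len(v)) (hypothetical layout)
    lam1   : weight of the element-wise L1 term (assumed name)
    lam2   : weight of the group-wise L2 term (assumed name)
    """
    out = [0.0] * len(v)
    for idx in groups:
        # Step 1: element-wise soft-thresholding gives within-group sparsity.
        u = [soft_threshold(v[i], lam1) for i in idx]
        # Step 2: group soft-thresholding can zero out the whole group.
        norm = math.sqrt(sum(x * x for x in u))
        scale = max(1.0 - lam2 / norm, 0.0) if norm > 0.0 else 0.0
        for i, x in zip(idx, u):
            out[i] = scale * x
    return out


if __name__ == "__main__":
    codes = [3.0, -0.5, 0.2, 0.6, -0.6]
    print(prox_sparse_group_lasso(codes, [[0, 1, 2], [3, 4]], 0.5, 1.0))
```

In this sketch the first group keeps a single shrunken entry while the second group is set to zero entirely, which illustrates how one regularizer can control sparsity at both the element level and the group level.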
