Nested variational autoencoder for topic modelling on microtexts with word vectors

Abstract

Most of the information on the Internet is represented in the form of microtexts, which are short text snippets such as news headlines or tweets. These sources of information are abundant, and mining these data could uncover meaningful insights. Topic modelling is one of the popular methods for extracting knowledge from a collection of documents; however, conventional topic models such as latent Dirichlet allocation (LDA) perform poorly on short documents, mostly because of the scarcity of word co-occurrence statistics in the data. The objective of our research is to create a topic model that achieves strong performance on microtexts while keeping runtime small enough to scale to large datasets. To compensate for the limited information in microtexts, our method takes advantage of word embeddings as additional knowledge of the relationships between words. For speed and scalability, we apply autoencoding variational Bayes, an algorithm that performs efficient black-box inference in probabilistic models. The result of our work is a novel topic model called the nested variational autoencoder: a distribution that takes word vectors into account and is parameterized by a neural network. For optimization, the model is trained to approximate the posterior distribution of the original LDA model. Experiments show the improvements of our model on microtexts as well as its runtime advantage.
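The core mechanism the abstract describes, an encoder that maps a document's bag-of-words to the parameters of a latent topic distribution, trained with the reparameterization trick of autoencoding variational Bayes, and a decoder whose topic-word distributions are derived from word vectors, can be sketched as below. This is a minimal illustrative sketch in NumPy, not the authors' implementation: the function names, toy dimensions, and random stand-in word vectors are all assumptions, and the paper's specific nested architecture and LDA-posterior training objective are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D_EMB = 50, 5, 16  # vocab size, number of topics, word-vector dim (toy sizes)

# Pretrained word vectors would normally come from word2vec/GloVe;
# random stand-ins are used here so the sketch is self-contained.
word_vecs = rng.normal(size=(V, D_EMB))

# Encoder weights: map a bag-of-words vector to the mean and log-variance
# of a logistic-normal latent variable (a common relaxation of the Dirichlet).
W_mu = rng.normal(scale=0.1, size=(V, K))
W_logvar = rng.normal(scale=0.1, size=(V, K))

# Decoder: topic embeddings; topic-word logits are dot products with the
# word vectors, so semantically related words receive correlated mass.
topic_emb = rng.normal(scale=0.1, size=(K, D_EMB))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def encode(bow):
    """Variational parameters q(z | document)."""
    return bow @ W_mu, bow @ W_logvar


def reparameterize(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z):
    """Per-word probabilities p(w | z) via topic proportions and word vectors."""
    theta = softmax(z)                       # document-topic proportions
    beta = softmax(topic_emb @ word_vecs.T)  # topic-word distributions
    return theta @ beta


def elbo(bow):
    """Single-sample ELBO estimate: reconstruction minus KL to N(0, I)."""
    mu, logvar = encode(bow)
    z = reparameterize(mu, logvar)
    rec = np.sum(bow * np.log(decode(z) + 1e-10))
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return rec - kl


bow = rng.integers(0, 3, size=V).astype(float)  # toy microtext bag-of-words
score = elbo(bow)  # in training, gradients of this estimate update all weights
```

In a real implementation the weights would be optimized by stochastic gradient ascent on the ELBO (e.g. in PyTorch or TensorFlow); the sketch only shows the forward pass that makes black-box inference possible.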

Bibliographic details

  • Source
    Expert Systems | 2021, Issue 2 | e12639.1-e12639.27 | 27 pages
  • Authors

    Trinh Trung; Quan Tho; Mai Trung;

  • Affiliations

    Ho Chi Minh City Univ Technol Fac Comp Sci & Engn Ho Chi Minh Vietnam;

    Ho Chi Minh City Univ Technol Fac Comp Sci & Engn Ho Chi Minh Vietnam;

    Ho Chi Minh City Univ Technol Fac Comp Sci & Engn Ho Chi Minh Vietnam;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • Keywords

    microtext; neural network; topic modelling; variational autoencoder; word embedding;


