IEEE International Conference on Machine Learning and Applications

Unsupervised Topic Model Based Text Network Construction for Learning Word Embeddings



Abstract

Distributed word embeddings have proven remarkably effective at capturing word-level semantic and syntactic regularities in language for many natural language processing tasks. One recently proposed semi-supervised representation learning method, Predictive Text Embedding (PTE), uses both semantically labeled and unlabeled data in information networks to learn text embeddings, achieving state-of-the-art performance compared to other embedding methods. However, PTE relies on supervised label information to construct one of its networks, and many other possible ways of constructing such information networks remain untested. We present two unsupervised methods for constructing a large-scale semantic information network from documents using topic models, which have emerged as a powerful technique for finding useful structure in unstructured text collections by learning distributions over words. The first method uses Latent Dirichlet Allocation (LDA) to build a topic model over the text and constructs a word-topic network whose edge weights are proportional to the word-topic probability distributions. The second method trains an unsupervised neural network to learn the word-document distribution, with a single hidden layer representing a topic distribution. The two weight matrices of the neural network are reinterpreted directly as the edge weights of heterogeneous text networks, which can then be used to train word embeddings that form an effective low-dimensional representation preserving the semantic closeness of words and documents for NLP tasks. We conduct extensive experiments to evaluate the effectiveness of our methods.


