
Probabilistic Approach for Embedding Arbitrary Features of Text


Abstract

Topic modeling is usually used to model words in documents by probabilistic mixtures of topics. We generalize this setup and consider arbitrary features of the positions in a corpus, e.g. 'contains a word', 'belongs to a sentence', 'has a word in the local context', 'is labeled with a POS-tag', etc. We build sparse probabilistic embeddings for positions and derive embeddings for the features by averaging those. Importantly, we interpret the EM-algorithm as an iterative process of intersection and averaging steps that re-estimate position and feature embeddings, respectively. This approach yields several insights. First, we argue that a sentence should not be represented as the average of its words: while each word is a mixture of multiple senses, each word occurrence typically refers to just one specific sense. So in our approach, we obtain sentence embeddings by averaging the position embeddings from the E-step. Second, we show that the Biterm Topic Model (Yan et al.) and the Word Network Topic Model (Zuo et al.) are equivalent, differing only in whether word and context embeddings are tied. We further extend these models by adjusting the representation of each sliding window with a few iterations of the EM-algorithm. Finally, we aim at consistent embeddings for hierarchical entities, e.g. for the word-sentence-document structure. We discuss two alternative training schemes and generalize to the case where the middle level of the hierarchy is unknown, which provides a unified formulation for topic segmentation and word sense disambiguation tasks.
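To make the intersection/averaging reading of EM concrete, here is a minimal sketch, assuming a toy corpus with hypothetical feature names; it is not the authors' implementation, and the exact normalization and initialization are assumptions:

```python
# Illustrative sketch of the EM loop described in the abstract:
# E-step ("intersection"): a position's embedding is the normalized
#   element-wise product of its features' topic distributions.
# M-step ("averaging"): a feature's embedding is the average of the
#   embeddings of the positions that carry it.
import numpy as np

rng = np.random.default_rng(0)
n_topics = 4

# Each position is described by arbitrary features, e.g. its word,
# its sentence id, and a word from its local context (all hypothetical).
positions = [
    {"word:cat", "sent:0", "ctx:sat"},
    {"word:sat", "sent:0", "ctx:cat"},
    {"word:dog", "sent:1", "ctx:ran"},
    {"word:ran", "sent:1", "ctx:dog"},
]
features = sorted(set().union(*positions))

# Random non-negative initialization, normalized to probability vectors.
phi = {f: rng.random(n_topics) for f in features}
for f in features:
    phi[f] /= phi[f].sum()

for _ in range(20):
    # E-step: intersect the topic distributions of each position's features.
    theta = []
    for feats in positions:
        p = np.ones(n_topics)
        for f in feats:
            p *= phi[f]
        theta.append(p / p.sum())

    # M-step: average position embeddings over each feature's occurrences.
    for f in features:
        occ = [theta[i] for i, feats in enumerate(positions) if f in feats]
        phi[f] = np.mean(occ, axis=0)

# Per the abstract, a sentence embedding is the average of the E-step
# position embeddings, not the average of its words' feature embeddings.
sent0 = np.mean([theta[i] for i, fs in enumerate(positions) if "sent:0" in fs], axis=0)
print(sent0)
```

The last two lines mirror the paper's first insight: because each occurrence has already been disambiguated by the E-step intersection, averaging position embeddings avoids mixing in senses that the sentence never uses.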