
Probabilistic Approach for Embedding Arbitrary Features of Text


Abstract

Topic modeling is usually used to model words in documents by probabilistic mixtures of topics. We generalize this setup and consider arbitrary features of the positions in a corpus, e.g. 'contains a word', 'belongs to a sentence', 'has a word in the local context', 'is labeled with a POS-tag', etc. We build sparse probabilistic embeddings for positions and derive embeddings for the features by averaging those. Importantly, we interpret the EM-algorithm as an iterative process of intersection and averaging steps that re-estimate position and feature embeddings, respectively. This approach yields several insights. First, we argue that a sentence should not be represented as the average of its words: while each word is a mixture of multiple senses, each word occurrence typically refers to just one specific sense. So in our approach, we obtain sentence embeddings by averaging the position embeddings from the E-step. Second, we show that the Biterm Topic Model (Yan et al.) and the Word Network Topic Model (Zuo et al.) are equivalent, differing only in whether word and context embeddings are tied. We further extend these models by adjusting the representation of each sliding window with a few iterations of the EM-algorithm. Finally, we aim at consistent embeddings for hierarchical entities, e.g. for the word-sentence-document structure. We discuss two alternative training schemes and generalize to the case where the middle level of the hierarchy is unknown, which provides a unified formulation for topic segmentation and word sense disambiguation tasks.
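To make the intersection/averaging reading of EM concrete, here is a minimal sketch, assuming a toy corpus with hypothetical feature names; it is not the authors' implementation, and the exact normalization and initialization are assumptions:

```python
# Illustrative sketch of the EM loop described in the abstract:
# E-step ("intersection"): a position's embedding is the normalized
#   element-wise product of its features' topic distributions.
# M-step ("averaging"): a feature's embedding is the average of the
#   embeddings of the positions that carry it.
import numpy as np

rng = np.random.default_rng(0)
n_topics = 4

# Each position is described by arbitrary features, e.g. its word,
# its sentence id, and a word from its local context (all hypothetical).
positions = [
    {"word:cat", "sent:0", "ctx:sat"},
    {"word:sat", "sent:0", "ctx:cat"},
    {"word:dog", "sent:1", "ctx:ran"},
    {"word:ran", "sent:1", "ctx:dog"},
]
features = sorted(set().union(*positions))

# Random non-negative initialization, normalized to probability vectors.
phi = {f: rng.random(n_topics) for f in features}
for f in features:
    phi[f] /= phi[f].sum()

for _ in range(20):
    # E-step: intersect the topic distributions of each position's features.
    theta = []
    for feats in positions:
        p = np.ones(n_topics)
        for f in feats:
            p *= phi[f]
        theta.append(p / p.sum())

    # M-step: average position embeddings over each feature's occurrences.
    for f in features:
        occ = [theta[i] for i, feats in enumerate(positions) if f in feats]
        phi[f] = np.mean(occ, axis=0)

# Per the abstract, a sentence embedding is the average of the E-step
# position embeddings, not the average of its words' feature embeddings.
sent0 = np.mean([theta[i] for i, fs in enumerate(positions) if "sent:0" in fs], axis=0)
print(sent0)
```

The last two lines mirror the paper's first insight: because each occurrence has already been disambiguated by the E-step intersection, averaging position embeddings avoids mixing in senses that the sentence never uses.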