Computer Speech and Language

Unsupervised sentence representations as word information series: Revisiting TF-IDF


Abstract

Representing sentences at the semantic level is a challenging task for natural language processing and artificial intelligence. Despite advances in word embeddings (i.e., word vector representations), capturing sentence meaning remains an open question because of the complexity of semantic interactions among words. In this paper, we present an unsupervised embedding method that learns sentence representations from unlabeled text by modeling a sentence as a weighted series of word embeddings. The weights of the series are fitted using Shannon's Mutual Information (MI) among words, sentences, and the corpus; the Term Frequency-Inverse Document Frequency (TF-IDF) transform is a reliable estimate of this MI. Our method offers advantages over existing ones: identifiable modules, short training, online inference of representations for unseen sentences, and independence from domain, external knowledge, and linguistic annotation resources. Results showed that our model, despite its concreteness and low computational cost, was competitive with the state of the art in well-known Semantic Textual Similarity (STS) tasks. (C) 2019 Elsevier Ltd. All rights reserved.
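The core idea in the abstract, a sentence vector built as a TF-IDF-weighted series of word embeddings, can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything below is an assumption for illustration: the smoothed IDF form, the relative-frequency TF, the use of a plain weighted sum, and the toy two-dimensional `embeddings` table (a real setup would load pretrained word vectors). The function names `idf_table`, `sentence_vector`, and `cosine` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed IDF per word, treating each tokenized sentence as a document.

    Assumed form: ln((1 + N) / (1 + df)) + 1, one common smoothing choice.
    """
    n = len(corpus)
    df = Counter(word for sentence in corpus for word in set(sentence))
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}


def sentence_vector(sentence, idf, embeddings, dim):
    """TF-IDF-weighted sum of word vectors; words without a weight or a vector are skipped."""
    tf = Counter(sentence)
    vec = [0.0] * dim
    for word, count in tf.items():
        if word not in idf or word not in embeddings:
            continue
        weight = (count / len(sentence)) * idf[word]
        for i, x in enumerate(embeddings[word]):
            vec[i] += weight * x
    return vec


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy corpus and hand-made 2-d embeddings (illustrative values only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "stock", "fell"],
]
embeddings = {
    "the": [1.0, 1.0],
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "sat": [0.5, 0.5],
    "stock": [0.0, 1.0],
    "fell": [0.1, 0.9],
}

idf = idf_table(corpus)
vectors = [sentence_vector(s, idf, embeddings, dim=2) for s in corpus]
sim_cat_dog = cosine(vectors[0], vectors[1])
sim_cat_stock = cosine(vectors[0], vectors[2])
print(sim_cat_dog, sim_cat_stock)
```

In this toy setting the frequent word "the" receives a lower weight than rarer words such as "cat", so content words dominate the sentence vector. Because the IDF table is computed once from the corpus, a vector for an unseen sentence can be produced with a single pass over its words, which is one way to read the "online inference" property the abstract mentions.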
