AAAI Conference on Artificial Intelligence

Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding


Abstract

Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations - a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) - a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: given a cross-attentive teacher model (e.g., a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMo variants and other sentence embedding methods, while accelerating the computation of query-candidate sentence-pair similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.
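
The abstract gives the DSE recipe only at a high level: encode each sentence independently into a vector, combine the two vectors with simple elementary operations, and train the student to reconstruct the teacher's sentence-pair scores. The following minimal PyTorch sketch illustrates that objective. The encoder, the concatenation-based feature combination, the MSE reconstruction loss, and all names (DSEStudent, MeanEmbedEncoder, distillation_step) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSEStudent(nn.Module):
    """Sentence-embedding student: encodes A and B separately, then scores
    the pair from simple elementwise combinations of the two vectors."""

    def __init__(self, encoder: nn.Module, dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder  # maps token ids to (batch, dim) sentence vectors
        self.scorer = nn.Linear(4 * dim, num_labels)

    def forward(self, sent_a: torch.Tensor, sent_b: torch.Tensor) -> torch.Tensor:
        u = self.encoder(sent_a)                           # (batch, dim)
        v = self.encoder(sent_b)                           # (batch, dim)
        feats = torch.cat([u, v, u * v, (u - v).abs()], dim=-1)
        return self.scorer(feats)                          # (batch, num_labels)

class MeanEmbedEncoder(nn.Module):
    """Toy bag-of-embeddings encoder, purely for a runnable example."""

    def __init__(self, vocab_size: int = 30522, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.emb(token_ids).mean(dim=1)             # (batch, dim)

def distillation_step(student, teacher_logits, sent_a, sent_b, optimizer):
    """One training step: reconstruct the teacher's sentence-pair scores
    (MSE is one plausible reconstruction loss; the abstract does not fix it)."""
    student_logits = student(sent_a, sent_b)
    loss = F.mse_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: teacher_logits would come from a fine-tuned cross-attentive
# model (e.g. BERT) run on the same sentence pairs.
student = DSEStudent(MeanEmbedEncoder(), dim=128, num_labels=2)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
sent_a = torch.randint(0, 30522, (8, 16))
sent_b = torch.randint(0, 30522, (8, 16))
teacher_logits = torch.randn(8, 2)
distillation_step(student, teacher_logits, sent_a, sent_b, optimizer)
```

This separation is what yields the reported speedup: the candidate sentence vectors can be computed once and cached, so scoring a new query requires only a single encoder pass plus the cheap elementwise operations above, instead of one cross-attention forward pass per query-candidate pair.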
