AAAI Conference on Artificial Intelligence

Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding


Abstract

Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations - a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) - a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: given a cross-attentive teacher model (e.g., a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMo variants and other sentence embedding methods, while accelerating the computation of query-candidate sentence-pair similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.
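
The abstract gives the DSE recipe only at a high level: encode each sentence independently into a vector, combine the two vectors with simple elementary operations, and train the student to reconstruct the teacher's sentence-pair scores. The following minimal PyTorch sketch illustrates that objective. The encoder, the concatenation-based feature combination, the MSE reconstruction loss, and all names (DSEStudent, MeanEmbedEncoder, distillation_step) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSEStudent(nn.Module):
    """Sentence-embedding student: encodes A and B separately, then scores
    the pair from simple elementwise combinations of the two vectors."""

    def __init__(self, encoder: nn.Module, dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder  # maps token ids to (batch, dim) sentence vectors
        self.scorer = nn.Linear(4 * dim, num_labels)

    def forward(self, sent_a: torch.Tensor, sent_b: torch.Tensor) -> torch.Tensor:
        u = self.encoder(sent_a)                           # (batch, dim)
        v = self.encoder(sent_b)                           # (batch, dim)
        feats = torch.cat([u, v, u * v, (u - v).abs()], dim=-1)
        return self.scorer(feats)                          # (batch, num_labels)

class MeanEmbedEncoder(nn.Module):
    """Toy bag-of-embeddings encoder, purely for a runnable example."""

    def __init__(self, vocab_size: int = 30522, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.emb(token_ids).mean(dim=1)             # (batch, dim)

def distillation_step(student, teacher_logits, sent_a, sent_b, optimizer):
    """One training step: reconstruct the teacher's sentence-pair scores
    (MSE is one plausible reconstruction loss; the abstract does not fix it)."""
    student_logits = student(sent_a, sent_b)
    loss = F.mse_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: teacher_logits would come from a fine-tuned cross-attentive
# model (e.g. BERT) run on the same sentence pairs.
student = DSEStudent(MeanEmbedEncoder(), dim=128, num_labels=2)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
sent_a = torch.randint(0, 30522, (8, 16))
sent_b = torch.randint(0, 30522, (8, 16))
teacher_logits = torch.randn(8, 2)
distillation_step(student, teacher_logits, sent_a, sent_b, optimizer)
```

This separation is what yields the reported speedup: the candidate sentence vectors can be computed once and cached, so scoring a new query requires only a single encoder pass plus the cheap elementwise operations above, instead of one cross-attention forward pass per query-candidate pair.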
