
Spatial attention model for image caption generation


Abstract

A method of automatic image captioning, the method including:

  • mixing results of an image encoder and a language decoder to emit a sequence of caption words for an input image, with the mixing governed by a gate probability mass determined from a visual sentinel vector of the language decoder and a current hidden state vector of the language decoder;
  • determining the results of the image encoder by processing the image through the image encoder to produce image feature vectors for regions of the image and computing a global image feature vector from the image feature vectors;
  • determining the results of the language decoder by processing words through the language decoder, including beginning at an initial timestep with a start-of-caption token and the global image feature vector, continuing in successive timesteps using a most recently emitted caption word and the global image feature vector as input to the language decoder, and at each timestep, generating a visual sentinel vector that combines the most recently emitted caption word, the global image feature vector, a previous hidden state vector of the language decoder, and memory contents of the language decoder;
  • at each timestep, using at least a current hidden state vector of the language decoder to determine unnormalized attention values for the image feature vectors and an unnormalized gate value for the visual sentinel vector;
  • concatenating the unnormalized attention values and the unnormalized gate value and exponentially normalizing the concatenated attention and gate values to produce a vector of attention probability masses and the gate probability mass;
  • applying the attention probability masses to the image feature vectors to accumulate in an image context vector a weighted sum of the image feature vectors;
  • determining an adaptive context vector as a mix of the image context vector and the visual sentinel vector according to the gate probability mass;
  • submitting the adaptive context vector and the current hidden state of the language decoder to a feed-forward neural network and causing the feed-forward neural network to emit a next caption word; and
  • repeating the processing of words through the language decoder, the using, the concatenating, the applying, the determining, and the submitting until the next caption word emitted is an end-of-caption token.
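The per-timestep attention and gating arithmetic in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the dimension names and the simple additive scoring parameters (`W_v`, `W_s`, `W_h`, `w_a`) are assumptions introduced only to show how the attention values and the gate value are normalized together and used to mix the image context with the visual sentinel.

```python
import numpy as np

def softmax(x):
    """Exponentially normalize a vector to probability masses."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attention_step(V, s_t, h_t, W_v, W_s, W_h, w_a):
    """One timestep of the adaptive attention described in the abstract.

    V   : (k, d) image feature vectors, one per image region
    s_t : (d,)   visual sentinel vector of the language decoder
    h_t : (d,)   current hidden state vector of the language decoder
    The W_* and w_a projections are illustrative scoring parameters,
    not taken from the patent.
    """
    # Unnormalized attention values for the k image feature vectors.
    z = np.tanh(V @ W_v + h_t @ W_h) @ w_a           # shape (k,)
    # Unnormalized gate value for the visual sentinel vector.
    g = np.tanh(s_t @ W_s + h_t @ W_h) @ w_a         # scalar
    # Concatenate and exponentially normalize together: the first k
    # entries are the attention probability masses, the last entry is
    # the gate probability mass.
    probs = softmax(np.concatenate([z, [g]]))
    attn, beta = probs[:-1], probs[-1]
    # Image context vector: weighted sum of the image feature vectors.
    c_t = attn @ V                                   # shape (d,)
    # Adaptive context vector: mix of image context and sentinel,
    # governed by the gate probability mass.
    c_hat = beta * s_t + (1.0 - beta) * c_t
    return c_hat, attn, beta
```

Because the gate value is normalized jointly with the attention values, the attention probability masses and the gate probability mass sum to one, so the gate directly trades off how much the emitted word relies on the image regions versus the decoder's own state.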
