Spatial attention model for image caption generation
Abstract
A method of automatic image captioning, the method including:

- mixing results of an image encoder and a language decoder to emit a sequence of caption words for an input image, with the mixing governed by a gate probability mass determined from a visual sentinel vector of the language decoder and a current hidden state vector of the language decoder;
- determining the results of the image encoder by processing the image through the image encoder to produce image feature vectors for regions of the image and computing a global image feature vector from the image feature vectors;
- determining the results of the language decoder by processing words through the language decoder, including beginning at an initial timestep with a start-of-caption token and the global image feature vector, continuing in successive timesteps using a most recently emitted caption word and the global image feature vector as input to the language decoder, and, at each timestep, generating a visual sentinel vector that combines the most recently emitted caption word, the global image feature vector, a previous hidden state vector of the language decoder, and memory contents of the language decoder;
- at each timestep, using at least a current hidden state vector of the language decoder to determine unnormalized attention values for the image feature vectors and an unnormalized gate value for the visual sentinel vector;
- concatenating the unnormalized attention values and the unnormalized gate value and exponentially normalizing the concatenated attention and gate values to produce a vector of attention probability masses and the gate probability mass;
- applying the attention probability masses to the image feature vectors to accumulate in an image context vector a weighted sum of the image feature vectors;
- determining an adaptive context vector as a mix of the image context vector and the visual sentinel vector according to the gate probability mass;
- submitting the adaptive context vector and the current hidden state of the language decoder to a feed-forward neural network and causing the feed-forward neural network to emit a next caption word; and
- repeating the processing of words through the language decoder, the using, the concatenating, the applying, the determining, and the submitting until the next caption word emitted is an end-of-caption token.
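The core of the claimed method is the single normalization over image-region attention values plus the sentinel gate value, followed by the gated mix of the image context vector and the visual sentinel. The sketch below illustrates one such timestep in NumPy. It is an illustrative reading of the claim, not the patented implementation: the projection matrices `W_v`, `W_s`, `W_h` and the scoring vector `w_a` are hypothetical parameters introduced here, and the tanh-based scoring is one common choice for producing the unnormalized values the claim refers to.

```python
import numpy as np

def softmax(x):
    """Exponential normalization of a vector into probability masses."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_attention_step(image_feats, sentinel, hidden, W_v, W_s, W_h, w_a):
    """One timestep of sentinel-gated adaptive attention (illustrative).

    image_feats : (k, d) image feature vectors for k regions
    sentinel    : (d,)   visual sentinel vector from the decoder
    hidden      : (d,)   current decoder hidden state vector
    W_v, W_s, W_h : (d, d) hypothetical projection matrices
    w_a         : (d,)   hypothetical attention scoring vector
    """
    # Unnormalized attention values for the image feature vectors.
    z = np.tanh(image_feats @ W_v + hidden @ W_h) @ w_a      # shape (k,)
    # Unnormalized gate value for the visual sentinel vector.
    g = np.tanh(sentinel @ W_s + hidden @ W_h) @ w_a          # scalar
    # Concatenate and exponentially normalize together, yielding the
    # attention probability masses and the gate probability mass.
    probs = softmax(np.append(z, g))                           # shape (k+1,)
    alpha, beta = probs[:-1], probs[-1]
    # Image context vector: weighted sum of the image feature vectors.
    c = alpha @ image_feats                                    # shape (d,)
    # Adaptive context vector: mix of image context and sentinel,
    # governed by the gate probability mass.
    c_hat = beta * sentinel + (1.0 - beta) * c
    return c_hat, alpha, beta
```

Because the gate value is normalized jointly with the region scores, the gate probability mass and the attention probability masses always sum to one, so the decoder trades attention to the image against reliance on the sentinel (i.e., on its own language-model state) at every timestep.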