...
IEEE Transactions on Multimedia

Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning


Abstract

Image captioning is one of the most challenging tasks in AI because it requires an understanding of both complex visuals and natural language. Because image captioning is essentially a sequential prediction task, recent advances have used reinforcement learning (RL) to better explore the dynamics of word-by-word generation. However, existing RL-based image captioning methods rely primarily on a single policy network and a single reward function, an approach that is not well matched to the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To solve this problem, we propose a novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization. Specifically, the proposed framework includes two modules: 1) a multi-level policy network that jointly updates the word- and sentence-level policies for word generation; and 2) a multi-level reward function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Furthermore, we propose a guidance term to bridge the policy and the reward for RL optimization. Extensive experiments on the MSCOCO and Flickr30k datasets, together with the accompanying analyses, show that the proposed framework achieves competitive performance on a variety of evaluation metrics. In addition, we conduct ablation studies on multiple variants of the proposed framework and explore several representative image captioning models and language metrics for the word-level policy network and the language-language reward function, respectively, to evaluate the generalization ability of the proposed framework.
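To make the training loop described above more concrete, the sketch below is a minimal, hedged illustration of the general idea: a word-level policy samples a caption and is updated with a REINFORCE-style gradient whose reward sums a language-language term and a vision-language term. It deliberately omits the sentence-level policy and the guidance term, and every module size, token convention, and reward function here (WordPolicy, language_reward, vision_language_reward) is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch: word-level policy trained with a combined
# vision-language + language-language reward. All names, sizes,
# and reward definitions are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, MAX_LEN = 1000, 64, 128, 16

class WordPolicy(nn.Module):
    """Word-level policy: an LSTM decoder conditioned on image features."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTMCell(EMB + HID, HID)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, img_feat):
        B = img_feat.size(0)
        h = c = torch.zeros(B, HID)
        tok = torch.zeros(B, dtype=torch.long)   # assumed <BOS> id = 0
        logps, toks = [], []
        for _ in range(MAX_LEN):
            x = torch.cat([self.embed(tok), img_feat], dim=-1)
            h, c = self.lstm(x, (h, c))
            dist = torch.distributions.Categorical(logits=self.out(h))
            tok = dist.sample()                  # sample the next word
            logps.append(dist.log_prob(tok))
            toks.append(tok)
        return torch.stack(toks, 1), torch.stack(logps, 1)

def language_reward(sampled, reference):
    """Language-language reward stand-in (a real system would use e.g. CIDEr).
    Here: per-position token overlap, purely illustrative."""
    return (sampled == reference).float().mean(dim=1)

def vision_language_reward(img_feat, caption_feat):
    """Vision-language reward stand-in: cosine similarity between image
    features and a (stub) caption embedding."""
    return F.cosine_similarity(img_feat, caption_feat, dim=-1)

# --- one illustrative training step ---------------------------------------
policy = WordPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

img_feat = torch.randn(4, HID)                           # fake image features
reference = torch.randint(0, VOCAB, (4, MAX_LEN))        # fake reference caption

tokens, logps = policy(img_feat)
# Stub caption embedding obtained by projecting mean word embeddings.
caption_feat = policy.embed(tokens).mean(dim=1) @ torch.randn(EMB, HID)
reward = language_reward(tokens, reference) + vision_language_reward(img_feat, caption_feat)

baseline = reward.mean()                                 # simple batch-mean baseline
loss = -((reward - baseline).detach().unsqueeze(1) * logps).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's actual framework, the sentence-level policy and the guidance term would shape how the two reward levels steer the word-level updates; in this sketch a plain batch-mean baseline stands in for that machinery.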
