Object Counts! Bringing Explicit Detections Back into Image Captioning


Abstract

The use of explicit object detectors as an intermediate step to image captioning, which used to constitute an essential stage in early work, is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that the frequency, size and position of objects are complementary cues that all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.
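As a rough illustration of what encoding the frequency, size and position of detected objects might look like, the sketch below builds a fixed-length vector with one block per object category. It is not the authors' implementation: the `Detection` record, the `encode_detections` function, the choice of the largest box per category, and the normalisation by image size are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical detection record: a category label plus a pixel bounding box
# given by its top-left corner (x, y), width w and height h.
Detection = namedtuple("Detection", ["category", "x", "y", "w", "h"])


def encode_detections(detections, categories, img_w, img_h):
    """Return a flat list with a [count, area, centre_x, centre_y] block
    for each category, in the order given by `categories`."""
    features = []
    for cat in categories:
        boxes = [d for d in detections if d.category == cat]
        if not boxes:
            # Category not detected: all four cues are zero.
            features.extend([0.0, 0.0, 0.0, 0.0])
            continue
        # Assumption: size and position are taken from the largest instance.
        largest = max(boxes, key=lambda d: d.w * d.h)
        area = (largest.w * largest.h) / (img_w * img_h)  # fraction of image
        cx = (largest.x + largest.w / 2) / img_w          # normalised centre x
        cy = (largest.y + largest.h / 2) / img_h          # normalised centre y
        features.extend([float(len(boxes)), area, cx, cy])
    return features


# Example: two dogs and one person in a 200x200 image, no car.
dets = [
    Detection("dog", 0, 0, 50, 50),
    Detection("dog", 50, 50, 100, 100),
    Detection("person", 100, 0, 100, 200),
]
vec = encode_detections(dets, ["dog", "person", "car"], 200, 200)
```

A vector of this kind is interpretable in the sense the abstract describes: each entry can be traced back to a named category and a named cue, so individual cues or categories can be removed to see how a captioning model's output changes.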
