IEEE International Conference on Multimedia and Expo

What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval



Abstract

Cross-modal video-text retrieval has become an emerging task due to the rapid growth of user-generated videos on the Internet. Most existing approaches focus on extracting visual features from the video, while the audio track and on-screen captions, which contain rich information, are ignored. Recently, aggregating multi-modal features in videos has raised the benchmark for video-text retrieval. However, because these multi-modal features are high-dimensional and heterogeneous, their intrinsic structural relations have not received enough attention and are often overlooked in previous methods. To address this issue, we propose a novel Attentive and Relational Feature Aggregation Network (ARFAN). Specifically, we introduce a self-attention mechanism so that each video adaptively assigns higher weights to its representative modalities. Then, graph convolutional layers are inserted to capture the relations among the multi-modal features and combine them. Our method achieves 15% and 12.9% relative improvements in R@1 over the state-of-the-art methods on the MSR-VTT and MSVD datasets, respectively.
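The abstract names two aggregation steps: a self-attention mechanism that lets each video re-weight its modality features, and graph convolutional layers that model relations among those features before fusion. The PyTorch sketch below shows one plausible way to wire these steps together; the class name, attention head count, layer count, fully connected adjacency, and mean pooling are illustrative assumptions rather than the paper's published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveRelationalAggregator(nn.Module):
    """Sketch of the two aggregation steps described in the abstract:
    self-attention re-weights per-modality video features, then GCN-style
    layers relate them before pooling into a single video embedding."""

    def __init__(self, num_modalities: int, dim: int, num_gcn_layers: int = 2):
        super().__init__()
        # Self-attention lets each video adaptively emphasise its most
        # representative modalities (head count is an assumed value).
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        # Simple graph-convolution layers over the modality nodes; a fully
        # connected, normalised adjacency is an illustrative assumption.
        self.gcn_layers = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_gcn_layers)])
        self.register_buffer(
            "adj", torch.full((num_modalities, num_modalities), 1.0 / num_modalities))

    def forward(self, modal_feats: torch.Tensor) -> torch.Tensor:
        # modal_feats: (batch, num_modalities, dim), one vector per modality,
        # e.g. appearance, motion, audio, and on-screen caption features.
        attended, _ = self.attn(modal_feats, modal_feats, modal_feats)
        h = attended
        for layer in self.gcn_layers:
            # Message passing: mix features across modality nodes, then transform.
            h = F.relu(layer(self.adj @ h))
        # Pool the relational node features into one video-level embedding.
        return h.mean(dim=1)


# Toy usage: 4 modality embeddings of size 512 for a batch of 8 videos.
model = AttentiveRelationalAggregator(num_modalities=4, dim=512)
video_embedding = model(torch.randn(8, 4, 512))
print(video_embedding.shape)  # torch.Size([8, 512])
```

The resulting video embedding would then be matched against a text embedding in the shared retrieval space; that matching step is outside the scope of this sketch.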


