IEEE International Conference on Multimedia and Expo

What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval



Abstract

Cross-modal video-text retrieval has become an emerging task due to the rapid growth of user-generated videos on the Internet. Most existing approaches focus on extracting visual features from the video, while the audio track and on-screen captions, which contain rich information, are ignored. Recently, aggregating multi-modal features in videos has raised the benchmark for video-text retrieval. However, because these multi-modal features are high-dimensional and heterogeneous, their intrinsic structural relations have not received enough attention and are often overlooked in previous methods. To address this issue, we propose a novel Attentive and Relational Feature Aggregation Network (ARFAN). Specifically, we introduce a self-attention mechanism so that each video adaptively assigns higher weights to its representative modalities. Then, graph convolutional layers are inserted to capture the relations among the multi-modal features and combine them. Our method achieves 15% and 12.9% relative improvements in R@1 over the state-of-the-art methods on the MSR-VTT and MSVD datasets, respectively.
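The abstract names two aggregation steps: a self-attention mechanism that lets each video re-weight its modality features, and graph convolutional layers that model relations among those features before fusion. The PyTorch sketch below shows one plausible way to wire these steps together; the class name, attention head count, layer count, fully connected adjacency, and mean pooling are illustrative assumptions rather than the paper's published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveRelationalAggregator(nn.Module):
    """Sketch of the two aggregation steps described in the abstract:
    self-attention re-weights per-modality video features, then GCN-style
    layers relate them before pooling into a single video embedding."""

    def __init__(self, num_modalities: int, dim: int, num_gcn_layers: int = 2):
        super().__init__()
        # Self-attention lets each video adaptively emphasise its most
        # representative modalities (head count is an assumed value).
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        # Simple graph-convolution layers over the modality nodes; a fully
        # connected, normalised adjacency is an illustrative assumption.
        self.gcn_layers = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_gcn_layers)])
        self.register_buffer(
            "adj", torch.full((num_modalities, num_modalities), 1.0 / num_modalities))

    def forward(self, modal_feats: torch.Tensor) -> torch.Tensor:
        # modal_feats: (batch, num_modalities, dim), one vector per modality,
        # e.g. appearance, motion, audio, and on-screen caption features.
        attended, _ = self.attn(modal_feats, modal_feats, modal_feats)
        h = attended
        for layer in self.gcn_layers:
            # Message passing: mix features across modality nodes, then transform.
            h = F.relu(layer(self.adj @ h))
        # Pool the relational node features into one video-level embedding.
        return h.mean(dim=1)


# Toy usage: 4 modality embeddings of size 512 for a batch of 8 videos.
model = AttentiveRelationalAggregator(num_modalities=4, dim=512)
video_embedding = model(torch.randn(8, 4, 512))
print(video_embedding.shape)  # torch.Size([8, 512])
```

The resulting video embedding would then be matched against a text embedding in the shared retrieval space; that matching step is outside the scope of this sketch.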


