International Conference on Computational Science

2D-Convolution Based Feature Fusion for Cross-Modal Correlation Learning


Abstract

Cross-modal information retrieval (CMIR) enables users to search for semantically relevant data of various modalities given a query in one modality. The predominant challenge is to alleviate the "heterogeneous gap" between different modalities. For text-image retrieval, the typical solution is to project text features and image features into a common semantic space and measure cross-modal similarity there. However, semantically relevant data from different modalities usually carry imbalanced amounts of information, and aligning all modalities in the same space weakens modality-specific semantics and introduces unexpected noise. In this paper, we propose a novel CMIR framework based on multi-modal feature fusion. In this framework, cross-modal similarity is measured by directly analyzing the fine-grained correlations between text features and image features, without learning a common semantic space. Specifically, we first construct a cross-modal feature matrix that fuses the original visual and textual features. 2D-convolutional networks then reason about inner-group relationships among features across modalities, yielding fine-grained text-image representations. Cross-modal similarity is measured by a multi-layer perceptron applied to the fused feature representations. We conduct extensive experiments on two representative CMIR datasets, English Wikipedia and TVGraz. Experimental results indicate that our model significantly outperforms state-of-the-art methods. Moreover, the proposed cross-modal feature fusion approach is more effective on CMIR tasks than other feature fusion approaches.
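The abstract describes the pipeline only at a high level. The following is a minimal sketch of one plausible reading of it: both modalities are projected to a common length, stacked into a two-row cross-modal feature matrix, passed through 2D convolutions, and scored by an MLP. All layer sizes, the exact fusion layout, and the class and parameter names (ConvFusionCMIR, text_dim, img_dim, fused_dim) are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of 2D-convolution based feature fusion for CMIR.
# Dimensions and fusion layout are assumptions made for illustration.
import torch
import torch.nn as nn

class ConvFusionCMIR(nn.Module):
    def __init__(self, text_dim=300, img_dim=2048, fused_dim=256):
        super().__init__()
        # Project both modalities to a common length so they can be
        # stacked into a 2-row cross-modal feature matrix (assumed layout).
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.img_proj = nn.Linear(img_dim, fused_dim)
        # 2D convolutions reason over local groups of features across
        # the two modality rows.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(2, 3), padding=(0, 1)),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        # MLP scoring head (the "multi-layer perceptron" of the abstract).
        self.mlp = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, text_feat, img_feat):
        # text_feat: (B, text_dim), img_feat: (B, img_dim)
        t = self.text_proj(text_feat)          # (B, fused_dim)
        v = self.img_proj(img_feat)            # (B, fused_dim)
        matrix = torch.stack([t, v], dim=1)    # (B, 2, fused_dim)
        matrix = matrix.unsqueeze(1)           # (B, 1, 2, fused_dim)
        fused = self.conv(matrix).flatten(1)   # (B, 32)
        return self.mlp(fused).squeeze(-1)     # (B,) similarity scores

# Example: score a batch of text-image pairs.
# model = ConvFusionCMIR()
# scores = model(torch.randn(4, 300), torch.randn(4, 2048))
```

Under this reading, training would pair matched and mismatched text-image examples and optimize a ranking or binary relevance loss over the scores; retrieval then ranks candidates of the other modality by score for a given query.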
