Neurocomputing

MSCAN: Multimodal Self-and-Collaborative Attention Network for image aesthetic prediction tasks


Abstract

With the ever-expanding volume of visual images on the Internet, automatic image aesthetic prediction is becoming increasingly important in the computer vision field. Since image aesthetic assessment is a highly subjective and complex task, some researchers resort to user comments to aid aesthetic prediction. However, these methods achieve only limited success because 1) they rely heavily on convolution to extract visual features, which makes it difficult to capture the spatial interactions of visual elements in image composition; and 2) they treat image feature extraction and textual feature extraction as two distinct tasks and ignore the inter-relationships between the two modalities. We address these challenges by proposing a Multimodal Self-and-Collaborative Attention Network (MSCAN). More specifically, the self-attention module computes the response at a position by attending to all positions in the image, so it can effectively encode the spatial interactions of the visual elements. To model the complex relations between image and textual features, a co-attention module jointly performs textual-guided visual attention and visual-guided textual attention. The attended multimodal features are then aggregated and fed into a two-layer MLP to obtain the aesthetic values. Extensive experiments on two large benchmarks demonstrate that the proposed MSCAN outperforms the state of the art by a large margin on unified aesthetic prediction tasks. (c) 2020 Elsevier B.V. All rights reserved.
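A minimal PyTorch sketch of the pipeline the abstract describes: non-local self-attention over the visual feature map, affinity-based co-attention between visual and textual features, and a two-layer MLP head. This is not the authors' implementation; the layer sizes, the residual connection, the affinity-matrix co-attention variant, and concatenation as the aggregation step are all assumptions for illustration.

```python
# Sketch of MSCAN-style attention stages; sizes and fusion details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Non-local style self-attention: the response at each spatial
    position attends to all positions in the visual feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, HW, C/8)
        k = self.key(x).flatten(2)                         # (B, C/8, HW)
        v = self.value(x).flatten(2).transpose(1, 2)       # (B, HW, C)
        attn = F.softmax(q @ k / (C // 8) ** 0.5, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return x + out                                     # residual (assumed)

class CoAttention(nn.Module):
    """Textual-guided visual attention and visual-guided textual attention
    computed jointly from one shared affinity matrix (a common formulation)."""
    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, vis, txt):          # vis: (B, Nv, D), txt: (B, Nt, D)
        A = torch.tanh(self.affinity(txt) @ vis.transpose(1, 2))  # (B, Nt, Nv)
        vis_attn = F.softmax(A.max(dim=1).values, dim=-1)  # textual-guided
        txt_attn = F.softmax(A.max(dim=2).values, dim=-1)  # visual-guided
        vis_vec = (vis_attn.unsqueeze(-1) * vis).sum(1)    # (B, D)
        txt_vec = (txt_attn.unsqueeze(-1) * txt).sum(1)    # (B, D)
        return vis_vec, txt_vec

class AestheticHead(nn.Module):
    """Aggregate the attended multimodal features and regress aesthetic
    values with a two-layer MLP, as described in the abstract."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, vis_vec, txt_vec):
        return self.mlp(torch.cat([vis_vec, txt_vec], dim=-1))

if __name__ == "__main__":
    feat = torch.randn(2, 256, 7, 7)        # CNN feature map (assumed backbone)
    feat = SelfAttention2d(256)(feat)       # spatially attended visual features
    vis = feat.flatten(2).transpose(1, 2)   # (2, 49, 256) region features
    txt = torch.randn(2, 20, 256)           # comment-token embeddings (assumed)
    v, t = CoAttention(256)(vis, txt)
    score = AestheticHead(256)(v, t)        # (2, 1) predicted aesthetic values
    print(score.shape)
```

Concatenating the two attended vectors before the MLP is one simple aggregation choice; the paper's actual fusion step may differ.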