See the Sound, Hear the Pixels

Abstract

For every event occurring in the real world, a sound is most often associated with the corresponding visual scene. Humans possess an inherent ability to automatically map audio content to visual scenes, leading to an effortless and enhanced understanding of the underlying event. This triggers an interesting question: can this natural correspondence between video and audio, which has been only sparsely explored so far, be learned by a machine and modeled jointly to localize the sound source in a visual scene? In this paper, we propose a novel algorithm that addresses the problem of localizing the sound source in unconstrained videos using efficient fusion and attention mechanisms. Two novel blocks, namely the Audio Visual Fusion Block (AVFB) and the Segment-Wise Attention Block (SWAB), have been developed for this purpose. Quantitative and qualitative evaluations show that the same algorithm, with minor modifications, can serve the purpose of sound localization under three different types of learning: supervised, weakly supervised and unsupervised. A novel Audio Visual Triplet Gram Matrix Loss (AVTGML) has been proposed as a loss function to learn the localization in an unsupervised way. Our empirical evaluations demonstrate a significant increase in performance over the existing state-of-the-art methods, serving as a testimony to the superiority of our proposed approach.
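The abstract names the AVTGML but does not spell out its formulation. As a rough, hypothetical sketch of what a triplet loss computed over Gram matrices of audio and visual features could look like (the tensor shapes, function names, and margin value below are illustrative assumptions, not the paper's definition), consider the following PyTorch snippet:

    import torch
    import torch.nn.functional as F

    def gram_matrix(feats):
        # feats: (batch, channels, positions) -> (batch, channels, channels)
        # Channel-wise correlations, normalized by the number of positions.
        return feats @ feats.transpose(1, 2) / feats.size(-1)

    def av_triplet_gram_loss(audio_anchor, visual_pos, visual_neg, margin=1.0):
        # All inputs: (batch, channels, positions), e.g. visual features flattened
        # over spatial locations and audio features over time frames, after being
        # projected to a shared channel dimension (an assumption of this sketch).
        g_a = gram_matrix(audio_anchor)
        g_p = gram_matrix(visual_pos)
        g_n = gram_matrix(visual_neg)
        d_pos = (g_a - g_p).flatten(1).norm(dim=1)    # anchor vs. matching visual
        d_neg = (g_a - g_n).flatten(1).norm(dim=1)    # anchor vs. mismatched visual
        return F.relu(d_pos - d_neg + margin).mean()  # standard triplet hinge

    # Example: batch of 4, 128 shared channels, 49 visual positions / audio frames
    audio = torch.randn(4, 128, 49)
    vis_pos = torch.randn(4, 128, 49)
    vis_neg = torch.randn(4, 128, 49)
    loss = av_triplet_gram_loss(audio, vis_pos, vis_neg)

Because the Gram matrices have shape (channels, channels) regardless of how many positions each modality contributes, this kind of loss can compare audio and visual streams of different lengths; the actual AVTGML may differ in normalization, distance measure, or triplet construction.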
