IEEE Transactions on Multimedia

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification


Abstract

Videos are inherently multimodal. This paper studies the problem of exploiting the abundant multimodal clues for improved video classification performance. We introduce a novel hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information, as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion, and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with the aim of capturing the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two long short-term memory (LSTM) networks with the extracted appearance and motion features as inputs. Finally, we propose refining the prediction scores by leveraging contextual relationships among video semantics. The hybrid deep learning framework is able to exploit a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that: 1) LSTM networks that model sequences in an explicitly recurrent manner are highly complementary to the CNN models; 2) the feature fusion network that produces a fused representation through modeling feature relationships outperforms a large set of alternative fusion strategies; and 3) the semantic context of video classes can help further refine the predictions for improved performance. Experimental results on two challenging benchmarks, the UCF-101 and the Columbia Consumer Videos (CCV) datasets, provide strong quantitative evidence that our framework can produce promising results: 93.1% on the UCF-101 and 84.5% on the CCV, outperforming several competing methods by clear margins.
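The abstract describes a pipeline of modality-specific CNN features, a fusion network, parallel LSTM branches, and late score combination. The following is a minimal PyTorch sketch of that overall structure, not the authors' implementation: it assumes pre-extracted appearance, motion, and audio CNN features, and all layer names and dimensions (e.g. `app_dim`, `fused_dim`, `lstm_hidden`) are illustrative placeholders.

```python
# Sketch of a hybrid multimodal video classifier, assuming pre-extracted
# per-frame appearance/motion features and a clip-level audio feature.
import torch
import torch.nn as nn

class HybridVideoClassifier(nn.Module):
    def __init__(self, app_dim=2048, mot_dim=2048, aud_dim=128,
                 fused_dim=1024, lstm_hidden=512, num_classes=101):
        super().__init__()
        # Feature fusion network: maps the concatenated modality features
        # to a unified representation and then to class scores.
        self.fusion = nn.Sequential(
            nn.Linear(app_dim + mot_dim + aud_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, num_classes),
        )
        # Two LSTMs model long-range temporal dynamics over the appearance
        # and motion feature sequences, respectively.
        self.app_lstm = nn.LSTM(app_dim, lstm_hidden, batch_first=True)
        self.mot_lstm = nn.LSTM(mot_dim, lstm_hidden, batch_first=True)
        self.app_cls = nn.Linear(lstm_hidden, num_classes)
        self.mot_cls = nn.Linear(lstm_hidden, num_classes)

    def forward(self, app_seq, mot_seq, aud_feat):
        # app_seq, mot_seq: (batch, time, dim); aud_feat: (batch, dim).
        # Clip-level scores from the fused multimodal representation
        # (temporal average pooling used here for simplicity).
        clip_feat = torch.cat([app_seq.mean(1), mot_seq.mean(1), aud_feat], dim=1)
        fused_scores = self.fusion(clip_feat)
        # Sequence-level scores from the recurrent branches (last hidden state).
        _, (h_app, _) = self.app_lstm(app_seq)
        _, (h_mot, _) = self.mot_lstm(mot_seq)
        lstm_scores = self.app_cls(h_app[-1]) + self.mot_cls(h_mot[-1])
        # Late fusion of CNN-based and LSTM-based predictions; the paper's
        # semantic-context refinement step would further adjust these scores.
        return fused_scores + lstm_scores
```

In this sketch the CNN and LSTM branches are combined by simple score addition; the paper evaluates richer fusion strategies and adds a semantic-context refinement stage on top of the predicted scores.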
