Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Yu-Gang Jiang; Zuxuan Wu; Jinhui Tang; Zechao Li; Xiangyang Xue; Shih-Fu Chang

Abstract

Videos are inherently multimodal. This paper studies the problem of exploiting the abundant multimodal clues for improved video classification performance. We introduce a novel hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information, and long-range temporal dynamics. More specifically, we use three Convolutional Neural Networks (CNNs) operating on appearance, motion, and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation that aims to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two long short-term memory (LSTM) networks with the extracted appearance and motion features as inputs. Finally, we propose refining the prediction scores by leveraging contextual relationships among video semantics. The hybrid deep learning framework is thus able to exploit a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that: 1) LSTM networks that model sequences in an explicitly recurrent manner are highly complementary to the CNN models; 2) the feature fusion network that produces a fused representation by modeling feature relationships outperforms a large set of alternative fusion strategies; and 3) the semantic context of video classes can help further refine the predictions for improved performance. Experimental results on two challenging benchmarks, UCF-101 and the Columbia Consumer Videos (CCV), provide strong quantitative evidence that our framework can produce promising results: 93.1% on UCF-101 and 84.5% on CCV, outperforming several competing methods by clear margins.
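The abstract names three score sources (a feature fusion network and two LSTMs) plus a context-based refinement step, but it does not say how the scores are combined or refined. The following is a minimal sketch under stated assumptions, not the authors' method: it assumes a weighted average of per-stream softmax probabilities and a simple linear blend with a class-relationship matrix. The stream names, weights, `context` matrix, `alpha`, and all scores are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_scores, weights):
    """Weighted average of per-stream class probabilities.

    Assumption: simple late fusion; the abstract does not specify this rule.
    """
    n_classes = len(next(iter(stream_scores.values())))
    total_w = sum(weights[name] for name in stream_scores)
    fused = [0.0] * n_classes
    for name, scores in stream_scores.items():
        probs = softmax(scores)
        for c in range(n_classes):
            fused[c] += weights[name] * probs[c] / total_w
    return fused


def refine_with_context(probs, context, alpha=0.2):
    """Blend each class score with scores propagated from related classes.

    Assumption: context[c][k] is a non-negative, hypothetical measure of how
    strongly class k supports class c. The result is renormalized to sum to 1.
    """
    n = len(probs)
    propagated = [sum(context[c][k] * probs[k] for k in range(n)) for c in range(n)]
    mixed = [(1 - alpha) * probs[c] + alpha * propagated[c] for c in range(n)]
    total = sum(mixed)
    return [m / total for m in mixed]


# Hypothetical raw scores for three classes from three streams.
stream_scores = {
    "fusion_net": [2.0, 0.5, 0.1],
    "appearance_lstm": [1.5, 1.0, 0.2],
    "motion_lstm": [0.3, 1.8, 0.4],
}
weights = {"fusion_net": 0.5, "appearance_lstm": 0.25, "motion_lstm": 0.25}

# Hypothetical class-relationship matrix: classes 1 and 2 are mildly related.
context = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.3],
    [0.0, 0.3, 1.0],
]

fused = fuse_scores(stream_scores, weights)
refined = refine_with_context(fused, context)
print("fused:", [round(p, 3) for p in fused])
print("refined:", [round(p, 3) for p in refined])
```

With an identity `context` matrix the refinement leaves the fused probabilities unchanged, which makes it easy to check how much the class relationships contribute on their own.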

Bibliographic information

Source
Multimedia, IEEE Transactions on | 2018, No. 11 | pp. 3137-3147 | 11 pages
Authors
Yu-Gang Jiang; Zuxuan Wu; Jinhui Tang; Zechao Li; Xiangyang Xue; Shih-Fu Chang
Author affiliations

School of Computer Science, Fudan University, Shanghai, China;

School of Computer Science, Fudan University, Shanghai, China;

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China;

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China;

School of Computer Science, Fudan University, Shanghai, China;

Department of Electrical Engineering, Columbia University, New York City, NY, USA;

Original format: PDF
Language: English
Keywords
Semantics; Feature extraction; Hidden Markov models; Machine learning; Optical imaging; Context modeling; Three-dimensional displays
