AUDIO-VISUAL FUSION WITH CROSS-MODAL ATTENTION FOR VIDEO ACTION RECOGNITION
Abstract
An electronic device obtains video content that includes visual content and audio content. The visual content includes a plurality of visual segments, and the audio content includes a plurality of audio segments. A plurality of self-attended visual features are generated for the visual segments of the visual content, and a plurality of self-attended audio features are generated for the audio segments of the audio content. The self-attended visual features are fused with the self-attended audio features to generate a plurality of fused visual features, and the self-attended audio features are fused with the self-attended visual features to generate a plurality of fused audio features. The fused visual features and the fused audio features are combined, based on a respective weight associated with each of the fused visual features and fused audio features, to generate a cross-modal visual-audio feature. A video-level content label is determined based on the cross-modal visual-audio feature.
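The data flow described in the abstract can be illustrated with a minimal sketch. This is not the claimed implementation. It assumes scaled dot-product attention with identity projections (no learned query, key, or value matrices), visual and audio segment features of equal dimension, a hypothetical `score_vector` standing in for whatever learned mechanism produces the per-feature weights, and a hypothetical nearest-prototype `classify` step standing in for the label predictor. All function and parameter names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, context):
    """Scaled dot-product attention; `context` serves as both keys and values.

    Assumption: identity projections, so no learned parameters are involved.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in context])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def fuse(visual_segments, audio_segments, score_vector):
    """Return one cross-modal feature vector for the whole video."""
    # Self-attention within each modality.
    self_v = attend(visual_segments, visual_segments)
    self_a = attend(audio_segments, audio_segments)
    # Cross-modal attention in both directions.
    fused_v = attend(self_v, self_a)  # visual queries attend to audio
    fused_a = attend(self_a, self_v)  # audio queries attend to visual
    # One weight per fused feature; `score_vector` is a hypothetical
    # stand-in for a learned scoring function.
    fused = fused_v + fused_a
    weights = softmax([dot(f, score_vector) for f in fused])
    dim = len(fused[0])
    return [sum(w * f[j] for w, f in zip(weights, fused)) for j in range(dim)]


def classify(feature, class_prototypes):
    """Hypothetical label step: pick the prototype with the highest dot product."""
    return max(class_prototypes, key=lambda label: dot(feature, class_prototypes[label]))


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[1.0, 1.0], [0.5, 0.5], [0.0, 1.0]]
    feature = fuse(visual, audio, score_vector=[0.0, 0.0])
    print(classify(feature, {"run": [1.0, 0.0], "clap": [0.0, 1.0]}))
```

In a trained system the projections, the weighting function, and the classifier would all be learned; the sketch only shows the order of operations: self-attention per modality, bidirectional cross-modal fusion, weighted combination, then a video-level label.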