2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Early and late integration of audio features for automatic video description

Abstract

This paper presents our approach to improving video captioning by integrating audio and video features. Video captioning is the task of generating a textual description of the content of a video. State-of-the-art approaches to video captioning are based on sequence-to-sequence models, in which a single neural network accepts sequential images and audio data and outputs a sequence of words that best describes the input in natural language. The network thus learns to encode the video input into an intermediate semantic representation, which can be useful in applications such as multimedia indexing, automatic narration, and audio-visual question answering. In our prior work, we proposed an attention-based multimodal fusion mechanism that integrates image, motion, and audio features within the network. Here, we apply hypothesis-level integration based on minimum Bayes-risk (MBR) decoding to further improve caption quality, focusing on well-known evaluation metrics (BLEU and METEOR scores). Experiments with the YouTube2Text and MSR-VTT datasets demonstrate that combinations of early and late integration of multimodal features significantly improve the audio-visual semantic representation, as measured by the resulting caption quality. In addition, we compare the performance of our method with two different types of audio features: MFCCs, and features extracted with SoundNet, which was trained to recognize objects and scenes in videos using only the audio signal.
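
The early-integration half of the approach, the attention-based multimodal fusion from the authors' prior work, can be pictured as two stacked attention stages: temporal attention within each modality, followed by a modality-level attention that weights the resulting per-modality context vectors. The sketch below is a minimal illustration at a single decoder step; all shapes, weight matrices, and names are hypothetical assumptions, not the paper's exact parameterization.

```python
# Minimal numpy sketch of attention-based multimodal (early) fusion at one
# decoder step. Shapes and weight matrices are illustrative, not the
# paper's exact architecture.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(features, query, W):
    """Weight the T feature vectors of one modality by relevance to the
    decoder state `query`; return a single context vector."""
    scores = features @ W @ query          # (T,)
    alphas = softmax(scores)
    return alphas @ features               # (D,)

d_state, d_feat = 64, 128
decoder_state = rng.standard_normal(d_state)

# Per-modality feature sequences (e.g. image, motion, audio), each (T, D);
# the audio stream could carry MFCC or SoundNet features.
modalities = {
    "image":  rng.standard_normal((20, d_feat)),
    "motion": rng.standard_normal((20, d_feat)),
    "audio":  rng.standard_normal((50, d_feat)),
}
W_att = rng.standard_normal((d_feat, d_state)) * 0.01
W_mod = rng.standard_normal((d_feat, d_state)) * 0.01

# 1) Temporal attention inside each modality.
contexts = np.stack([temporal_attention(f, decoder_state, W_att)
                     for f in modalities.values()])         # (3, D)

# 2) Modality attention: per decoder step, decide how much each modality's
#    context contributes to the fused representation.
betas = softmax(contexts @ W_mod @ decoder_state)            # (3,)
fused = betas @ contexts                                     # (D,)
print("modality weights:", dict(zip(modalities, betas.round(3))))
```

Because the modality weights are recomputed at every decoder step, the network can, for example, lean on the audio stream for words the visual stream does not support.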
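The late-integration step, hypothesis-level combination via minimum Bayes-risk decoding, can likewise be sketched compactly: pool the n-best captions produced by the different systems and select the one with the highest posterior-weighted expected gain against the rest. In the sketch below, a smoothed n-gram overlap stands in for the gain function (the abstract does not specify the exact gain; sentence-level BLEU would be a natural choice given the evaluation metrics), and all names and scores are illustrative.

```python
# Minimal sketch of hypothesis-level (late) integration via minimum
# Bayes-risk (MBR) decoding. All identifiers are hypothetical; the
# n-gram overlap below is a stand-in for a BLEU-like gain function.

import math
from collections import Counter
from typing import List, Tuple

def ngram_overlap(hyp: List[str], ref: List[str], max_n: int = 4) -> float:
    """Crude BLEU-like gain: geometric mean of smoothed n-gram precisions."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())   # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))      # add-one smoothing
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

def mbr_select(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Pick the hypothesis with the highest expected gain against the
    posterior-weighted pool of all hypotheses (minimum expected risk)."""
    # Normalize the models' log-probabilities into posteriors (softmax).
    max_lp = max(lp for _, lp in nbest)
    weights = [math.exp(lp - max_lp) for _, lp in nbest]
    z = sum(weights)
    posteriors = [w / z for w in weights]

    best_hyp, best_gain = None, -1.0
    for hyp, _ in nbest:
        expected_gain = sum(p * ngram_overlap(hyp, ref)
                            for (ref, _), p in zip(nbest, posteriors))
        if expected_gain > best_gain:
            best_hyp, best_gain = hyp, expected_gain
    return best_hyp

# Pooled n-best captions (token list, model log-probability), e.g. from
# systems trained on different feature combinations:
nbest = [
    ("a man is playing a guitar".split(), -4.1),
    ("a man plays a guitar".split(), -4.5),
    ("a person is cooking food".split(), -6.0),
]
print(" ".join(mbr_select(nbest)))
```

The effect is consensus reranking: a caption that agrees with many other high-probability hypotheses, possibly produced by systems using different feature sets, is preferred over an isolated outlier, which is how combining n-best lists from early-fused systems can further raise caption quality.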
