Conference: IEEE International Conference on Acoustics, Speech and Signal Processing

Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

Abstract

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, along with whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain from audio-visual modeling and temporal integration over multiple frames.
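
As a reading aid, the labeling scheme described in the abstract can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class names, field names, and label names are hypothetical and are not taken from the released dataset files; the abstract states only that each face instance in a track is labeled as speaking or not, and whether the speech is audible.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SpeakingLabel(Enum):
    # Hypothetical label names derived from the abstract's description,
    # not from the dataset's actual annotation files.
    NOT_SPEAKING = 0
    SPEAKING_AUDIBLE = 1
    SPEAKING_NOT_AUDIBLE = 2


@dataclass
class FaceInstance:
    timestamp_s: float    # time of the labeled frame within the video
    label: SpeakingLabel  # per-frame speaking state for this face


@dataclass
class FaceTrack:
    video_id: str         # hypothetical identifier for the source video
    track_id: str         # hypothetical identifier for the face track
    instances: List[FaceInstance]


def speaking_fraction(track: FaceTrack) -> float:
    """Fraction of labeled frames in which the face is speaking (audible or not)."""
    if not track.instances:
        return 0.0
    speaking = sum(
        1 for inst in track.instances
        if inst.label is not SpeakingLabel.NOT_SPEAKING
    )
    return speaking / len(track.instances)


def mean_labeled_frames_per_second(total_frames: float, total_hours: float) -> float:
    """Average labeling density implied by a frame count and a duration in hours."""
    return total_frames / (total_hours * 3600.0)
```

Applied to the figures quoted in the abstract, `mean_labeled_frames_per_second(3.65e6, 38.5)` works out to roughly 26 labeled frames per second, assuming the 38.5 hours refers to the duration of the labeled footage.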