IEEE Workshop on Spoken Language Technology

Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view



Abstract

Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, allowing speech overlap to also be considered. Two sensory setups are employed, in which depth video captures either a frontal or a profile view of the subjects and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is also considered, using audio and planar video from a sensor at the complementary view setup. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. The classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth view setup, reducing speech activity detection and speaker diarization errors over systems that ignore it.
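The pipeline the abstract describes, per-subject speech activity classification over fused audio, planar-video, and depth features, with the two subjects' decisions combined into diarization labels, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the feature layout, the linear scorer standing in for the paper's trained SVMs, and all function names and weights are hypothetical, not the authors' implementation.

```python
# Illustrative sketch only: a linear scorer stands in for the trained SVMs,
# and all feature values and weights are placeholders.

def fuse_features(audio, video, depth):
    """Feature-level fusion: concatenate per-frame vectors from the
    audio, planar-video, and depth streams."""
    return audio + video + depth

def linear_score(features, weights, bias):
    """Stand-in for an SVM decision function: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def speech_activity(frames, weights, bias):
    """Per-frame binary speech/non-speech decisions for one subject.
    Each frame is an (audio, video, depth) tuple of feature lists."""
    return [linear_score(fuse_features(*f), weights, bias) > 0
            for f in frames]

def diarize(active_a, active_b):
    """Combine the two subjects' activity decisions into per-frame
    diarization labels, including the overlapped-speech case."""
    labels = []
    for a, b in zip(active_a, active_b):
        if a and b:
            labels.append("overlap")
        elif a:
            labels.append("speaker_A")
        elif b:
            labels.append("speaker_B")
        else:
            labels.append("silence")
    return labels
```

Running both subjects' frame sequences through `speech_activity` and then `diarize` yields one label per frame (silence, a single speaker, or overlap), mirroring how the paper combines per-subject classification into speaker diarization.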
