IEEE Workshop on Spoken Language Technology

Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view



Abstract

Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, which also allows speech overlap to be considered. Two sensory setups are employed, where depth video captures either a frontal or a profile view of the subjects and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is considered, using audio and planar video from a sensor at the complementary view setup. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. Classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth view setup, reducing speech activity detection and speaker diarization errors over systems that ignore it.
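The abstract does not give implementation details, but the pipeline it outlines (per-frame fusion of audio, planar-video, and depth features, an SVM speech/non-speech classifier per visually detected subject, and combination of the two per-subject decisions into diarization labels) can be sketched roughly as follows. This is a minimal illustration on synthetic data; the feature types, dimensions, classifier settings, and helper names are assumptions for clarity, not the authors' actual features, corpus, or system.

```python
# Minimal sketch of SVM-based audio-visual speech activity detection with
# feature-level fusion, loosely following the pipeline outlined in the abstract.
# All features below are synthetic placeholders, not the paper's actual features.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def synthetic_stream(n_frames, dim, active):
    """Hypothetical per-frame features; speech-active frames get a mean shift."""
    x = rng.normal(size=(n_frames, dim))
    x[active] += 1.0
    return x

n_frames = 2000
labels = rng.integers(0, 2, size=n_frames).astype(bool)  # 1 = subject speaking

# Per-frame modality streams for one visually detected subject (dimensions assumed):
audio  = synthetic_stream(n_frames, 13, labels)  # e.g. MFCC-like audio features
planar = synthetic_stream(n_frames, 8, labels)   # e.g. mouth-region appearance/motion
depth  = synthetic_stream(n_frames, 8, labels)   # e.g. depth-based lip/jaw movement

# Feature-level fusion: concatenate the available modality streams per frame.
fused = np.hstack([audio, planar, depth])

split = n_frames // 2
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(fused[:split], labels[:split])
pred = clf.predict(fused[split:])
print("frame-level speech activity accuracy:", (pred == labels[split:]).mean())

# With one such classifier per subject, the two per-frame decisions can be
# combined into diarization labels: silence, speaker 1, speaker 2, or overlap.
def diarize(spk1_active, spk2_active):
    states = np.array(["silence", "spk1", "spk2", "overlap"])
    return states[spk1_active.astype(int) + 2 * spk2_active.astype(int)]
```

In the same spirit, the multi-view setup of the paper would simply widen the fused feature vector with the audio and planar-video streams captured from the complementary view before training the classifier.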
