IEEE Transactions on Multimedia

Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help


Abstract

Past research on automatic laughter classification/detection has focused mainly on audio-based approaches. Here we present an audiovisual approach to distinguishing laughter from speech, and we show that integrating information from the audio and video channels may lead to improved performance over single-modal approaches. Each channel consists of two streams (cues): facial expressions and head pose for video, and cepstral and prosodic features for audio. Two types of experiments were performed: 1) subject-independent cross-validation on the AMI dataset and 2) cross-database experiments on the AMI and SAL datasets. We experimented with different combinations of cues; the most informative was the combination of facial expressions, cepstral, and prosodic features. Our results suggest that the audiovisual approach performs better on average than single-modal approaches, and that adding visual information produces better results for female subjects. When the training conditions are less diverse in terms of head movements than the testing conditions (training on the SAL dataset, testing on the AMI dataset), no improvement is observed from adding visual information. On the other hand, when the training conditions are similar (cross-validation on the AMI dataset) or more diverse (training on the AMI dataset, testing on the SAL dataset) in terms of head movements than the testing conditions, adding visual information to audio information yields an absolute increase of about 3% in the F1 rate for laughter.
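To make the fusion and evaluation setup concrete, below is a minimal sketch of the general idea rather than the authors' implementation: feature-level fusion of the three most informative cues (facial expressions, cepstral, and prosodic features), subject-independent cross-validation, and the F1 metric for the laughter class. The feature dimensions, the synthetic data standing in for AMI segments, and the logistic-regression classifier are all illustrative assumptions.

```python
# Sketch: feature-level fusion of audiovisual cues with
# subject-independent cross-validation (hypothetical setup).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 200  # number of audiovisual segments (synthetic stand-ins)

# Hypothetical pre-extracted cues, one row per segment:
facial = rng.normal(size=(n, 20))    # facial-expression features (video)
cepstral = rng.normal(size=(n, 13))  # cepstral features, e.g. MFCCs (audio)
prosodic = rng.normal(size=(n, 6))   # prosodic features, e.g. pitch/energy (audio)
labels = rng.integers(0, 2, size=n)      # 1 = laughter, 0 = speech
subjects = rng.integers(0, 10, size=n)   # subject id per segment

# Feature-level fusion: concatenate the cue combination the abstract
# reports as most informative (facial + cepstral + prosodic).
X = np.concatenate([facial, cepstral, prosodic], axis=1)

# Subject-independent cross-validation: no subject appears in both
# the training and the test fold of any split.
f1_per_fold = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, labels, groups=subjects):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], labels[train_idx])
    pred = clf.predict(X[test_idx])
    # F1 for the laughter class, the metric reported in the abstract.
    f1_per_fold.append(f1_score(labels[test_idx], pred, pos_label=1))

print(f"mean laughter F1: {np.mean(f1_per_fold):.3f}")
```

A decision-level variant of the same idea would train one classifier per cue and fuse their posterior probabilities instead of concatenating the feature vectors; either scheme supports the per-cue-combination comparisons described above.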
