
Audio-visual domain adaptation using conditional semi-supervised Generative Adversarial Networks



Abstract

Accessing large, manually annotated audio databases to build robust models for emotion recognition is a notably difficult task, hampered by annotation cost and label ambiguity. By contrast, thanks to the prevailing role of computer vision in deep learning research nowadays, there are plenty of publicly available emotion recognition datasets based on facial expressivity. In the current work, we therefore studied cross-modal knowledge transfer between the audio and facial modalities in the emotional context. More concretely, we investigated whether facial information from videos can be used to boost the awareness and prediction tracking of emotions in audio signals. Our approach rests on a simple hypothesis: the emotional content of a person's oral expression correlates with the corresponding facial expressions. Research in cognitive psychology supports this hypothesis, suggesting that humans fuse emotion-related visual information with the auditory signal in a cross-modal integration schema to better understand emotions. In this regard, we introduce a method called dacssGAN (Domain Adaptation Conditional Semi-Supervised Generative Adversarial Networks) in an effort to bridge these two inherently different domains. Given as input the source domain (visual data) and conditional information based on inductive conformal prediction, the proposed architecture generates data distributions that are as close as possible to the target domain (audio data). Experiments show that the classification performance on an expanded dataset of real audio augmented with samples generated by dacssGAN (50.29% and 48.65%) outperforms that obtained using only real audio samples (49.34% and 46.90%) on two publicly available audio-visual emotion datasets. (C) 2019 The Authors. Published by Elsevier B.V.
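
The abstract sketches a conditional GAN whose generator maps source-domain (visual) features, together with conditioning information derived from inductive conformal prediction, to samples that approximate the target (audio) domain. As a rough, non-authoritative illustration of that conditioning idea (not the authors' implementation; the module names, layer sizes, feature dimensions, and the generic conditioning vector are all assumptions), a minimal PyTorch sketch:

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        """Maps source-domain (visual) features plus a conditioning vector
        to target-domain (audio-like) features."""
        def __init__(self, visual_dim=512, cond_dim=8, audio_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(visual_dim + cond_dim, 256),
                nn.ReLU(),
                nn.Linear(256, audio_dim),
            )

        def forward(self, visual_feat, cond):
            # Conditioning is injected by simple concatenation (an assumption).
            return self.net(torch.cat([visual_feat, cond], dim=1))

    class Discriminator(nn.Module):
        """Scores whether an (audio-feature, condition) pair looks real."""
        def __init__(self, audio_dim=128, cond_dim=8):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(audio_dim + cond_dim, 256),
                nn.ReLU(),
                nn.Linear(256, 1),
                nn.Sigmoid(),
            )

        def forward(self, audio_feat, cond):
            return self.net(torch.cat([audio_feat, cond], dim=1))

    # Toy forward pass with random tensors standing in for extracted features.
    G, D = Generator(), Discriminator()
    visual = torch.randn(4, 512)  # source domain: visual features
    cond = torch.rand(4, 8)       # conditioning vector (e.g. conformal p-values)
    fake_audio = G(visual, cond)  # generated audio-domain features
    score = D(fake_audio, cond)   # discriminator confidence in (0, 1)

In the paper's setting, the conditioning vector would carry the conformal-prediction-based information and training would follow the usual adversarial min-max objective; the dimensions above are placeholders.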

Bibliographic information

  • Source
    Neurocomputing | 2020, Jul. 15 issue | pp. 331-344 | 14 pages
  • Author affiliation

    Maastricht Univ, Dept Data Sci & Knowledge Engn, Sint Servaasklooster 39, NL-6211 TE Maastricht, Netherlands;

    Maastricht Univ, Dept Data Sci & Knowledge Engn, Sint Servaasklooster 39, NL-6211 TE Maastricht, Netherlands;

    Maastricht Univ, Dept Data Sci & Knowledge Engn, Sint Servaasklooster 39, NL-6211 TE Maastricht, Netherlands;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: eng
  • CLC classification:
  • Keywords

    Domain adaptation; Conformal prediction; Generative adversarial networks;


