International Joint Conference on Neural Networks

DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging



Abstract

Speech sounds are produced by the coordinated movement of the speaking organs. Several methods are available to model the relation between articulatory movements and the resulting speech signal. The reverse problem is often called acoustic-to-articulatory inversion (AAI). In this paper we have implemented several different Deep Neural Networks (DNNs) to estimate articulatory information from the acoustic signal. Several previous works address this task, but most of them use ElectroMagnetic Articulography (EMA) to track the articulatory movement. Compared to EMA, Ultrasound Tongue Imaging (UTI) is a technique with a better cost-benefit ratio when equipment cost, portability, safety and the visualized structures are taken into account. Therefore, our goal is to train a DNN that produces UT images when speech is used as input. We also test two approaches to represent the articulatory information: 1) the EigenTongue space and 2) the raw ultrasound image. As objective quality measures for the reconstructed UT images, we use MSE, the Structural Similarity Index (SSIM) and Complex-Wavelet SSIM (CW-SSIM). Our experimental results show that CW-SSIM is the most useful error measure in the UTI context. We tested three different system configurations: a) a simple DNN with 2 hidden layers and the 64x64 pixels of a UTI file as target; b) the same simple DNN, but with the ultrasound images projected to the EigenTongue space as target; and c) a more complex DNN with 5 hidden layers and UTI files projected to the EigenTongue space. In a subjective experiment, the subjects found that the neural networks with two hidden layers were more suitable for this inversion task.
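The EigenTongue space mentioned in the abstract is essentially a principal-component decomposition of the ultrasound frames, analogous to eigenfaces. The following is a minimal sketch of such a projection using scikit-learn, not the authors' implementation: the 64x64 frame size comes from the abstract, while the number of retained components (128) and the function names are illustrative assumptions.

```python
# Sketch: EigenTongue-style projection of ultrasound tongue frames via PCA.
# Assumes `frames` is a NumPy array of shape (n_frames, 64, 64), values in [0, 1].
import numpy as np
from sklearn.decomposition import PCA

def fit_eigentongue(frames, n_components=128):
    """Fit a PCA basis ("EigenTongues") on flattened 64x64 frames."""
    X = frames.reshape(len(frames), -1)          # (n_frames, 4096)
    return PCA(n_components=n_components).fit(X)

def to_coefficients(pca, frames):
    """Project frames into the EigenTongue coefficient space."""
    return pca.transform(frames.reshape(len(frames), -1))

def to_images(pca, coefficients, shape=(64, 64)):
    """Reconstruct pixel-space images from EigenTongue coefficients."""
    return pca.inverse_transform(coefficients).reshape(-1, *shape)
```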
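Configuration a) from the abstract, a feed-forward DNN with two hidden layers predicting the 4096 raw pixels of a frame (or EigenTongue coefficients for configurations b and c) from an acoustic feature vector, could look roughly like the sketch below. This is a hedged Keras illustration, not the paper's exact setup: the acoustic feature dimensionality, hidden-layer width, activation, optimizer and training settings are all assumptions.

```python
# Sketch: feed-forward acoustic-to-articulatory inversion network.
# Input: one acoustic feature vector per ultrasound frame (e.g. MFCCs);
# output: either the 64x64 = 4096 raw pixels or EigenTongue coefficients.
import tensorflow as tf

def build_inversion_dnn(input_dim=39, output_dim=4096, hidden_units=1000):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(output_dim, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Raw-pixel target (configuration a):
#   model = build_inversion_dnn(output_dim=4096)
# EigenTongue target (configuration b; c would use 5 hidden layers instead):
#   model = build_inversion_dnn(output_dim=128)
#   model.fit(acoustic_features, targets, epochs=50, batch_size=128)
```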

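The objective evaluation compares each reconstructed frame with the recorded one using MSE, SSIM and CW-SSIM. A minimal sketch of the first two with scikit-image is shown below; CW-SSIM requires a complex steerable-wavelet decomposition and is omitted here. Image shape and value range are assumptions.

```python
# Sketch: per-frame objective comparison of a reconstructed ultrasound image
# against the ground-truth frame. Assumes 64x64 float images scaled to [0, 1].
import numpy as np
from skimage.metrics import structural_similarity

def evaluate_frame(reference, predicted):
    mse = float(np.mean((reference - predicted) ** 2))
    ssim = structural_similarity(reference, predicted, data_range=1.0)
    return mse, ssim
```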