...
Journal: Sensors and Materials

Dual-input Control Interface for Deep Neural Network Based on Image/Speech Recognition

Abstract

The objective of this study was to design a dual-input video/audio control interface consisting of two input subsystems, hand posture recognition and speech recognition, so that specific hand postures or voice commands can be used for control without the need for wearable devices. For hand posture recognition, the original video camera images were preprocessed: the face in each image was detected with an AdaBoost classifier and used as a reference point, and an image region of a specific size was selected as the recognition input to increase recognition speed. A neural network comprising convolutional, activation, max pooling, and fully connected layers was used to classify the hand posture images as well as the speech. Speech recognition was achieved with long short-term memory (LSTM), a type of recurrent neural network (RNN). Speech features were extracted by preprocessing the signal and computing Mel-frequency cepstral coefficients (MFCCs): a fast Fourier transform (FFT) converted the signal from the time domain to the frequency domain, the resulting spectrum was passed through a bank of triangular bandpass filters, and a discrete cosine transform was then applied to derive the MFCCs used as the speech feature input. These speech feature parameters were fed to the LSTM network to make predictions and achieve speech recognition. Experimental results showed that the image/speech dual-input control interface had good speech recognition capability.
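The abstract names the layer types of the classification network (convolutional, activation, max pooling, fully connected) but not their sizes or configuration. The pure-Python sketch below shows what one forward pass through such a stack could look like. It assumes a single-channel grayscale input crop, 3x3 kernels, ReLU as the activation, 2x2 max pooling, and a softmax output. The function names, the 28x28 input size, the two kernels, and the five posture classes are all hypothetical, and the weights are random rather than trained.

```python
import math
import random

def conv2d_valid(image, kernel):
    """Single-channel 2-D cross-correlation, stride 1, no padding ('valid')."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]

def relu(fmap):
    """Activation layer: element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2; an odd trailing row/column is dropped."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

def fully_connected(features, weights, biases):
    """Dense layer: one weighted sum plus bias per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_posture(image, kernels, fc_weights, fc_biases):
    """Conv -> ReLU -> max pool for each kernel, flatten, then FC + softmax."""
    flat = []
    for k in kernels:
        pooled = max_pool2x2(relu(conv2d_valid(image, k)))
        flat.extend(v for row in pooled for v in row)
    return softmax(fully_connected(flat, fc_weights, fc_biases))

# Usage with hypothetical sizes and random (untrained) weights.
random.seed(0)
SIZE, N_CLASSES = 28, 5
image = [[random.random() for _ in range(SIZE)] for _ in range(SIZE)]
kernels = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(2)]
n_features = 2 * 13 * 13  # 28 -> 26 after 3x3 conv -> 13 after pooling, per kernel
fc_w = [[random.uniform(-0.1, 0.1) for _ in range(n_features)]
        for _ in range(N_CLASSES)]
fc_b = [0.0] * N_CLASSES
probs = classify_posture(image, kernels, fc_w, fc_b)
print(len(probs), round(sum(probs), 6))  # 5 1.0
```

Shrinking the input to a fixed-size crop, as the abstract describes, directly reduces the number of multiply-adds in every convolution, which is one plausible reason the crop speeds up recognition.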
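The feature-extraction steps in the abstract follow the usual MFCC pipeline. The following minimal pure-Python sketch assumes 16 kHz audio, 25 ms frames (400 samples) with a 10 ms hop, a Hamming window, a 512-point FFT, 26 triangular mel filters, and 13 retained coefficients. These values, the pre-emphasis coefficient of 0.97, and the log step applied before the DCT are common textbook choices, not parameters reported in the abstract.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular bandpass filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + i * (high - low) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):          # rising edge
            filt[k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling edge
            filt[k] = (right - k) / max(right - centre, 1)
        bank.append(filt)
    return bank

def dct_ii(x, n_out):
    """Unnormalised DCT-II; returns the first n_out coefficients."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
            for k in range(n_out)]

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13, pre_emphasis=0.97):
    # Preprocessing: pre-emphasis boosts high frequencies.
    emph = [signal[0]] + [signal[i] - pre_emphasis * signal[i - 1]
                          for i in range(1, len(signal))]
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    feats = []
    for start in range(0, len(emph) - frame_len + 1, hop):
        # Framing + Hamming window, zero-padded to the FFT length.
        frame = [emph[start + i] * window[i] for i in range(frame_len)]
        frame += [0.0] * (n_fft - frame_len)
        # FFT: time domain -> frequency domain (power spectrum).
        power = [abs(c) ** 2 / n_fft for c in fft(frame)[: n_fft // 2 + 1]]
        # Triangular filter bank, then log filter energies.
        energies = [math.log(max(sum(w * p for w, p in zip(f, power)), 1e-10))
                    for f in bank]
        # DCT of the log energies gives the cepstral coefficients.
        feats.append(dct_ii(energies, n_ceps))
    return feats

# Usage: half a second of a 440 Hz test tone.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(8000)]
feats = mfcc(tone)
print(len(feats), len(feats[0]))  # 48 13
```

Each row of `feats` is one frame's feature vector, so an utterance becomes a sequence of vectors, which is the form a recurrent model such as an LSTM consumes.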

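The abstract states that the MFCC feature sequence is fed to an LSTM network but gives no architecture details. The sketch below is a forward pass of the standard LSTM cell (input, forget, and output gates plus a tanh candidate cell) written in plain Python, run over a sequence of 13-dimensional frames. Classifying from the last hidden state, the hidden size of 16, the four voice commands, and all function names are hypothetical choices for illustration, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(w, v):
    return [sum(a * b for a, b in zip(row, v)) for row in w]

def lstm_step(x, h, c, params):
    """One LSTM time step; params maps gate name -> (W, U, b)."""
    def gate(name, act):
        w, u, b = params[name]
        return [act(p + q + r) for p, q, r in zip(matvec(w, x), matvec(u, h), b)]
    i = gate("input", sigmoid)     # how much new information to write
    f = gate("forget", sigmoid)    # how much of the old cell state to keep
    o = gate("output", sigmoid)    # how much of the cell state to expose
    g = gate("cell", math.tanh)    # candidate cell values
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def init_params(n_in, n_hidden, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: (mat(n_hidden, n_in), mat(n_hidden, n_hidden), [0.0] * n_hidden)
            for name in ("input", "forget", "output", "cell")}

def classify_utterance(frames, params, w_out, b_out, n_hidden):
    """Run the LSTM over the feature frames, then softmax on the last hidden state."""
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    for x in frames:
        h, c = lstm_step(x, h, c, params)
    logits = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(w_out, b_out)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Usage with hypothetical sizes; random frames stand in for real MFCCs.
rng = random.Random(0)
N_CEPS, N_HIDDEN, N_COMMANDS = 13, 16, 4
params = init_params(N_CEPS, N_HIDDEN, rng)
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_COMMANDS)]
b_out = [0.0] * N_COMMANDS
frames = [[rng.gauss(0, 1) for _ in range(N_CEPS)] for _ in range(48)]
probs = classify_utterance(frames, params, w_out, b_out, N_HIDDEN)
print(len(probs))  # 4
```

The forget gate is what lets the cell state carry information across many frames, which is why an LSTM rather than a plain RNN is a common choice for variable-length spoken commands.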