A neural network architecture is proposed for recognizing human emotions in audio recordings. Six emotional states are considered: fear, joy, sadness, anger, calmness, and a neutral state. Publicly available speech datasets are used for training. The psychophysical properties of an audio recording are preserved by converting the audio file into a spectrogram image on the mel scale (a mel spectrogram). The mel scale is an empirically derived logarithmic dependence of the pitch perceived by the human hearing organs on the frequency of the sound vibrations. Image-classification methods are then applied to these spectrograms, including convolutional layers (fragment-wise multiplication of pixel-value matrices by given kernel matrices, possibly with a reduction of the image dimensions).
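The mel-spectrogram conversion described above can be sketched in plain NumPy. This is an illustrative implementation, not the authors' exact pipeline: the frame size, hop length, and number of mel bands are assumed values, and the synthetic tone stands in for a real speech recording.

```python
import numpy as np

def hz_to_mel(f):
    # Empirical mel scale: perceived pitch as a log function of frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y, sr, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the power spectrum
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    # Project onto the mel filterbank and convert to a log (dB-like) scale
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return 10.0 * np.log10(mel + 1e-10)

# Synthetic 1-second 440 Hz tone as a stand-in for a speech recording
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)
S = mel_spectrogram(y, sr)
print(S.shape)  # (frames, mel bands) -- the "image" fed to the classifier
```

The resulting two-dimensional array is what gets treated as an image by the convolutional network.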
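The convolution-with-downsampling operation mentioned in parentheses can likewise be illustrated directly: each output pixel is the sum of an elementwise product of an image patch with a small kernel matrix, and max pooling then reduces the image dimensions. The kernel and image below are toy values chosen for illustration only.

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    # Fragment-wise multiplication: slide the kernel over the image,
    # multiply each patch elementwise by the kernel, and sum
    kh, kw = kernel.shape
    h = (img.shape[0] - kh) // stride + 1
    w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(img, size=2):
    # Reduce the picture dimension: keep the max of each size x size block
    h, w = img.shape[0] // size, img.shape[1] // size
    return img[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)  # toy "spectrogram image"
edge = np.array([[1.0, -1.0]])                  # horizontal-gradient kernel
fmap = conv2d(img, edge)     # (6, 5) feature map
pooled = max_pool(fmap)      # (3, 2) after 2x2 max pooling
print(fmap.shape, pooled.shape)
```

In a trained network the kernel values are learned parameters rather than hand-picked, and many such feature maps are stacked per layer.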