This paper reports our speech accent and gender recognition system for the Vietnamese language. Prior studies have shown that the temporal structure of speech contains significant cues for speech accent and gender. However, a conventional CNN cannot use large filters, because a larger filter size increases the network complexity. Inspired by the success of WaveNet, we propose using a dilated convolutional neural network (dilated-CNN) with skip and residual connections to better capture the temporal structure of speech. The experimental results show that our proposed architecture achieves higher performance than a non-dilated CNN.
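To illustrate why dilation helps, the following is a minimal single-channel sketch in Python with NumPy, not the architecture used in this paper. It assumes a kernel size of 3, dilations that double per layer (as in WaveNet), and a plain tanh activation in place of WaveNet's gated units and 1x1 convolutions. The function names `receptive_field`, `dilated_conv1d`, `residual_block`, and `dilated_stack` are hypothetical and chosen only for this example.

```python
import numpy as np


def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of a stack of stride-1 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_conv1d(x, w, dilation):
    """Causal, same-length 1-D convolution of x (T,) with taps w (k,)."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    # out[t] = sum_j w[j] * x[t - (k - 1 - j) * dilation]
    return sum(w[j] * xp[j * dilation: j * dilation + len(x)] for j in range(k))


def residual_block(x, w, dilation):
    """Return (residual output, skip output) for one simplified block."""
    h = np.tanh(dilated_conv1d(x, w, dilation))  # assumption: tanh, no gating
    return x + h, h


def dilated_stack(x, weights, dilations):
    """Stack residual blocks and sum their skip outputs."""
    x = np.asarray(x, dtype=float)
    skips = np.zeros_like(x)
    for w, d in zip(weights, dilations):
        x, skip = residual_block(x, w, d)
        skips += skip
    return skips


# Same depth and same number of weights, very different temporal context.
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)
```

With four layers of three taps each, both stacks have the same number of weights, but the dilated stack covers a much longer span of input frames, which is the property the abstract relies on to capture temporal structure without larger filters.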