An end-to-end deep-learning-based system that can solve both ASR and TTS problems jointly using unpaired text and audio samples is disclosed herein. An adversarially-trained approach is used to generate a more robust independent TTS neural network and an ASR neural network that can be deployed individually or simultaneously. The process for training the neural networks includes generating an audio sample from a text sample using the TTS neural network, then feeding the generated audio sample into the ASR neural network to regenerate the text. The difference between the regenerated text and the original text is used as a first loss for training the neural networks. A similar process is used for an audio sample. The difference between the regenerated audio and the original audio is used as a second loss. Text and audio discriminators are similarly used on the output of the neural network to generate additional losses for training.
展开▼