A Spoken Dialogue System (SDS) receives data relating to user speech signals and extracts an acoustic feature set (e.g., pitch, energy, spectral/filter features, MFCCs, jitter, or shimmer) from the signal. A trained dialogue (policy) model determines an action (e.g., "select"), and the system outputs the information or text that the action specifies (e.g., a location). A trained classifier (e.g., a hidden Markov model, neural network, or random forest) uses the acoustic features to predict a success measure (e.g., interaction naturalness, dialogue length, or user satisfaction), which is input into a reward function for each dialogue to indicate performance during training. A system ("belief") state, i.e., the possible values of a slot (e.g., "low", "mid", or "high" for the slot "price"), may be updated from the speech input using a state tracker model (e.g., a Partially Observable Markov Decision Process model) and passed through the policy model when generating the success measure. Success and the acoustic features are assumed to be related; for example, a slow speech rate indicates a lack of engagement, decreasing the likelihood of achieving the goal.
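A minimal sketch of this pipeline is given below, assuming toy stand-ins for the trained components: all class names, the feature choices, the slot likelihoods, and the reward weights are hypothetical illustrations, not parts of the described system.

```python
import numpy as np

def extract_acoustic_features(signal, sr=16000):
    """Toy acoustic feature set: energy, zero-crossing rate (a crude pitch
    proxy), and spectral centroid. A real SDS would use MFCCs, jitter,
    shimmer, etc."""
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9))
    return np.array([energy, zcr, centroid])

class BeliefStateTracker:
    """Tracks a distribution over values of the slot "price", standing in
    for a POMDP-style state tracker."""
    def __init__(self):
        self.belief = {"low": 1/3, "mid": 1/3, "high": 1/3}

    def update(self, slot_likelihoods):
        # Bayesian-style update: multiply the prior belief by the
        # observation likelihood derived from the speech input, then
        # renormalise.
        for v in self.belief:
            self.belief[v] *= slot_likelihoods.get(v, 1e-6)
        z = sum(self.belief.values())
        self.belief = {v: p / z for v, p in self.belief.items()}
        return self.belief

class PolicyModel:
    """Stand-in for a trained dialogue policy: maps the belief state to an
    action such as ("select", value) or ("request", slot)."""
    def decide(self, belief):
        top_value, top_prob = max(belief.items(), key=lambda kv: kv[1])
        if top_prob > 0.8:
            return ("select", top_value)
        return ("request", "price")

class SuccessClassifier:
    """Stand-in for a trained success classifier (e.g., a random forest):
    here, a fixed logistic model over the acoustic features."""
    def __init__(self, weights, bias=0.0):
        self.w, self.b = np.asarray(weights, dtype=float), bias

    def predict_success(self, features):
        # Probability that the dialogue achieves the user's goal.
        return float(1.0 / (1.0 + np.exp(-(features @ self.w + self.b))))

def reward(success_prob, n_turns, turn_penalty=1.0, success_scale=20.0):
    """Per-dialogue reward: predicted success minus a dialogue-length cost
    (weights here are illustrative)."""
    return success_scale * success_prob - turn_penalty * n_turns

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(16000)              # 1 s of fake audio
    features = extract_acoustic_features(signal)
    belief = BeliefStateTracker().update({"low": 0.7, "mid": 0.2, "high": 0.1})
    action = PolicyModel().decide(belief)
    p_success = SuccessClassifier([0.5, -2.0, 1e-4]).predict_success(features)
    print(action, p_success, reward(p_success, n_turns=5))
```

The demo under `__main__` wires one turn together end to end: features are extracted from the signal, the belief state is updated, the policy picks an action, and the predicted success measure feeds the reward that would score the dialogue during training.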