This paper addresses the problem of generating emotional speech within the Unit Selection approach to text-to-speech synthesis. Drawing on state-of-the-art research in fields ranging from psychology to linguistics, we claim that a complex interplay between the phonetic level and the pragmatic level of language constitutes the basis of the vocal expression of emotions, and that this phonetic-pragmatic interplay can be accounted for in a text-to-speech system by providing accurate representations of contextually relevant discourse markers. The availability of an inventory of expressive cues implementing discourse markers can improve the naturalness and expressivity of generated speech, moving toward the ambitious goal of emotional speech generation.