Modern automatic spoken dialogue systems cover a wide range of applications. There are systems for hotel reservations and restaurant guides, systems for travel and timetable information, as well as systems for automatic telephone-banking services. Building the different components of a spoken dialogue system and combining them in an optimal way such that a reasonable dialogue becomes possible is a complex task, because the system has to deal with uncertain information throughout the course of a dialogue. In this thesis, we use statistical methods to model and combine the system's components. Statistical methods provide a well-founded theory for modeling systems in which decisions have to be made under uncertainty. Starting from Bayes' decision rule, we define and evaluate various statistical models for these components, which comprise speech recognition, natural language understanding, and dialogue management. The problem of natural language understanding is described as a special machine translation problem in which a source sentence is translated into a formal-language target sentence consisting of concepts. For this, we define and evaluate two models. The first model is a generative model based on the source-channel paradigm. Because the word context plays an important role in natural language understanding tasks, we use a phrase-based translation system in order to take local context dependencies into account. The second model is a direct model based on the maximum entropy framework that works similarly to a tagger. For the direct model, we define several feature functions that capture dependencies between words and concepts. Both methods have the advantage that only source-target pairs in the form of input-output sentences must be provided for training. Thus, there is no need to write grammars manually, which significantly reduces the cost of building dialogue systems for new domains.
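The direct model can be illustrated with a minimal sketch: a maximum-entropy (multinomial logistic regression) classifier that assigns a concept tag to each source word, using simple lexical context features and trained by per-example gradient ascent. The feature set, tag inventory, and toy data below are purely illustrative assumptions, not the thesis's actual features or corpora.

```python
# Minimal sketch of a direct maximum-entropy concept tagger (illustrative only):
# each source word is mapped to a concept tag via a softmax over binary
# context features, trained with simple per-example gradient ascent.
import math
from collections import defaultdict

def features(words, i):
    # Hypothetical feature set: current word, previous word, next word.
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    return [f"w={words[i]}", f"w-1={prev_w}", f"w+1={next_w}"]

class MaxEntTagger:
    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(float)  # (tag, feature) -> weight

    def probs(self, feats):
        # Softmax over linear scores, shifted by the max for stability.
        z = [sum(self.weights[(t, f)] for f in feats) for t in self.tags]
        m = max(z)
        exps = [math.exp(s - m) for s in z]
        total = sum(exps)
        return [e / total for e in exps]

    def train(self, data, epochs=20, lr=0.5):
        # data: list of (word sequence, concept-tag sequence) pairs.
        for _ in range(epochs):
            for words, tags in data:
                for i, gold in enumerate(tags):
                    feats = features(words, i)
                    for t, p in zip(self.tags, self.probs(feats)):
                        grad = (1.0 if t == gold else 0.0) - p
                        for f in feats:
                            self.weights[(t, f)] += lr * grad

    def tag(self, words):
        return [max(zip(self.tags, self.probs(features(words, i))),
                    key=lambda x: x[1])[0]
                for i in range(len(words))]

# Toy input-output sentence pairs (source words -> concept tags), illustrative.
data = [
    ("i want to go from hamburg to munich".split(),
     "@filler @filler @filler @filler @origin @city @dest @city".split()),
    ("from munich to hamburg".split(),
     "@origin @city @dest @city".split()),
]
tagger = MaxEntTagger(["@filler", "@origin", "@dest", "@city"])
tagger.train(data)
print(tagger.tag("from hamburg to munich".split()))
```

Note that, as in the thesis's formulation, the only supervision is paired input and output sentences; no hand-written grammar is involved.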
Furthermore, we propose and investigate a framework based on minimum error rate training that results in a tighter coupling between speech recognition and language understanding. This framework allows for an easy integration of multiple knowledge sources by minimizing the overall error criterion: language understanding features can be added to the speech recognition framework in order to minimize the word error rate, or speech recognition features can be added to the language understanding framework in order to minimize the slot error rate. Finally, we develop a task-independent dialogue manager using trees as the fundamental data structure. Based on a cost function, the dialogue manager chooses the next dialogue action as the one with minimal costs. The design and task-independence of the dialogue manager lead to a strict separation between a given application and the operations performed by the dialogue manager, which simplifies porting an existing dialogue system to a new domain. We report results from a field test in which the dialogue manager chose the optimal dialogue action in 90% of the decisions. We investigate techniques for error handling based on confidence measures defined for speech recognition and language understanding, and we investigate the overall performance of the dialogue system when these confidence measures are incorporated into the dialogue strategy. Experiments were carried out on the TelDir database, a German in-house telephone directory assistance corpus, and on the Taba database, a German in-house corpus for a train time scheduling task.
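The cost-based action selection over a slot tree can be sketched as follows. This is a minimal illustration under assumed slot names, cost values, and a confidence threshold; the thesis's actual cost function and tree operations are richer than this toy version.

```python
# Minimal sketch of cost-based dialogue action selection (illustrative only):
# the dialogue manager walks a tree of slots and picks the candidate action
# with minimal cost, e.g. confirming a low-confidence slot before asking for
# a missing one. Slot names, cost values, and the threshold are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Slot:
    name: str
    value: Optional[str] = None
    confidence: float = 0.0
    children: List["Slot"] = field(default_factory=list)

def candidate_actions(slot):
    # Enumerate (cost, action) pairs for one node and its subtree.
    if slot.value is None:
        yield (1.0, f"ask({slot.name})")        # missing information
    elif slot.confidence < 0.7:
        yield (0.5, f"confirm({slot.name})")    # uncertain information
    for child in slot.children:
        yield from candidate_actions(child)

def next_action(root):
    # Choose the action with minimal cost; if none remains, the tree is
    # complete and the result can be presented to the user.
    actions = sorted(candidate_actions(root))
    return actions[0][1] if actions else "present_result()"

# Toy train-timetable slot tree: origin was recognized with low confidence,
# the destination is still missing, the departure time is reliable.
root = Slot("connection", value="partial", confidence=1.0, children=[
    Slot("origin", value="hamburg", confidence=0.4),
    Slot("destination", value=None),
    Slot("time", value="8:00", confidence=0.9),
])
print(next_action(root))  # -> confirm(origin)
```

The separation visible here, in which generic tree operations never mention the application's slot names, is what makes porting the dialogue manager to a new domain a matter of supplying a new tree.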