Dialogue Systems

Introduction

"I would like to travel to Ulm tonight, when is the next train? And how can I get from the train station to the university?" For the employee of the train information a routine question, for enthusiasts in research and industry a technological challenge. Human-computer interfaces based on spoken natural language dialogues are conceived and tested in the "Dialogue Systems" group.

The "Dialogue Systems" group has placed its research and educational focus on the development and evaluation of human-computer dialogue systems, semantic analysis of naturally spoken language and statistic modelling techniques. The initially quoted orientation problem represents one of the application scenarios for the new generation human-computer interfaces: Computer-assisted processing of naturally spoken language, to provide users an easy access to information and applications.

Why speech?

Spoken language is the primary human communication medium. We take it so much for granted that we are normally not even aware of it - we just understand the words and the sentences. Yet the user still requires primitive aids like keyboard and mouse to control the highly developed computer, which has already surpassed humans in a variety of abilities. Computers do not yet understand spoken language well enough for it to serve as the exclusive input modality. Recently, however, the technology has reached product level in some application fields. An increasing number of dictation and naturally spoken language dialogue programmes have become commercially available. They make automatic speech recognition, i.e. the computer-based conversion of spoken language into text, interesting for professional users as well as for a wide range of private users.

The higher levels of speech processing, however, are still mostly reserved for research. Among these are the robust semantic analysis of naturally spoken language, i.e. the textual understanding of a word sequence recognised by the machine, and the dialogue capability that enables the system to ask the user questions and to provide them with information. Modelling these levels allows entirely new forms of interaction and new application fields for spoken language technology to be developed.

Our Research

The Dialogue Systems Group has placed its general research focus on the development and evaluation of user-friendly Spoken Language Dialogue Systems (SLDS). This objective rests on the following major aspects: adaptive dialogue management, assistiveness, and evolution and usability issues.

With research and development in the area of SLDS having made rapid progress over the last decade, these systems are proliferating for a large variety of applications in an increasing number of languages. A significant evolution could be observed from systems supporting isolated word recognition with a limited vocabulary to continuous speech recognition with large vocabularies. Speaker adaptation is an essential condition for making this evolution tractable. However, this trend towards adaptiveness has not yet been applied to the higher language levels. Notably, Dialogue Management (DM) is generally conceived by system developers for standard users, regardless of different interaction styles, user moods and the situation in which the dialogue takes place. It is becoming obvious that the performance of SLDS may vary significantly for different users as well as for the same user across different dialogues and conditions of use.

In order to improve the user-friendliness and ease of use of SLDS, the introduction of context data, including user profiles, emotions and the situation of use, into an adaptive framework for DM seems necessary. It may be based on the following factors (a sketch of such a context model follows the list):

  • A user profile may contain personal user information and preferences with respect to a particular application as well as typical user-system interaction patterns. This increases the system's ability to engage in an effective communication (e.g. by not asking the user for information that can be inferred from the profile). 
  • Human emotion is defined as an interaction between mental and bodily processes. In contrast to the cognitive aspect, the physiological aspects may in part be measured and objectively evaluated. To achieve this, emotion may be detected based on features extracted from the acoustic speech signal or from video recordings of facial expressions and gestures. Alternatively, the human body can be equipped with sensors that measure physiological values, like skin conductance and muscle activity, which are considered a projection of the human emotion. 
  • Notably in safety-critical environments, such as in cars, the situation of use in which the dialogue interaction takes place becomes an important issue. For instance, multisensor-based safety assistants can capture the environment and traffic, on the basis of which a dynamic situation model may be established.
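A minimal sketch of what such a context model might look like is given below. All class and attribute names are illustrative assumptions, not the group's actual implementation; the point is merely that profile, emotion and situation data are kept in one structure that the dialogue manager can consult, e.g. to infer slot values instead of asking for them.

```python
# Illustrative context model for adaptive dialogue management
# (all names and fields are assumptions, not the group's actual code).
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class UserProfile:
    preferences: Dict[str, str] = field(default_factory=dict)  # e.g. {"seat": "window"}
    known_facts: Dict[str, str] = field(default_factory=dict)  # e.g. {"home_station": "Ulm Hbf"}

@dataclass
class EmotionEstimate:
    label: str          # e.g. "neutral", "stressed", "annoyed"
    confidence: float   # 0.0 .. 1.0, e.g. from acoustic or physiological features

@dataclass
class SituationOfUse:
    environment: str    # e.g. "in-car", "desktop"
    workload: float     # 0.0 (idle) .. 1.0 (e.g. dense traffic)

@dataclass
class DialogueContext:
    profile: UserProfile
    emotion: Optional[EmotionEstimate] = None
    situation: Optional[SituationOfUse] = None

    def infer_slot(self, slot: str) -> Optional[str]:
        """Return a slot value inferable from the profile, so the dialogue
        manager can avoid asking the user for it explicitly."""
        return self.profile.known_facts.get(slot)
```

For instance, a request such as "When is the next train home?" could then be completed from a known home station in the profile without an additional clarification question.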

We are developing and validating a method that enables adaptive human-machine spoken language dialogue interaction. The approach relates the user's emotion and the situation of use to the corresponding dialogue interaction history and derives suggestions for the DM to adapt its control strategy accordingly. In this context we also investigate whether and to what extent rule-based approaches to DM are suitable for modelling complex dialogues.
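The following sketch illustrates, under simplified assumptions, how such suggestions could be derived. The thresholds, feature names and strategy labels are hypothetical and only show how emotion, situation and interaction history might jointly influence the control strategy; they do not represent the group's actual algorithm.

```python
# Illustrative sketch: map detected emotion, situation of use and recent
# dialogue history to suggestions for the dialogue manager's control strategy.
from typing import Dict, List

def suggest_strategy(emotion: str, emotion_confidence: float,
                     driver_workload: float,
                     recent_misunderstandings: List[bool]) -> Dict[str, str]:
    suggestions = {"initiative": "mixed",
                   "confirmation": "implicit",
                   "prompt_style": "normal"}

    error_rate = (sum(recent_misunderstandings) /
                  max(len(recent_misunderstandings), 1))
    stressed = emotion != "neutral" and emotion_confidence > 0.6

    if error_rate > 0.3 or stressed:
        # Take more control and confirm explicitly when the dialogue degrades
        # or the user appears stressed or annoyed.
        suggestions["initiative"] = "system"
        suggestions["confirmation"] = "explicit"
    if driver_workload > 0.7:
        # Keep prompts short in demanding situations, e.g. dense traffic.
        suggestions["prompt_style"] = "terse"
    return suggestions

# Example: an annoyed user, two misrecognitions in the last four turns.
print(suggest_strategy("annoyed", 0.8, 0.2, [True, False, True, False]))
```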

Another research topic related to adaptiveness and flexibility investigates the problem of automatic natural language understanding in SLDS. The focus is on the design of a stochastic parsing component. Today's state-of-the-art rule-based approaches to natural language understanding provide good performance in limited applications for specific languages. However, the manual development of such a parsing component is costly, as each application and language requires its own adaptation or, in the worst case, a completely new implementation. Statistical modeling techniques replace the commonly used hand-crafted rules that convert the speech recognizer output into a semantic representation. The statistical models are derived from the automatic analysis of large corpora of utterances paired with their semantic representations. Using the same software, the stochastic models are trained on application- and language-specific data sets. The human effort in porting the component is therefore limited to the task of data labeling, which is much simpler than the design, maintenance and extension of grammar rules.
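The toy sketch below illustrates this workflow under strong simplifications: one concept label per word is learned from a small labelled corpus by maximum likelihood. Real components use considerably richer statistical models, but the porting argument is the same - only the labelled data changes, not the software.

```python
# Toy sketch of the idea behind a stochastic semantic analyser: concept
# labels are learned from labelled utterances instead of hand-crafted rules.
from collections import defaultdict

def train(corpus):
    """corpus: list of (word, concept) sequences produced by data labelling."""
    counts = defaultdict(lambda: defaultdict(int))
    for utterance in corpus:
        for word, concept in utterance:
            counts[word][concept] += 1
    # Maximum-likelihood choice of concept per word.
    return {w: max(c, key=c.get) for w, c in counts.items()}

def parse(model, words):
    return [(w, model.get(w, "O")) for w in words]   # "O" = no concept

corpus = [
    [("to", "O"), ("ulm", "DESTINATION"), ("tonight", "TIME")],
    [("next", "O"), ("train", "TRANSPORT"), ("to", "O"), ("ulm", "DESTINATION")],
]
model = train(corpus)
print(parse(model, ["train", "to", "ulm", "tonight"]))
# [('train', 'TRANSPORT'), ('to', 'O'), ('ulm', 'DESTINATION'), ('tonight', 'TIME')]
```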

Many existing SLDS are either very limited in the scope of their domain functionality or require a rather cumbersome interaction. With more and more application domains becoming available, ranging from unified messaging to trip planning and appointment scheduling, it seems obvious that the current interfaces need to be made more efficient. Constructing and managing complex tasks and interdependencies with these applications places a high cognitive burden on the user, which may be dangerous in certain environments (e.g. driver distraction). Instead of preventing the use of such applications, however, it seems necessary to relieve the user as much as possible from having to manage the complexities involved alone in his or her mind.

In our vision, the SLDS should serve as an integrated assistant, i.e. it should be able to collaborate with the user to find a solution that fits the user's requirements and constraints and that is consistent with the system's world knowledge. In particular, the system's world model should include knowledge about dependencies between certain domains: in a travelling salesperson scenario, for instance, the system could automatically calculate route durations and check for parking space depending on the user's calendar.
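A small hypothetical example of such cross-domain reasoning is sketched below; the place names, times and travel durations are invented and only illustrate how a calendar entry could be checked against route information from another domain.

```python
# Hypothetical sketch of cross-domain reasoning in the assistant's world
# model: a new appointment is checked against the travel time from the
# previous appointment's location (all data is illustrative).
from datetime import datetime, timedelta

calendar = [
    {"title": "Customer A", "place": "Stuttgart",
     "end": datetime(2024, 5, 6, 11, 0)},
]

# Assumed travel-time knowledge, e.g. supplied by a route-planning domain.
travel_minutes = {("Stuttgart", "Ulm"): 75}

def check_new_appointment(place, start):
    """Warn if the appointment cannot be reached in time from the previous one."""
    last = calendar[-1]
    needed = timedelta(minutes=travel_minutes.get((last["place"], place), 0))
    if last["end"] + needed > start:
        return (f"Conflict: travelling from {last['place']} to {place} takes "
                f"{needed.seconds // 60} min; you would arrive late.")
    return "OK: appointment fits the schedule."

print(check_new_appointment("Ulm", datetime(2024, 5, 6, 12, 0)))
```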

As a major step forward, commercial SLDSs have matured from technology-driven prototypes to business solutions. This means that systems can be copied, ported, localised, maintained, and modified to fit a range of customer and end-user needs without fundamental innovation, which is what contributes to creating an emerging industry. Furthermore, in many research laboratories the focus is now on combining speech with other modalities, such as pen-based handwriting and 2D gesture input, and with graphics output, such as images, maps, lip movements, animated agents, or text. An additional dimension which influences development is the widening context of use. Mobile devices in particular, such as mobile phones, in-car devices, PDAs and other small handheld computers, open up a range of new application opportunities for unimodal as well as multimodal SLDSs. Many research issues still remain to be solved. Two issues of critical importance are evaluation and usability. System evaluation is crucial to ensure, e.g., system correctness, appropriateness, and adequacy, while usability is crucial to user acceptance.

In this research initiative, we survey the current baseline and progress on SLDS evaluation and usability, and discuss what we have learned and where we stand today. Various metrics and strategies have been proposed for evaluating the usability of SLDSs. In contrast with technical evaluation, usability evaluation of SLDSs is to a large extent based on qualitative and subjective methods. Ideally, we would prefer quantitative and objective usability scores which can be compared with scores obtained from evaluations of other SLDSs. However, many important usability issues seem unlikely to become amenable to objective quantification in the foreseeable future, and expert evaluation is sometimes highly uncertain or unavailable. Nevertheless, although important problems remain, there exists a rather strong baseline for evaluating the usability of unimodal SLDSs.
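As a simple illustration of the quantitative side, subjective questionnaire ratings can at least be aggregated into numbers that allow comparison across users or systems; the dimensions and ratings below are invented and do not refer to any particular evaluation standard.

```python
# Illustrative aggregation of subjective Likert-scale questionnaire ratings
# (1 = strongly disagree .. 5 = strongly agree) into per-dimension mean scores.
from statistics import mean

ratings = {
    "task success": [5, 4, 5, 4],
    "ease of use":  [3, 4, 2, 3],
    "naturalness":  [2, 3, 3, 2],
}

scores = {dim: round(mean(vals), 2) for dim, vals in ratings.items()}
overall = round(mean(scores.values()), 2)

for dim, score in scores.items():
    print(f"{dim:15s} {score}")
print(f"{'overall':15s} {overall}")
```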

How to evaluate the usability of multimodal SLDSs remains an open research issue in many respects. However, we are not clueless in addressing this issue, since it would seem obvious, for a start, to draw on methods and criteria from SLDS usability evaluation. The issue then becomes one of deciding what is (not) transferable and which new evaluation criteria and metrics are required.