Overview

The Dialogue Systems Group focuses its research on the development and evaluation of user-friendly Spoken Language Dialogue Systems (SLDS). This objective rests on four major aspects: robust and flexible component technologies, approaches to adaptive dialogue management, intelligent user assistance, and evaluation and usability issues.

Two interdisciplinary initiatives form the research framework.

The Dialogue Systems Group is a member of the Competence Center Automotive Electronics and Information Systems established in conjunction with other Engineering and Computer Science Institutes at Ulm University.

The Dialogue Systems Group is also a joint founder of the interdisciplinary Competence Center Perception and Interactive Technologies. Research groups from the Computer and Engineering Science Institutes and the Institutes of Mathematics and Medicine at Ulm University aim at developing innovative technologies for human-computer interaction in different application domains and settings. Major research areas include sensor-based models for perception, learning mechanisms and adaptivity, interactive systems in networked applications, ubiquitous computing, multimedia and visualisation, as well as spoken language dialogue interaction and multimodality. The center provides a framework for fundamental and applied research and combines different interdisciplinary issues. Furthermore, it offers multiple points of contact to the Competence Center Automotive Electronics and Information Systems.

Presentation of the Dialogue Systems Group and the Competence Center Perception and Interactive Technologies by Dipl.-Inform. Dirk Bühler at the Hannover Messe International, Germany, in April 2005.

Robustness

Speech-based Human-Computer Interfaces are becoming increasingly popular. Notably in cars, safety and convenience requirements call for powerful human-computer interfaces to operate increasingly complex functions and devices.

However, in certain environments and conditions of use, a number of factors degrade speech quality. If the speech signal is recorded in a noisy environment (e.g. in a car for hands-free telephony or SLDS use), it is superposed by background noise, which makes reliable speech recognition difficult. In addition, the speech signal may be distorted by perturbed microphone frequency responses or acoustic multipath propagation. If the speech signal is transmitted via telephone, it is subject to band limitation, resulting in reduced quality.

In all these cases, a reconstruction of the distorted or band-limited speech signal is a promising approach. In the case of a perturbed microphone frequency response or acoustic multipath propagation, the signal-to-noise ratio (SNR) at particular frequencies becomes problematic. An appropriate algorithm may detect these perturbed frequency samples and replace them with artificially generated ones. The speech reconstruction may serve as a noise reduction system or improve the performance of subsequent speech analysis components. Since the gap in the spectrum of the speech signal is expected to be narrow, the enhancement will manifest not as more pleasant-sounding speech but as lower noise and improved performance of the speech recognition component.
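As an illustrative sketch of this idea, the following Python fragment detects low-SNR frequency bins (relative to a hypothetical per-bin noise-floor estimate, e.g. averaged over speech pauses) and replaces their magnitudes by interpolation from neighbouring bins. It is a simplified stand-in for the reconstruction algorithm described above, not its actual implementation.

```python
import numpy as np

def reconstruct_spectrum(signal, noise_floor, snr_threshold_db=3.0):
    """Replace low-SNR frequency bins by interpolating from their neighbours.

    `noise_floor` is a hypothetical per-bin noise magnitude estimate,
    e.g. averaged over detected speech pauses.
    """
    spectrum = np.fft.rfft(signal)
    magnitude = np.abs(spectrum)
    snr_db = 20.0 * np.log10(np.maximum(magnitude, 1e-12) /
                             np.maximum(noise_floor, 1e-12))

    bad = snr_db < snr_threshold_db              # perturbed bins to repair
    bins = np.arange(len(spectrum))
    if bad.any() and not bad.all():
        # Interpolate the magnitude across the gap; keep the original
        # phase, to which the ear is relatively insensitive in narrow gaps.
        magnitude[bad] = np.interp(bins[bad], bins[~bad], magnitude[~bad])
        spectrum = magnitude * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(spectrum, n=len(signal))
```

A production system would of course estimate the noise floor adaptively and smooth across frames; the sketch only shows the detect-and-replace principle for a single frame.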

The extraction of a single voice from a babble of multiple speakers and background noise, sometimes referred to as the cocktail party problem, is also a matter of SLDS robustness in critical conditions of use. When several people in the same room converse at the same time, it is remarkable that a person is able to concentrate on one of the speakers and follow his or her speech. Extracting a single voice from multiple speakers and background noise is a highly non-trivial task, and the human ear's performance in this situation remains unsurpassed. Without additional signal processing, speech recognition accuracy degrades significantly in the presence of background noise or when several persons speak simultaneously.

Adaptiveness

With research and development in the area of SLDS having made rapid progress over the last decade, these systems are proliferating for a large variety of applications in an increasing number of languages. A significant evolution has been observed from systems supporting isolated word recognition for a limited vocabulary up to continuous speech recognition featuring large vocabularies. Speaker adaptation has been an essential condition for making this evolution tractable. However, this trend towards adaptiveness has not yet been applied to higher language levels. Notably, Dialogue Management (DM) is generally conceived by system developers for standard users, regardless of different interaction styles, user moods and the situation in which the dialogue takes place. It is becoming obvious that the performance of SLDS may vary significantly for different users as well as for the same user during different dialogues and conditions of use.

In order to improve the user-friendliness and ease of use of SLDS, introducing context data (user profiles, emotions and the situation of use) into an adaptive framework for DM seems necessary. It may be based on the following factors:

  • A user profile may contain personal user information and preferences with respect to a particular application, as well as typical user-system interaction patterns. This increases the system's ability to engage in effective communication (e.g. by avoiding asking the user for information that may be inferred from the profile). 
  • Human emotion is defined as an interaction between mental and bodily processes. In contrast to the cognitive aspect, physiological aspects may in part be measured and objectively evaluated. To achieve this, emotion may be detected based on feature extraction from the acoustic speech signal or from video captures of facial expressions and gestures. Alternatively, the human body can be equipped with sensors to measure physiological values, such as skin conductance and muscle activity, which are considered a projection of the human emotion. 
  • Notably in safety critical environments, such as in cars, the situation of use in which the dialogue interaction takes place becomes an important issue. For instance, multisensor-based safety assistants can capture the environment and traffic, on the basis of which a dynamic situation model may be established.

We are developing and validating a method for adaptive human-machine spoken language dialogue interaction. The approach relates the user's emotion and situation of use to the corresponding dialogue interaction history and derives suggestions for the DM to adapt its control strategy accordingly. In this context we also investigate whether and to what extent rule-based methods for DM are suitable for modeling complex dialogues.
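A deliberately minimal sketch of how such context data might be mapped onto dialogue-control suggestions is shown below. The context fields, strategy parameters and rules are all hypothetical illustrations, not the group's actual adaptation model.

```python
from dataclasses import dataclass

@dataclass
class DialogueContext:
    # Hypothetical abstractions of the context data discussed above:
    # user profile, detected emotion, and situation of use.
    user_is_expert: bool
    emotion: str          # e.g. "neutral", "stressed", "annoyed"
    driving_load: float   # 0.0 (parked) .. 1.0 (demanding traffic)

def select_strategy(ctx: DialogueContext) -> dict:
    """Suggest dialogue-control parameters for the DM (illustrative rules)."""
    strategy = {"initiative": "mixed", "confirmation": "implicit",
                "verbosity": "normal"}
    if ctx.driving_load > 0.7:
        # High situational load: shorter prompts, system keeps the initiative.
        strategy.update(initiative="system", verbosity="terse")
    if ctx.emotion in ("stressed", "annoyed"):
        # Frustrated users get explicit confirmation to avoid error spirals.
        strategy["confirmation"] = "explicit"
    if ctx.user_is_expert:
        strategy["verbosity"] = "terse"
    return strategy

print(select_strategy(DialogueContext(False, "annoyed", 0.9)))
# {'initiative': 'system', 'confirmation': 'explicit', 'verbosity': 'terse'}
```

In the research described above, such suggestions would be learned from or validated against dialogue interaction histories rather than hand-coded as here.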

Another research topic related to adaptiveness and flexibility investigates the problem of automatic natural language understanding in SLDS. The focus is on the design of a stochastic parsing component. Today's state-of-the-art rule-based methods for natural language understanding provide good performance in limited applications for specific languages. However, the manual development of such a parsing component is costly, as each application and language requires its own adaptation or, in the worst case, a completely new implementation. Statistical modeling techniques replace the commonly used hand-crafted rules to convert the speech recognizer output into a semantic representation. The statistical models are derived from the automatic analysis of large corpora of utterances with their corresponding semantic representations. Using the same software, the stochastic models are trained on application- and language-specific data sets. The human effort in component porting is therefore limited to the task of data labeling, which is much simpler than the design, maintenance and extension of grammar rules.
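To illustrate the port-by-relabeling idea in its simplest possible form, the following sketch trains a unigram word-to-slot model from (word, slot) pairs. Real stochastic parsing components use considerably richer models (e.g. HMMs or stochastic grammars), and all words and slot labels here are invented for illustration.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Estimate the most likely semantic slot per word from labeled pairs."""
    counts = defaultdict(Counter)
    for word, slot in corpus:
        counts[word][slot] += 1
    # Keep only the argmax slot per word (a degenerate stochastic model).
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def parse(model, utterance):
    """Map each recognized word to its most likely semantic slot."""
    return [(w, model.get(w, "UNKNOWN")) for w in utterance.split()]

# Toy labeled data; in practice the model is trained on large
# application- and language-specific corpora of annotated utterances.
corpus = [("ulm", "CITY"), ("berlin", "CITY"), ("monday", "DAY"),
          ("to", "NONE"), ("fly", "ACTION"), ("ulm", "CITY")]
model = train(corpus)
print(parse(model, "fly to berlin monday"))
# [('fly', 'ACTION'), ('to', 'NONE'), ('berlin', 'CITY'), ('monday', 'DAY')]
```

Porting this component to a new application or language means relabeling a new corpus and retraining, with no change to the software itself, which is exactly the point made above.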

Intelligence

Many existing SLDS are either very limited in the scope of their domain functionality or require rather cumbersome interaction. With more and more application domains becoming available, ranging from unified messaging to trip planning and appointment scheduling, it is obvious that current interfaces need to become more efficient. Constructing and managing complex tasks and interdependencies with these applications places a high cognitive burden on the user, which may be dangerous in certain environments (e.g. driver distraction). Instead of preventing the use of such applications, however, it seems necessary to relieve the user as far as possible of the need to manage the complexities involved alone.

In our vision, the SLDS should serve as an integrated assistant, i.e. it should be able to collaborate with the user to find a solution that fits the user's requirements and constraints and that is consistent with the system's world knowledge. In particular, the system's world model should include knowledge about dependencies between certain domains: in a traveling salesperson scenario, for instance, the system could automatically calculate route durations and check for parking space depending on the user's calendar.

We are developing a logic-based reasoning approach to be used for efficient and user-friendly human-machine dialogues. User-friendliness may be achieved by supporting the Dialogue Manager (DM) with a module for problem solving, the Problem Assistance (PA). In particular, the reasoning component aims at enabling the DM to engage in explanation dialogues on the basis of the PA's inference process.

In our approach, the DM is viewed as the central interface component between the user and the system. Its strategy consists of constructing a shared set of constraints in a mixed initiative interaction with the user. The PA supports the DM by providing a common knowledge representation based on logical constraints. On this basis, a rule-based reasoning service combines the constraints with the PA's domain theory. Finally, the PA provides proof traces for its inferences in order to enable the DM to engage in explanation and conflict resolution dialogues.
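A toy version of such a reasoning service with proof traces can be sketched as a propositional forward-chaining loop. The facts and rules below are invented for illustration, and the PA's actual constraint language is far more expressive than propositional atoms.

```python
def forward_chain(facts, rules):
    """Derive new facts and record a proof trace for each derivation.

    `rules` are (premises, conclusion) pairs over propositional atoms,
    a deliberately simplified stand-in for the PA's domain theory.
    """
    known = set(facts)
    trace = {}                      # conclusion -> (premises, rule index)
    changed = True
    while changed:
        changed = False
        for i, (premises, conclusion) in enumerate(rules):
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                trace[conclusion] = (premises, i)
                changed = True
    return known, trace

def explain(fact, trace):
    """Unwind the proof trace into steps the DM could verbalize."""
    if fact not in trace:
        return [f"{fact} was given as a user constraint"]
    premises, rule = trace[fact]
    lines = [f"{fact} follows from {premises} (rule {rule})"]
    for p in premises:
        lines += explain(p, trace)
    return lines

rules = [(("meeting_9am", "drive_time_2h"), "depart_before_7am"),
         (("depart_before_7am",), "no_breakfast_stop")]
known, trace = forward_chain({"meeting_9am", "drive_time_2h"}, rules)
print("\n".join(explain("no_breakfast_stop", trace)))
```

The recorded trace is what enables explanation and conflict-resolution dialogues: when the user rejects a derived conclusion, the DM can walk back through the premises and negotiate which constraint to relax.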

Usability

As a major step forward, commercial SLDSs have matured from technology-driven prototypes to business solutions. This means that systems can be copied, ported, localised, maintained, and modified to fit a range of customer and end-user needs without fundamental innovation, which is what makes an emerging industry possible. Furthermore, in many research laboratories the focus is now on combining speech with other modalities, such as pen-based handwriting and 2D gesture input, and with graphics output, such as images, maps, lip movements, animated agents, or text. An additional dimension influencing development is the widening context of use. Mobile devices in particular, such as mobile phones, in-car devices, PDAs and other small handheld computers, open up a range of new application opportunities for unimodal as well as multimodal SLDSs. Many research issues still remain to be solved. Two issues of critical importance are evaluation and usability. Systems evaluation is crucial to ensure, e.g., system correctness, appropriateness, and adequacy, while usability is crucial to user acceptance.

In this research initiative, we survey the current baseline and progress on SLDS evaluation and usability, and discuss what we have learned and where we are today. Various metrics and strategies have been proposed for evaluating the usability of SLDSs. In contrast to technical evaluation, usability evaluation of SLDSs is to a large extent based on qualitative and subjective methods. Ideally, we would prefer quantitative and objective usability scores that could be compared directly with scores obtained from the evaluation of other SLDSs. However, many important usability issues seem unlikely to be subjected to objective quantification in the foreseeable future, and expert evaluation is sometimes highly uncertain or unavailable. Nevertheless, although important problems remain, there actually exists a rather strong baseline for evaluating the usability of unimodal SLDSs.

How to evaluate the usability of multimodal SLDSs remains an open research issue in many respects. However, we are not clueless in addressing this issue, since it would seem obvious, for a start, to draw on methods and criteria from SLDS usability evaluation. The issue then becomes one of deciding what is (not) transferable and which new evaluation criteria and metrics are required.