Contents
This lecture provides an overview of methods for human speech and emotion recognition to enhance the effectiveness and user-friendliness of human-machine interfaces.
Speech is the most natural means of communication between humans. Even complex thoughts and wishes can be expressed quite easily in spoken language. Consumer electronics and computer interfaces, by contrast, generally require the user to navigate cumbersome menu structures, e.g., in order to record a TV movie. Spoken language would make controlling these devices much easier and more natural.
Furthermore, recognising the user's emotional state often seems essential for understanding the true meaning of what has been said or for adapting the machine's behaviour to the user's mood. Humans easily grasp somebody's emotional state just from the tone of voice or the facial expression. Besides recognising emotions from voice or face analysis, techniques exist to recognise emotions from human bio-signals, such as skin temperature or conductivity. The underlying classification techniques are related to those used in automatic speech recognition.
Topics
- Speech Recognition.
This part of the lecture addresses the automatic recognition of speech involving multiple processing steps:
- Pre-processing of the acoustic speech signal to extract the meaningful information (feature extraction)
- Theory of statistical Hidden Markov Models (HMMs) that are used to model speech
- Algorithms for searching and pattern matching (Viterbi search, beam search)
- Generation of pronunciation lexica, grammars and language models
- Speaker adaptation techniques
- Confidence measures
- Neural Networks
- Applications for speech recognition systems
- Emotion Recognition.
This part of the course will cover several aspects regarding the automatic detection of a person's emotional state:
- The definition of "emotion" in the particular context of emotional state recognition
- Extraction of special emotion features from speech
- Recognition of the facial expression from the video/image of the user
- Recognition of the emotional state from bio-signals, such as heart rate, skin temperature and skin conductivity
- Most commonly used statistical classifiers in this field, including neural networks and support vector machines
- Applications for emotion recognition
- Exercises and Practicals.
Development of a speech recogniser using the Cambridge University Hidden Markov Model Toolkit (HTK), including the recording of small speech databases and the pre-processing of the recorded speech, training of statistical acoustic models of different complexities and accuracies, building of grammars and, finally, the recognition process.
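The Viterbi search listed among the speech recognition topics can be illustrated with a minimal sketch. The two states, the three observation symbols and all probabilities below are hypothetical toy values chosen for illustration; they are not taken from the lecture or from HTK, which works on continuous acoustic features rather than discrete symbols.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        key: to_log(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""
    # delta[s]: best log-probability of any state sequence ending in s
    delta = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []

    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step
            prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        delta = new_delta
        backpointers.append(pointers)

    # Trace the best path backwards from the most likely final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path


# Hypothetical toy model: two states emitting coarse energy levels
states = ["silence", "speech"]
log_start = to_log({"silence": 0.6, "speech": 0.4})
log_trans = to_log({
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
})
log_emit = to_log({
    "silence": {"quiet": 0.5, "medium": 0.4, "loud": 0.1},
    "speech": {"quiet": 0.1, "medium": 0.3, "loud": 0.6},
})

best_log_prob, best_path = viterbi(
    ["quiet", "medium", "loud"], states, log_start, log_trans, log_emit
)
print(best_path)  # ['silence', 'silence', 'speech']
```

Working in the log domain avoids numerical underflow on long observation sequences. A beam search would follow the same recursion but discard states whose score falls too far below the current best at each time step.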