A companion system requires the perception and interpretation of its environment in order to detect potential users within a group of persons and to facilitate an interaction involving facial expressions, speech and gestures. In this sub-project, the central tasks of environment perception and modeling as well as gesture recognition are addressed. The environment perception builds on methods for multi-sensor fusion, information fusion and temporal filtering and uses random finite set theory to allow for the simultaneous estimation of the objects' states and the object-individual existence probabilities. The detection, tracking and classification of persons and other objects is realized with a multi-sensor setup, while the classification of user gestures using hidden Markov models (HMMs) is purely image based. Robust gesture classification is achieved using static and dynamic features extracted from the video images. Recent research results show that color information is also suitable for improving the feature extraction. Segmentation based on color and 3D information provides a high degree of robustness against measurement noise, occlusions, brightness variations and background perturbations.