### Various mathematical topics

Initially I was interested in a wide range of topics including basic probability (papers 15, 21), the Riemann ζ-function (paper 6), population dynamics (papers 1, 23), and ergodic theory.

### Mathematical consultant

As a mathematical consultant to the experimentalists at the MPI in Tübingen, I worked on problems of quantitative anatomy and stereology (papers 22, 37) and on the design and evaluation of physiological and behavioral experiments (papers 31, 35, 44); this led to my interest in nonlinear systems (see below) and to acknowledgements in several experimental papers.

### Nonlinear system analysis

This topic arose from analysis of and experimentation on the visuo-motor system of the fly, leading to a fruitful cooperation with Tomaso Poggio (papers 4, 5, 10). In particular, I was able to show that certain apparently simple classes of nonlinear systems, which were actually used by experimentalists to analyse behavioral or neuronal input-output systems, are so large and flexible that they can approximate essentially any system (papers 9, 12, 26). In the case of discrete systems this was later generalized by Cybenko (1989) to two-layer neural networks with arbitrary nonlinearities; for further discussion see 'Turing universality'.

### Visual system

When David Marr visited the MPI in Tübingen for more extensive periods of time, he started a cooperation on stereo-vision with Tommy Poggio and myself (paper 11). There I learned a lot about the visual system and the simulation of the 'early' visual cortices, work which I later continued with Manfred Fahle and others (papers 27, 30, 44) after David's early death.

### Auditory system and speech recognition

With various co-authors I worked on models of the auditory system concerning several topics: speech perception and recognition (papers 53, 60, 72, 95, 97, 99, 212, 213), speaker identification (papers 59, 69, 90), emotion recognition from speech (papers 202, 218, 220, 232), and the cocktail party effect (papers 166, 167).

### Neuroinformatik

This term was probably coined by Rolf Eckmiller around 1986/7, and the establishment of this field in Germany was promoted quite early on by the BMBF, notably in the first joint program on 'information processing in neuronal architecture' (Informationsverarbeitung in neuronaler Architektur), coordinated by Werner von Seelen, with several 'young' German researchers participating in this new field (e.g. von der Malsburg, Eckmiller and myself).

### Cell Assemblies

In the group of Valentin Braitenberg at the Tübingen MPI we discussed Hebb's ideas extensively, concerning the representation of concepts in cell assemblies (paper 17 and, much later, 297) and Hebbian (correlative) synaptic plasticity (papers 19, 38, 40, 41). I tried to use information-theoretic arguments on the optimal memory capacity of associative memories to understand the size and, to some degree, the structure that distributed cell assemblies should have (papers 14, 17, 25, 28, 33, 39, 49). Most of the ideas we developed during that time went into my book 'Neural Assemblies' (Springer 1982). Perhaps the most important early result was that the neural activity in assemblies should be sparse.

### Hebbian synaptic plasticity

Hebb's original idea was that synaptic plasticity glues neurons together to form cell assemblies ('what fires together wires together', as a much later popular slogan in neuroscience puts it). In my book 'Neural Assemblies' (Springer 1982) I also introduced the mathematical concept of a simplified synaptic plasticity rule (as a 4-dimensional vector) and showed that a true interaction of pre- and postsynaptic activity is necessary to achieve a reasonably high memory capacity (see also papers 19, 41).
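The 4-vector format of such a rule can be sketched as follows. This is a minimal illustration in Python (the function names and the example activities are my own, not taken from the book): the rule assigns one weight change to each combination of binary pre- and postsynaptic activity.

```python
import numpy as np

def apply_rule(W, pre, post, rule):
    """One plasticity step: `rule` is the 4-vector of weight changes
    for the (pre, post) activity pairs (0,0), (0,1), (1,0), (1,1)."""
    r = np.array(rule, dtype=float).reshape(2, 2)   # r[pre, post]
    # broadcasting picks the rule entry for every synapse (i, j)
    dW = r[pre[:, None], post[None, :]]
    return W + dW

# The classical Hebb rule: change a weight only when pre AND post fire.
hebb = (0.0, 0.0, 0.0, 1.0)

W = np.zeros((3, 3))
pre = np.array([1, 0, 1])
post = np.array([0, 1, 1])
W = apply_rule(W, pre, post, hebb)
print(W)
```

A rule like (0, 1, 0, 1), whose change depends only on the postsynaptic activity, lacks the pre-post interaction; rules of this degenerate kind are the ones that cannot reach a reasonably high memory capacity.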

### Associative memory

Neural associative memory (NAM) turned out to be not only a good framework for the analysis of Hebb's ideas, but also a potentially useful technical memory device. From both perspectives it has been analysed extensively by various authors (e.g. David Willshaw, John Hopfield, Daniel Amit, and Misha Tsodyks), including myself (e.g. papers 14, 25, 28, 41, 56, 147, 241, and my recent review article, paper 282). The technically most promising associative memory is the so-called Willshaw model (Willshaw et al. 1969 in Nature) with binary synapses and sparse binary activity vectors, which I analysed in 1980 (paper 14). This analysis led to perhaps the first explicit realisation of the importance of sparseness, to my patent application for sparse associative memories in 1988 (US Patent #4,777,622), and to several projects on the technical hardware (VLSI) and software realisation of sparse associative memories (papers 24, 42, 43, 51, 57).
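A minimal sketch of the Willshaw model in Python (the pattern sizes and the number of stored pairs are my own illustrative choices): pairs of sparse binary patterns are stored by OR-ing their outer products into a binary synaptic matrix, and retrieval thresholds the dendritic sums at the number of active input units.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, M = 1000, 10, 500   # neurons per layer, active units per pattern, stored pairs

def sparse_pattern():
    # a binary vector with exactly k ones (sparse activity)
    v = np.zeros(n, dtype=np.uint8)
    v[rng.choice(n, size=k, replace=False)] = 1
    return v

X = [sparse_pattern() for _ in range(M)]   # address patterns
Y = [sparse_pattern() for _ in range(M)]   # content patterns

# Storage: a binary synapse is switched on if its pre- and postsynaptic
# neurons were ever simultaneously active (clipped Hebbian learning).
W = np.zeros((n, n), dtype=np.uint8)
for x, y in zip(X, Y):
    W |= np.outer(x, y)

def retrieve(x):
    # threshold the dendritic sums at the number of active input units
    return (x.astype(int) @ W >= int(x.sum())).astype(np.uint8)

errors = sum(np.any(retrieve(x) != y) for x, y in zip(X, Y))
print(f"memory load: {W.mean():.3f}, pairs retrieved with errors: {errors}")
```

With sufficiently sparse patterns and a memory load near 0.5 (half of the synapses set), this storage scheme approaches its asymptotic capacity of ln 2 ≈ 0.69 bits per synapse; the parameters above keep the load low, so retrieval is typically error-free.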

### Storage capacity

I defined the storage capacity of an associative memory in an information-theoretic way, in terms of the mutual information between the stored and the retrieved patterns (paper 14). This basic idea becomes a bit more complicated in the case of auto-association (instead of hetero-association), where the input has to contain at least some of the information of the output (see papers 46, 47, 78 and 86 for details).

### Sparse activity patterns

My work on Hebbian cell assemblies and associative memory led me quite early on (around 1980, paper 14) to the supposition (or prediction) that neural representations should be sparse. At that time the experimental evidence for sparseness had just started to accumulate, leading up to the widely cited papers by Olshausen and Field (e.g. 1996 in Nature, or 'Sparse coding with an overcomplete basis set ...' in 1997). What one could safely say in 1980 was that the average neuronal spike frequency tends to decrease from peripheral to central neuronal structures or representations.

### Sparseness principle

Sparseness seems to be a useful property of activity vectors and also of connectivity matrices. Both sparse activity and sparse connectivity seem to be realized in the central nervous system, in particular in the mammalian cortex. In artificial neural networks and in many modern machine learning methods they also turn out to be useful, for example as a constraint or an optimisation criterion for regularisation or complexity control (see Vladimir Vapnik's great book on Statistical Learning Theory (1998), Hoyer 2002, 2004, Donoho and Elad 2003, or our papers 260, 286, 310). One of the first examples of this was the use of the L1-norm (which is related to sparseness) in SVM optimisation. The reasons behind this are not yet fully understood.
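The sparsifying effect of the L1-norm can already be seen in one-dimensional penalized least squares. This toy comparison (my own illustration, not the SVM case) contrasts the closed-form solutions under an L1 and an L2 penalty:

```python
import numpy as np

a = np.array([3.0, 0.4, -2.0, 0.1])   # unconstrained least-squares solution
lam = 1.0                             # penalty strength

# L1 penalty: minimizing (w - a)^2 + lam*|w| gives soft thresholding,
# which sets small coefficients exactly to zero (a sparse solution).
w_l1 = np.sign(a) * np.maximum(np.abs(a) - lam / 2, 0.0)

# L2 penalty: minimizing (w - a)^2 + lam*w^2 gives uniform shrinkage;
# no coefficient becomes exactly zero.
w_l2 = a / (1 + lam)

print("L1:", w_l1)
print("L2:", w_l2)
```

The L1 solution drops the two small coefficients entirely, while the L2 solution merely scales all of them down; this is the mechanism by which an L1 constraint acts as a sparseness (and complexity) control.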

### Cortico-cortical projections

In the cerebral cortex one can distinguish two kinds of connections: local connections from each neuron to several thousand neurons in its immediate neighborhood, and long-range (so-called cortico-cortical) connections that connect thousands of neurons in one area to thousands of neurons in another area. These long-range connections are often bidirectional on the level of areas (not of single neurons), and the resulting networks of area-to-area connections or 'projections' have been extensively investigated in several animals by various 'tracing' methods (see the book by Young et al. 'The Analysis of Cortical Connectivity', Springer 1995, and the CoCoMac database, which was initiated by my former student Rolf Koetter and described in Neuroinformatics 2004). The combination of short-range 'intracortical' and long-range 'cortico-cortical' connections offers an interesting connectivity scheme which is far from full connectivity and still allows information transfer between any pair of cortical neurons in about 3-4 steps. Neither the local connections nor the cortico-cortical connections reach all neurons in their target zone, but those neurons that are reached can presumably reach all the others in the target volume (see Braitenberg and Schüz, 1991, Cortex: Statistics and Geometry of Neuronal Connectivity). Moreover, these connections are probably formed, or at least refined, by Hebbian learning. By contrast, the layout of the cortico-cortical areal projection network is probably genetically determined.

### Bidirectional associative memory

This is a version of a hetero-associative memory in which the learned connections c_{ij} from neuron i to neuron j are also used backwards, from neuron j to neuron i. This makes it possible to use iterative retrieval not only for auto-association but also for hetero-association: from x to y, back to x', forward to y', and so on. This can be quite useful in applications and has been described and analysed by several authors (e.g. Kosko 1987, Haines and Hecht-Nielsen 1988, Sommer and Palm, papers 109, 123).
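A minimal sketch of bidirectional iterative retrieval, here built on a binary Willshaw-type matrix (the parameters are my own illustrative choices): the same connection matrix is used forward (x to y) and backward (y to x) until the stored pair is recovered from a noisy cue.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k, M = 200, 200, 6, 20   # layer sizes, pattern activity, stored pairs

def sparse(size):
    v = np.zeros(size, dtype=np.uint8)
    v[rng.choice(size, size=k, replace=False)] = 1
    return v

X = [sparse(n) for _ in range(M)]
Y = [sparse(m) for _ in range(M)]

# clipped Hebbian storage of all pairs in one binary matrix
W = np.zeros((n, m), dtype=np.uint8)
for x, y in zip(X, Y):
    W |= np.outer(x, y)

def step(v, C):
    # one associative step through matrix C, threshold = number of active inputs
    return (v.astype(int) @ C >= int(v.sum())).astype(np.uint8)

# start from a noisy cue: one of the k active units of X[0] deleted
x = X[0].copy()
x[np.flatnonzero(x)[0]] = 0
for _ in range(3):
    y = step(x, W)      # forward:  x -> y'
    x = step(y, W.T)    # backward: y' -> x'  (same synapses, used in reverse)

print("recovered x:", np.array_equal(x, X[0]), "recovered y:", np.array_equal(y, Y[0]))
```

At this low memory load the backward pass restores the deleted unit in a single forward-backward cycle, illustrating how the bidirectional iteration cleans up noisy cues in hetero-association.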

### Two systems (fast and slow)

The idea that human decision making involves two interacting 'systems' or 'processes' has repeatedly been proposed in psychology, often in relation to the 'belly' vs. the 'brain', or intuition vs. reasoning. It is a cornerstone of the argument against the common assumption of 'rational decision makers' in economic theory. A good introduction to this topic is the book 'Thinking, Fast and Slow' by Daniel Kahneman, who received the Nobel prize in economics for his ideas.

### Hybrid technical systems

In applications of methods of traditional artificial intelligence, in particular in complex real-world applications, it is often useful to combine methods from classical propositional or rule-based reasoning with methods based on neural networks or machine learning for low-level data analysis. Such combinations are often called 'hybrid' systems or neuro-symbolic systems. Quite early on we had a project on hybrid systems called WINA ('Wissensverarbeitung in neuronaler Architektur', i.e. knowledge processing in neural architecture; papers 43, 51, 57). Apparently there are not many success stories about hybrid systems, because the interfacing between the lower and the higher system levels is problematic, mainly due to the completely different methodologies and terminologies of the two communities involved. Perhaps in this endeavor we can learn something from the interaction of the two systems in human decision making (see above), which has recently been investigated.

### Turing universality

Alan Turing invented the abstract idea of a Turing machine to investigate the limits of computability. The claim that a class of computational mechanisms or devices is 'Turing universal' means that any computation that a Turing machine or a computer with unlimited memory supply can do, can also be done by a member of the class.

It is quite easy to see that neural networks can perform the basic Boolean operations and therefore can implement any mapping on a finite set or, equivalently, any *finite automaton*. This means that they have the full flexibility of computers or of Turing machines: if they are equipped with unlimited memory, for example in the form of paper for making notes or of a memory element that can store all the bits of a real number, they are Turing universal. This result actually goes back to 1956 (Kleene, in the famous book 'Automata Studies' edited by Shannon and McCarthy). The same is obviously true for networks of associative memories, which we used in the MirrorBot project (see our book 'Biomimetic Neural Learning ...', Wermter, Palm and Elshaw 2005) and subsequently for sentence understanding (papers 180, 201, 203, 211, 226). If we model the activity state of a neuron as a real number, each neuron in a network has in principle not only an unlimited but an infinite memory, and such neural networks can be considered as 'real-valued or analog Turing machines', which in principle have even more computational possibilities than Turing machines (see Hava Siegelmann, 'Neural Networks and Analog Computation: Beyond the Turing Limit', 1998). However, it is unclear how (or whether at all) this capacity can practically be used to solve classical computational problems.
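The first step of the argument, that threshold units implement Boolean operations and hence the state transitions of a finite automaton, can be made concrete with a small sketch (my own illustration): a two-state parity automaton whose XOR transition is built from three threshold units.

```python
def theta(x, t):
    """A threshold unit: fires (1) iff its weighted input sum reaches t."""
    return 1 if x >= t else 0

def xor_net(s, b):
    # XOR from three threshold units: OR(s,b) AND NOT AND(s,b)
    h_or  = theta(s + b, 1)        # hidden unit computing OR
    h_and = theta(s + b, 2)        # hidden unit computing AND
    return theta(h_or - h_and, 1)  # output unit: inhibitory weight on h_and

def parity(bits):
    # a two-state finite automaton: the state is the running parity,
    # and every state transition is computed by the threshold network
    s = 0
    for b in bits:
        s = xor_net(s, b)
    return s

print(parity([1, 0, 1, 1]))
```

Since any Boolean function can be composed from such units, any finite-state transition function, and hence any finite automaton, can be realized this way; Turing universality then only needs the unlimited external memory discussed above.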

### 'Deep' multilayer networks

These 'deep networks' or 'deep learning' methods have recently become very popular in pattern recognition (and to some extent in reinforcement learning); for reviews see LeCun, Bengio and Hinton, 'Deep learning', or Schmidhuber, 'Deep learning in neural networks', both in 2015. In particular for complex problems (like understanding sentences, visual scenes, or complex human activities, or playing two-person games like Chess or Go) it turns out that multiple representations, possibly at different levels of 'generality', can be extremely useful. One reason for this may be the use of the 'lower' or initial parts of the network for transfer learning from similar problems. Theoretically it is known that two processing layers are in principle sufficient (Cybenko, 1989, and others independently at about the same time), but the complexity of a network for a specific problem may become much smaller when additional layers are added (e.g. Parberry, 'Circuit Complexity and Neural Networks', MIT 1994). I was perhaps the first to mention this obvious effect (in 1979, paper 12).

### MirrorBot

MirrorBot was the name of a cooperative EU project in which we developed a neural architecture for the integration of language, vision and action, in order to gain some neuro-modelling insight into mirror neurons (first described by Rizzolatti). We also started to develop an associative memory network for the understanding of simple sentences.

### Understanding simple sentences

In the MirrorBot project we started to construct associative memory networks (conceived as networks of cortical areas) that can understand simple command sentences and generate action plans that would fulfill the commands. This work was an extension and concretization of our older ideas on cell assemblies. It was also meant as a demonstration that such a 'cortical network' can solve a comparatively hard cognitive task in a simple, intuitive and neuroscientifically plausible way. For a restricted vocabulary and grammar we could demonstrate the functionality of the system on a robot (papers 153, 154, 173, 180, 203 and the book 'Biomimetic Neural Learning for Intelligent Robots' by Wermter, Palm and Elshaw 2005). Unfortunately we did not receive further funding for this project, which would have allowed us to demonstrate its usefulness in a broader, linguistically more demanding scenario and to elaborate the relationship of the simulated modules to the functionality of corresponding cortical areas that are currently investigated in cognitive neuroscience. Our proposals for research in this direction were turned down.

### Applications of neural networks

Many of our Bachelor and Master students developed specific technical applications of artificial neural networks, in particular in pattern recognition (e.g. papers 82, 148, 150, 206, 236, 291), prediction, or control problems, where learning is essential. Our Ph.D. students typically also developed new or refined methods, and in recent years much of our work has focused on emotion recognition (papers 206, 220, 298, 300, 309).

### Information fusion

This is an important application of artificial neural networks, where information or data from different sources, or from several subsequent time windows, have to be combined to generate a classification or decision. Using multilayer networks one can study different hierarchical schemes of fusion at different levels (papers 95, 99, 137, 160, 265).
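The simplest fusion scheme, combining classifier outputs at the decision level, can be sketched as follows (the posterior vectors and channel names are made up for illustration):

```python
import numpy as np

# Hypothetical per-channel class posteriors over three classes,
# e.g. from an audio-based and a video-based classifier.
p_audio = np.array([0.6, 0.3, 0.1])
p_video = np.array([0.2, 0.7, 0.1])

# Decision-level ("late") fusion: a weighted average of the posteriors.
w_audio, w_video = 0.5, 0.5
p_fused = w_audio * p_audio + w_video * p_video
decision = int(np.argmax(p_fused))

print(p_fused, "-> class", decision)
```

In hierarchical schemes the same combination step can instead be applied to intermediate-layer features ('early' or mid-level fusion), which is what multilayer networks make easy to study.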

### Multi-sensory-motor systems

This was the topic of our collaborative research at Ulm University in the SFB 527: 'Integration symbolischer und subsymbolischer Informationsverarbeitung in adaptiven sensomotorischen Systemen' (i.e. integration of symbolic and subsymbolic information processing in adaptive sensorimotor systems), where we built a multilevel system architecture for an artificial autonomous agent (papers 33 and 105). Around the turn of the century we continued some of this research in the context of robot soccer, or RoboCup (papers 113, 120, 122, 125, 133, 140, 155, 159).

### Companion technology

This was the topic of collaborative research initiated later (around 2006) at Ulm and Magdeburg University in the SFB-TRR 62: 'Eine Companion-Technologie für kognitive technische Systeme' (i.e. a companion technology for cognitive technical systems). One important point was that the technical companion system has to use information fusion from several sensory channels as well as subsymbolic/symbolic integration to understand the intentions of the human 'companion' or user. Our institute mainly focused on sensory processing, information fusion and emotion recognition in this context (see also 'Applications of neural networks').

### Dynamical (Kolmogorov-Sinai) entropy

This entropy was developed by Kolmogorov as a characterization of the speed of state-space mixing, chaoticity, or apparent randomness of a *dynamical system*. It has turned out to be an important 'isomorphism invariant' that helps to characterize a *dynamical system* up to isomorphism. It measures the conditional information of the next system state after one unit of time, given approximate knowledge of the past states, i.e. the amount of uncertainty in the prediction of the next state from measurements of the past. So only truly chaotic systems can have nonzero entropy.
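As a numerical illustration (my own example, not from the papers discussed here): the logistic map x -> 4x(1-x) has Kolmogorov-Sinai entropy log 2, i.e. one bit of uncertainty per time step, and this value can be estimated from the block entropies of its symbolic dynamics with respect to the generating partition at 1/2.

```python
import math
from collections import Counter

# Generate a long symbolic trajectory of the logistic map x -> 4x(1-x),
# coding each state as 0 (x < 1/2) or 1 (x >= 1/2).
x = 0.3141592
symbols = []
for _ in range(200000):
    x = 4.0 * x * (1.0 - x)
    symbols.append(0 if x < 0.5 else 1)

# Empirical entropy of length-n symbol blocks, divided by n, estimates
# the entropy rate (the KS entropy with respect to this partition).
n = 8
blocks = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n))
total = sum(blocks.values())
H_n = -sum((c / total) * math.log2(c / total) for c in blocks.values())

print(f"entropy rate estimate: {H_n / n:.3f} bits/step (exact value: 1.0)")
```

The estimate approaches 1 bit per step; a non-chaotic system (e.g. a stable fixed point or a rotation) would give an estimate near 0, matching the statement that only truly chaotic systems have nonzero entropy.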

### Ergodic theory of dynamical systems

The simplest mathematical description of a *dynamical system* is a mapping f: X -> X on a 'state-space' X that describes the temporal evolution of the system from one state to the next in one timestep. Usually X is a subset of the n-dimensional real vector space, and so it can be considered as a *topological space* and also as a *measure space*. Mathematical ergodic theory uses this topological or measure-theoretic structure to prove so-called 'ergodic theorems', which essentially say that the temporal average of a real-valued function h on the state space along a trajectory starting at some point x ∈ X is the same for (almost) all starting points in X and is equal to the mean value of h on all of X (with respect to some probability p on X). These theorems are useful for some classical arguments in statistical mechanics which assume that 'time mean equals space mean' (see the unpublished 13 Lectures). In the group of Rainer Nagel (AGFA) in Tübingen I worked on this topic for a while, in particular on the relation between ergodicity, other so-called 'mixing properties' of *dynamical systems* (which measure how strongly the mapping f mixes the states in X), and the dynamical entropy of these systems (papers 2, 3).
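The statement 'time mean equals space mean' can be checked numerically for a simple ergodic system. This sketch (my own example) uses the irrational rotation f(x) = x + α mod 1 on [0, 1), which is ergodic with respect to Lebesgue measure:

```python
import math

alpha = math.sqrt(2) - 1        # irrational rotation angle
h = lambda x: x * x             # observable; its space mean is ∫ x² dx = 1/3

# Birkhoff time average of h along one trajectory of f(x) = x + alpha mod 1
x, total, N = 0.123, 0.0, 100000
for _ in range(N):
    total += h(x)
    x = (x + alpha) % 1.0

print(f"time mean: {total / N:.4f}, space mean: {1/3:.4f}")
```

The time average along the single trajectory converges to the space mean 1/3, as the ergodic theorem predicts for (almost) every starting point.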

### Generalized information theory

Classical Shannon information is defined for discrete random variables X, or equivalently for *partitions* of the underlying probability space Ω into the sets A_{n} = [X = n] = {w ∈ Ω: X(w) = n}, where n runs through all possible discrete values of X. It has very nice mathematical properties, which were investigated in mathematical information theory during the 1950s and '60s. Already during that time information was also used to formalize experimental results on human or animal perception and on human language or music. When reading about this research, my impression was that in some of these experiments, and in particular in human language understanding, it can sometimes be a bit artificial to enforce disjointness of the sets or propositions describing the meaning of words or sentences or of human observations or perceptions. Therefore my idea was to extend Shannon information from proper *partitions* to sets of possibly overlapping propositions, which are called *covers* in set theory or topology. Technically, it required a few additional tricks to develop this generalized information theory in such a way that as many as possible of the nice mathematical properties of Shannon information could be retained.

I used these ideas first in a rather pure mathematical endeavour to understand the relationship between topological dynamical entropy and its probabilistic original, the Kolmogorov-Sinai entropy of *dynamical systems*. Later it turned out that a better understanding of all the new relationships requires a distinction of 3 different versions of information or entropy (paper 18), which I called 'novelty', 'surprise' and 'information' in my book 'Novelty, Information and Surprise' (Springer 2012). In addition it was useful to define different types of *covers* (in particular so-called templates) and another completely new concept, which today I consider to be the most elegant, direct and easily accessible approach to information theory (both classical and generalized): the concept of a *description*, which (for a probability space (Ω, **A**, p), see below) is simply a mapping d from Ω to the sigma-algebra **A** with w ∈ d(w) for every w ∈ Ω. Then the information of d is a random variable, namely I(w) = -log p(d(w)).
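As a toy illustration of a description and its information (my own example, not taken from the book): on a fair six-sided die, let d assign to each outcome one of two overlapping statements, which a proper partition would not permit.

```python
import math

# A toy probability space: Omega = {1,...,6} (a fair die), p uniform.
Omega = [1, 2, 3, 4, 5, 6]
p = {w: 1 / 6 for w in Omega}

# Two overlapping statements: "small" = {1,2,3,4}, "large" = {3,4,5,6}.
small = frozenset({1, 2, 3, 4})
large = frozenset({3, 4, 5, 6})

# A description d: each outcome w is mapped to a statement containing it.
d = {1: small, 2: small, 3: small, 4: large, 5: large, 6: large}

P = lambda A: sum(p[w] for w in A)            # probability of a statement
I = {w: -math.log2(P(d[w])) for w in Omega}   # information I(w) = -log p(d(w))

expected_I = sum(p[w] * I[w] for w in Omega)
print(f"I(w) = {I[1]:.3f} bits for every w; expected information = {expected_I:.3f} bits")
```

Both statements have probability 2/3, so every outcome carries -log2(2/3) ≈ 0.585 bits; the disjoint partition {1,2,3} vs {4,5,6} would give 1 bit, illustrating how overlapping (less precise) statements yield less information.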

### Descriptions and templates

Both concepts are defined in a probabilistic setting, i.e. for a probability space (Ω, **A**, p), where Ω is the basic space or state space containing all elementary events of interest, **A** is a collection (usually a so-called *sigma-algebra*) of subsets of Ω, interpreted as statements about the elementary events, and p is the probability of these statements.

A *description* is a mapping d from Ω to **A** with w ∈ d(w) for every w ∈ Ω.

A *cover* (of Ω) is a subset of **A** whose union is Ω.

A *template* is a cover **C** with the additional property that every w ∈ Ω has one smallest or most precise statement D about it in **C** (i.e. w ∈ D ∈ **C**).