A central component of machine learning applications, and of deep learning applications in particular, is the underlying data required for training and validation.

However, such data is often sensitive in nature, which not only stands in the way of publishing it within Open-Access and Open-Data initiatives, but also hinders the reproducibility of research results. In addition, cooperative research between industry and academic institutions is often limited by the restrictions that sensitive data and regulations such as GDPR impose.

The STEALTH project was created to bridge this gap through data synthesis. Synthesis rids training data of sensitive user-related or company-related information while retaining specific statistical characteristics of the data.


Fig. 1. Generative networks learning the data fidelity throughout the training process. Benjamin Schanzel, Mark Leznik


Fig. 2. Density chart of the several million synthetic datapoints, with clusters in pertinent areas.

While machine learning continues to attract a lot of attention, particularly in fields such as interpretable machine learning, the privacy and sensitivity of training data is still a relatively new topic. Its market potential has been widely recognized by the research community and companies alike: the market is poised to reach $4.8 billion by 2027, with synthetic data being one of the key factors. Our project aims to deliver algorithms for synthesizing data, in our case mainly multidimensional time series data, so that the data can be published and widely used as open source.


Our aim is to publish our results in a reproducible manner, allowing for easy use and access. In conjunction with our work on data synthesis, we therefore also look at the best way to provide and version data, code and training parameters.

As seen in Fig. 1, current Generative Adversarial Networks (GANs) are able to mimic the fidelity of the data, with the learning process shown throughout the training. In particular, when a large number of synthetic samples is plotted, as in the density plot in Fig. 2, clear statistical trends with minor deviations can be observed. This shows that the data is not replicated 1:1; rather, entirely new data is synthesized that retains the statistical properties of the original data.
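
The two properties described above, matching statistics without 1:1 replication, can be checked numerically. The following is a minimal sketch and not the project's actual evaluation code: the function name `compare_real_and_synthetic`, the choice of per-timestep mean and standard deviation as the "statistical properties", and the Euclidean nearest-neighbour distance as a copy check are all assumptions made for illustration.

```python
import numpy as np


def compare_real_and_synthetic(real, synthetic):
    """Compare two sets of time series of shape (n_series, n_timesteps).

    Hypothetical helper: the metrics below are illustrative choices,
    not the evaluation used in the project.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    # Largest per-timestep gap in mean and standard deviation.
    # Small values suggest the statistical trends are retained.
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0)).max()
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0)).max()

    # Euclidean distance from every synthetic series to its nearest real
    # series. A distance of 0 means a real series was copied exactly.
    diffs = synthetic[:, None, :] - real[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

    return {
        "max_mean_gap": float(mean_gap),
        "max_std_gap": float(std_gap),
        "min_nearest_distance": float(nearest.min()),
    }
```

Small gaps combined with a nearest-neighbour distance clearly above zero would be consistent with new data that follows the original distribution. The pairwise distance computation needs memory proportional to the product of both sample counts, so for several million data points it would have to be done in batches.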

Fig. 3 shows our newest results: we were able to synthesize class-specific time series data using Convolutional Neural Networks (CNNs) in a GAN architecture. This makes it possible to generate data with different properties from a single dataset. In addition, the use of CNNs vastly increases the output length of the data (we have produced results of over 15,000 timesteps) and reduces computation time.
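
To illustrate how class conditioning and convolutional upsampling can work together, the sketch below shows only the forward pass of a toy, single-channel conditional generator in NumPy. It is not the project's architecture: the weights are random and untrained, there is no discriminator or training loop, and all names (`init_params`, `generate`, `upsample_conv1d`) and sizes are assumptions chosen for illustration.

```python
import numpy as np


def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot vector."""
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v


def upsample_conv1d(x, kernel, stride=2):
    """Zero-insertion upsampling followed by a 'same' convolution,
    a simple form of transposed 1D convolution."""
    up = np.zeros(len(x) * stride)
    up[::stride] = x
    return np.convolve(up, kernel, mode="same")


def init_params(n_classes, latent_dim=16, init_len=16, n_layers=10,
                kernel_size=5, rng=None):
    """Random (untrained) weights, for illustration only."""
    rng = rng if rng is not None else np.random.default_rng()
    dense = 0.1 * rng.standard_normal((init_len, latent_dim + n_classes))
    kernels = [0.5 * rng.standard_normal(kernel_size) for _ in range(n_layers)]
    return dense, kernels


def generate(z, label, n_classes, dense, kernels):
    """Map noise z and a class label to one synthetic time series."""
    # Conditioning: the class label is appended to the noise vector.
    x = dense @ np.concatenate([z, one_hot(label, n_classes)])
    # Each layer doubles the length, so the output has
    # init_len * 2 ** n_layers timesteps.
    for kernel in kernels:
        x = np.tanh(upsample_conv1d(x, kernel))
    return x
```

With the assumed defaults, an initial length of 16 and ten doubling layers yield 16 * 2^10 = 16,384 timesteps from only ten small kernels, which hints at why convolutional generators scale well to long outputs. Changing `label` while keeping `z` fixed changes the generated series, which is the basic idea behind class-specific synthesis.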

Fig. 3. Synthetic ECG time series data created using conditional CNNs. Naomi Sagawa, Mark Leznik



The Stealth Project research is funded by the Vector Stiftung.