Self-optimised efficiency, security and robustness for active replication

from 2015 to 2021


DFG HA 2207/10-1 and 10-2

OptSCORE is a research project of the Institute for Distributed Systems and the Junior Professorship for Security in Information Systems at the University of Passau.

Networked IT systems very often have high requirements in terms of reliability, availability and security. Replication of services is a fundamental mechanism to fulfill these requirements. In order to achieve scalability at the same time, there are many approaches, especially in the area of storage services, which manage with weakened consistency requirements. However, some system services require stronger consistency guarantees, e.g. coordination services such as ZooKeeper, the Namenode of HDFS or identity management services. Weak consistency models are also unsuitable for Byzantine errors, as divergent values cannot be distinguished from erroneous ones. The OptSCORE project therefore focuses on State-Machine Replication (SMR), a replication method for services with strong consistency requirements. SMR is based on distributed agreement and deterministic execution. The often sequential and thus deterministic execution is unacceptable with the increasing spread of multi-core systems, which has led to the development of deterministic schedulers.

In practice, there is a large range of configurable parameters, from the selection of different protocols to the definition of timeout values. We will first summarize the nature and extent of the influence of protocols, algorithms and parameters on the one hand and of environmental conditions and application behavior on the other hand on the overall behavior of an SMR system. We will focus on group communication and deterministic scheduling of concurrent executions. As a second step, concepts will be designed to automatically derive and continuously optimize the parameters.

A prototype implementation for a reconfigurable and self-adaptive group communication system as well as for a self-optimizing deterministic scheduler will be designed and integrated into a framework for replicated services, which will eventually enable practical evaluations. We expect that self-adapting systems will behave better than rigidly configured or non-configurable systems. The project is thus a fundamental contribution to ultimately bring SMR-based systems closer to practical application.

Related Publications


Köstler, J., Reiser, H.P., Hauck, F.J. and Habiger, G. 2023. Fluidity: location-awareness in replicated state machines. 38th ACM/SIGAPP Symp. on Appl. Comp. – SAC (Mar. 2023).
In planetary-scale replication systems, the overall response delay is greatly influenced by the geographical distances between client and server nodes. Current systems define the replica locations statically during startup time. However, the selected locations might be suboptimal for the clients, and the client request origin distribution may change over time, so a different replica placement may provide lower overall request latencies. In this work, we propose a locationaware replicated state machine that is able to adapt the geographic location of its replicas dynamically during runtime to locations geographically closer to client request origins. Our prototype is able to observe emerging optimization potentials and to reduce the overall request latency for the majority of clients by adapting its replica locations to the time-dependent optimum placement during real-world use case evaluations, whereby the absolute performance gain is dependent on the respective usage scenario.


Berger, C., Eichhammer, P., Reiser, H.P., Domaschka, J., Hauck, F.J. and Habiger, G. 2022. A survey on resilience in the IoT: taxonomy, classification, and discussion of resilience mechanisms. ACM Comp. Surv. 54, 7 (2022), 147:1-147:39.
Internet-of-Things (IoT) ecosystems tend to grow both in scale and complexity, as they consist of a variety of heterogeneous devices that span over multiple architectural IoT layers (e.g., cloud, edge, sensors). Further, IoT systems increasingly demand the resilient operability of services, as they become part of critical infrastructures. This leads to a broad variety of research works that aim to increase the resilience of these systems. In this article, we create a systematization of knowledge about existing scientific efforts of making IoT systems resilient. In particular, we first discuss the taxonomy and classification of resilience and resilience mechanisms and subsequently survey state-of-the-art resilience mechanisms that have been proposed by research work and are applicable to IoT. As part of the survey, we also discuss questions that focus on the practical aspects of resilience, e.g., which constraints resilience mechanisms impose on developers when designing resilient systems by incorporating a specific mechanism into IoT systems.


Köstler, J., Reiser, H.P., Habiger, G. and Hauck, F.J. 2021. SmartStream: towards Byzantine resilient data streaming. 36th Ann. ACM Symp. on Appl. Comp. – SAC (Virtual Event, Republic of Korea, Mar. 2021), 213–222.
Data streaming platforms connect heterogeneous services through the publish-subscribe paradigm. Currently available platforms provide protection against crash faults, but are not resistant against Byzantine faults like arbitrary hardware faults and intrusions. State machine replication can provide this protection, but the higher resource requirements and the more elaborated communication primitives usually result in a higher overall complexity and a non-negligible performance degradation. This is especially true for data streaming if the default textbook approach of integrating the service into a replicated state machine is followed without further adaptions. The standard state management with state logs and snapshots and without any partitioning scheme limits both performance and scalability in a way those systems become unusable in practice. That is why we propose SmartStream, a topic-based Byzantine fault-tolerant data streaming platform that harmonizes the competing concepts of both systems and leverages the specific characteristics of data streaming, namely the append-only semantics of the application state and its partitionable structure. We show its effectiveness in a prototype implementation and evaluate its performance. The evaluation results show a moderate drop in system throughput when compared to state-of-the-art data streaming platforms like Apache Kafka, but reasonable overall performance considering the stronger resilience guarantees.


Habiger, G., Hauck, F.J., Reiser, H.P. and Köstler, J. 2020. Self-optimising Application-agnostic Multithreading for Replicated State Machines. Int. Symp. on Rel. Distr. Sys. – SRDS (2020), 165–174.
State-machine replication (SMR) is a well-known approach for fault-tolerant services demanding fast recovery. It is not easy, however, to parallelise SMR in order to exploit modern multicore architectures. Two main approaches have been extensively studied; one focusing on request-level concurrency using prior knowledge, the other utilising application-agnostic and lock-level deterministic scheduling. We show that significant performance improvements for the latter approach require deterministic scheduler configurations to be dynamically adapted to the current application load during runtime. First, we summarise current research on parallel SMR execution. Second, an analysis of obstacles in lock-level deterministic multithreading approaches shows how static scheduler configurations can lead to poor performance when load on the system varies over time. Third, we present a simple yet effective automatic adaptation solution, which provides significantly better overall system behaviour compared to static configurations. This is demonstrated by evaluations using a full system setup.


Habiger, G. and Hauck, F.J. 2019. Systems support for efficient state-machine replication. Tagungsband des FB-SYS Herbsttreffens 2019 (Osnabrück, 2019).


Habiger, G., Hauck, F.J., Köstler, J. and Reiser, H.P. 2018. Resource-Efficient State-Machine Replication with Multithreading and Vertical Scaling. Proc. of the 14th Eur. Dep. Comp. Conf. (EDCC) (Iaşi, Romania, Sep. 2018).
State-machine replication (SMR) enables transparent and delayless masking of node faults. It can tolerate crash faults and malicious misbehavior, but usually comes with high resource costs, not only by requiring multiple active replicas, but also by providing the replicas with enough resources for the expected peak load. This paper presents a vertical resource-scaling solution for SMR systems in virtualized environments, which can dynamically adapt the number of available cores to current load. In similar approaches, benefits of CPU core scaling are usually small due to the inherent sequential execution of SMR systems in order to achieve determinism. In our approach, we utilize sophisticated deterministic multithreading to avoid this bottleneck and experimentally demonstrate that core scaling then allows SMR systems to effectively tailor resources to service load, dramatically reducing service provider costs.


Hauck, F.J., Habiger, G. and Domaschka, J. 2016. UDS: a novel and flexible scheduling algorithm for deterministic multithreading. Proc. of the 35th Int. Symp. on Reliable Distrib. Sys. (SRDS) (Budapest, Hungry, Sep. 2016).
Hauck, F.J. and Domaschka, J. 2016. UDS: a unified approach to determinisitic multithreading. 36th Int. Conf. on Distrib. Comp. Sys. (ICDCS) (Nara, Japan, Jun. 2016).
Habiger, G., Hauck, F.J., Köstler, J. and Reiser, H.P. 2016. Vertikale Skalierung für aktiv replizierte Dienste in Cloud-Infrastrukturen.