Leverage Big Code for Challenges in Software Modernization

Ulm University

Master's thesis final presentation, Hiba Bouzaida, Location: Zoom, Date: 17.11.2021, Time: 10:30


In recent years, data and code bases have grown dramatically. Managing massive volumes of source code (aka "Big Code") together with millions of metadata records, such as source code changes, bug fixes, and code reviews, has become a significant issue for software organizations. Many companies, even those with basic software requirements, may need to maintain thousands of lines of code that are written in different languages, use widely varying data structures, and contain dependencies from dozens of projects.

In particular, big code represents a major challenge for software modernization. Converting legacy systems (old technologies, computer systems, or application programs that continue to be used and are still valuable to the organization) into modern software architectures requires deep program understanding, business rule extraction, and software transformation. This can be problematic because of compatibility issues, complex structure, poorly understood dependencies, or missing documentation.
Moreover, as code bases grow, it becomes harder for software developers to deal with their inherent complexity while delivering high-quality software products on time. Today, there are several new methods and tools, like Sourcegraph and OpenGrok, that help developers better understand large code bases and software repositories. However, existing solutions focus only on information contained in the source code.

This work goes beyond source code analysis and focuses on metadata about code that is gathered from external systems, such as organizational data, in order to uncover hidden risks and discover new opportunities that may affect the value of an application. The approach presented in this thesis aims to leverage big code by addressing three core challenges: identifying relevant external systems and metadata for data analysis, creating a suitable data model to unify the large amount of heterogeneous metadata about code, and deriving relevant findings from analyzing this metadata. These findings are intended to give insight into hidden patterns and risks in order to enhance software quality and increase business value.