Data-Driven Search and Analysis of Research Software
M.Sc. Florian Sihler
While it is great to see people dedicating their lives to software engineering and related fields of computer science, it is important to keep in mind that many of those who have to write or work with code specialize in other domains.
My current research focuses on helping these people with a non-programmer background, especially with the help of static and dynamic analysis tools. For this, I work on a hybrid dataflow analysis algorithm for the R programming language, which is commonly used for statistical analysis. My work received the YoungRSE award at the deRSE24 and the award for the best master's degree in the field of computer science at Ulm University.
If you are interested, feel free to get in touch with me or check out the flowR repository on GitHub.
Furthermore, I assist in teaching:
Research Projects
Publications
2024
Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition
4th International Conference on Code Quality (ICCQ)
June 2024
DOI: 10.1109/ICCQ60895.2024.10576984
ISBN: 979-8-3503-6646-4
On the Anatomy of Real-World R Code for Static Analysis
21st International Conference on Mining Software Repositories (MSR '24)
January 2024
DOI: 10.1145/3643991.3644911
File: https://arxiv.org/pdf/2401.16228.pdf
2023
GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
GenBench 2023 Workshop
October 2023
DOI: 10.48550/arXiv.2311.09707
One-Way Model Transformations in the Context of the Technology-Roadmapping Tool IRIS
Journal of Object Technology
July 2023
DOI: 10.5381/jot.2023.22.2.a2
2022
A domain-specific language for modeling and analyzing solution spaces for technology roadmapping
Journal of Systems & Software (JSS)
February 2022
DOI: 10.1016/j.jss.2021.111094
Topics for Theses and Projects
Dynamic and Static Program Analysis
Context
Most static analyzers rely on static dataflow analysis to detect problems like possible null pointer exceptions in code [5].
However, analyzers are usually unable to handle reflective or self-modifying code (e.g., Java Agents, Java Reflection, or R's meta-functions) [6]. While this is acceptable for languages in which such constructs are rare or discouraged, they 1) are used quite often in the R programming language and 2) pose an interesting problem to solve.
Problem
As a basis [3], I have previously created the static dataflow analyzer and program slicer flowR for the R programming language. However, it is currently unable to represent reflective and code-modifying constructs such as eval, body, quote, and parse in its static dataflow graph.
While handling such constructs statically may be infeasible in the general case, we first want to focus on a set of common cases that appear frequently.
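To illustrate why such constructs are hard for a static analyzer, consider the following minimal sketch. It is written in Python rather than R and is not flowR's implementation: Python's `exec` on a string stands in for R's `eval(parse(text = ...))`, and the helper names `names_with_context`, `stored_names`, and `loaded_names` are hypothetical. A purely syntactic scan sees the definition of `x`, but neither its use nor the definition of `y`, because both are hidden inside a string.

```python
import ast

# Hypothetical example program; exec on a string plays the role of
# R's eval(parse(text = ...)) in this analogy.
SOURCE = '''
x = 1
exec("y = x + 1")
print(y)
'''

def names_with_context(source, context):
    """Collect all identifiers that appear in the given syntactic context."""
    return {
        node.id
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and isinstance(node.ctx, context)
    }

def stored_names(source):
    """Names that are syntactically assigned somewhere in the source."""
    return names_with_context(source, ast.Store)

def loaded_names(source):
    """Names that are syntactically read somewhere in the source."""
    return names_with_context(source, ast.Load)

defined = stored_names(SOURCE)
read = loaded_names(SOURCE)
# Names read without a visible definition (the two builtins are excluded by hand).
unresolved = read - defined - {"exec", "print"}
print("defined:", defined, "unresolved:", unresolved)
```

In this sketch, `y` ends up unresolved and `x` looks unused, although the program runs without problems. Covering common cases would mean, for instance, recognizing when the evaluated code is a constant and analyzing it as well.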
Tasks
- Develop a concept to represent code modifications and lazy evaluation within flowR's dataflow graph, for example, a function whose argument defaults or body contents have been modified.
- Create a proof of concept implementation for this concept in flowR.
Related Work and Further Reading
- K. Cooper and L. Torczon. Engineering a Compiler. (ISBN: 978-0-12-818926-9)
- U. Khedker, A. Sanyal, and B. Sathe. Data Flow Analysis: Theory and Practice. (ISBN: 978-0-8493-3251-7)
- F. Sihler. Constructing a Static Program Slicer for R Programs.
- A. Ko and B. Myers. Finding causes of program output with the Java Whyline.
- SonarQube, Sonar.
- B. Anckaert, M. Madou, and K. De Bosschere. A Model for Self-Modifying Code.
If you want to, you can have a first look at flowR for yourself: https://github.com/Code-Inspect/flowr.
Contact and More
If you are interested and/or have any questions, feel free to contact me any time.
We can discuss the topic further and try to adapt it to your personal preferences.
Florian Sihler
Context
Dataflow analysis is a very useful and important technique, used, for example, as part of compiler optimizations [1,2] and program comprehension techniques (e.g., slicing [3] or debugging [4]).
Although there is no single dataflow analysis (each analysis answers a slightly different question), dataflow analyzers usually identify how variables in a program relate to each other (e.g., which definitions a variable read may refer to).
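The question of which definitions a variable read may refer to can be sketched with a small reaching-definitions analysis. The following is a minimal illustration in Python over a hypothetical two-statement toy language (`Assign` and `If` are made up for this example); it is not how flowR represents or analyzes R code.

```python
from dataclasses import dataclass

@dataclass
class Assign:
    line: int
    target: str
    reads: tuple = ()

@dataclass
class If:
    line: int
    reads: tuple
    then: list
    orelse: list

def reaching_definitions(block, env, uses):
    """Walk a block of statements.

    env maps each variable to the set of lines whose definition may reach
    the current point; uses records that set for every variable read.
    """
    for stmt in block:
        for name in stmt.reads:
            uses[(stmt.line, name)] = set(env.get(name, set()))
        if isinstance(stmt, Assign):
            # A new definition replaces all earlier ones for this variable.
            env = {**env, stmt.target: {stmt.line}}
        else:
            # Either branch may run, so merge what reaches the end of both.
            then_env = reaching_definitions(stmt.then, env, uses)
            else_env = reaching_definitions(stmt.orelse, env, uses)
            env = {
                var: then_env.get(var, set()) | else_env.get(var, set())
                for var in then_env.keys() | else_env.keys()
            }
    return env

# Toy program, written as R-like pseudocode:
# 1: x <- 1
# 2: if (c) {
# 3:   x <- 2
#    }
# 4: y <- x
PROGRAM = [
    Assign(1, "x"),
    If(2, ("c",), then=[Assign(3, "x")], orelse=[]),
    Assign(4, "y", reads=("x",)),
]

uses = {}
final_env = reaching_definitions(PROGRAM, {}, uses)
print(uses)
```

Without knowing the value of `c`, the read of `x` in line 4 may refer to the definition in line 1 or the one in line 3, so the analysis reports both.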
Dataflow analyzers can be split into:
- static analyzers if they use only the source code of a program as input, and
- dynamic analyzers if they use a specific program execution as input.
While static analysis is usually harder, it has fewer prerequisites, as 1) it does not require inputs (from users, files, network messages, ...), and 2) we do not have to deal with getting a potentially unknown program running. However, dynamic analyzers are usually much more valuable during debugging, as they know the path the program took, the user inputs, the contents of external files, and more.
Problem
Within my master's thesis [3], which is now the basis of my PhD, I have created the static program slicer flowR for the R programming language, which includes a static dataflow analyzer. However, it offers no dynamic dataflow analysis and does not even attempt to run the respective input program.
Tasks
- Enrich flowR's existing pipeline of parsing, normalizing, static dataflow extraction, static slicing, and code reconstruction with a dynamic dataflow analysis step.
- Given a program (for starters without any external dependencies), the dynamic analysis should be able to determine the execution trace of the program (e.g., branches taken, loops entered, and iterations required) with the help of R's debugging capabilities and active bindings [5].
- From that, it should be able to infer which variable references read which values (e.g., which definition of a variable was read), which functions have been called, and so on.
- The planned evaluation is to compare the results of the dynamic analysis with the results of the static analysis and to determine the differences.
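What an execution trace of this kind can look like is sketched below. The example is only an analogy: it uses Python's `sys.settrace` hook instead of R's debugging capabilities or active bindings, and `trace_lines` and `classify` are hypothetical names introduced for illustration.

```python
import sys

def trace_lines(func, *args):
    """Run func(*args) and record which of its lines were executed.

    Lines are reported as offsets relative to the line of the def statement.
    """
    code = func.__code__
    executed = []

    def tracer(frame, event, arg):
        # Only record line events that belong to the function under analysis.
        if frame.f_code is code and event == "line":
            executed.append(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, executed

def classify(n):
    if n < 0:
        label = "negative"      # offset 2: then-branch
    else:
        label = "non-negative"  # offset 4: else-branch
    return label

for value in (-1, 3):
    result, lines = trace_lines(classify, value)
    print(value, result, lines)
```

Unlike a static analysis, each run only observes the branch that was actually taken for the given input. Comparing such traces with the statically computed dataflow is one way to approach the evaluation mentioned in the last task.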
Related Work and Further Reading
- K. Cooper and L. Torczon. Engineering a Compiler. (ISBN: 978-0-12-818926-9)
- U. Khedker, A. Sanyal, and B. Sathe. Data Flow Analysis: Theory and Practice. (ISBN: 978-0-8493-3251-7)
- F. Sihler. Constructing a Static Program Slicer for R Programs.
- A. Ko and B. Myers. Finding causes of program output with the Java Whyline.
- R, Active Bindings
If you want to, you can have a first look at flowR for yourself: https://github.com/Code-Inspect/flowr.
Context
Static program analysis is a well-researched field [1,2], useful in various domains like compiler optimizations [3] and linting [4].
However, static analysis is unable to find semantic smells and bugs and requires a lot of work to set up.
On the other hand, current large language models (LLMs, like ChatGPT) can quickly answer questions about code and find (potential) semantic and syntactic bugs, with an easy-to-use interface and little setup required.
Problem
Even though LLMs are easy to use and quick to give an answer, this answer is not always correct [5].
Furthermore, with their hype being relatively new, there is not much research on how their hallucinations hinder linting tasks or make them outright harmful.
To address that, we want to analyze common smells and errors in real-world code (including those that common linters cannot find), synthetically generate code with these smells and errors, and then analyze how well LLMs can detect as well as "fix" them.
Tasks
- Identify common smells and errors in real-world R code.
- Synthetically generate code with these smells and errors.
- Analyze and classify how well LLMs can detect and fix those problems.
Related Work and Further Reading
- García-Ferreira et al., Static analysis: a brief survey, 2016
- Gosain et al., Static Analysis: A Survey of Techniques and Tools, 2015
- K. Cooper and L. Torczon. Engineering a Compiler. (ISBN: 978-0-12-818926-9)
- Hester et al., lintr: A 'Linter' for R Code, 2023
- Zhang et al., Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models, 2023
Context
Dataflow analysis is a very useful and important technique, used, for example, to
- allow compiler optimizations [1,2],
- aid program comprehension (e.g., slicing [3] or debugging [4]), and
- perform code analysis (e.g., to locate possible null pointer exceptions [5]).
A static dataflow analyzer takes the source code of a program as its input and identifies how variables in a program relate to each other (e.g., which definitions a variable read may refer to).
However, this can happen at different levels of granularity.
For example, when reading a single cell of an array, a coarsely grained analyzer may refer to any potential write to the array, while a more detailed analysis could restrict the definitions to those that modify the respective entry.
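The difference between the two granularity levels can be sketched as follows. This is a minimal Python illustration for straight-line code with constant indices, under the simplifying assumption that every write targets a single cell; `coarse_writes` and `per_cell_writes` are hypothetical helpers and do not reflect flowR's implementation.

```python
from typing import Optional

# One write per entry, in program order:
# (line, array name, constant index, or None if the index is unknown statically)
Write = tuple[int, str, Optional[int]]

def coarse_writes(writes: list[Write], array: str) -> set[int]:
    """Coarse analysis: any write to the array may affect the cell that is read."""
    return {line for line, name, _ in writes if name == array}

def per_cell_writes(writes: list[Write], array: str, index: int) -> set[int]:
    """Cell-sensitive analysis: only writes that may touch the given cell."""
    reaching: set[int] = set()
    for line, name, idx in writes:
        if name != array:
            continue
        if idx == index:
            reaching = {line}   # definite overwrite: earlier writes no longer reach
        elif idx is None:
            reaching.add(line)  # unknown index: may or may not hit this cell
    return reaching

# R-like pseudocode for the writes below:
# 1: a[1] <- 10
# 2: a[2] <- 20
# 3: a[i] <- 30   (i is not known statically)
# 4: a[1] <- 40
WRITES: list[Write] = [(1, "a", 1), (2, "a", 2), (3, "a", None), (4, "a", 1)]

print(coarse_writes(WRITES, "a"))
print(per_cell_writes(WRITES, "a", 2))
print(per_cell_writes(WRITES, "a", 1))
```

For a read of `a[2]`, the coarse variant reports all four writes, while the cell-sensitive variant keeps only lines 2 and 3. A program slice built on the latter would be correspondingly smaller.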
Problem
Within my master's thesis [3], which is now the basis of my PhD, I have created the static program slicer flowR for the R programming language, which includes a static dataflow analyzer.
However, it currently does not differentiate the individual cells of arrays or the attributes of an object (i.e., it does not analyze pointers) [6].
Tasks
- Extend flowR's static dataflow analysis with pointer analysis.
- Differentiate cells of a vector with constant access.
- Differentiate data frames, slots, and other pointer types.
- Track aliases to identify when pointers relate to each other.
- Evaluate the achieved reduction in the size of the resulting slices.
Related Work and Further Reading
- K. Cooper and L. Torczon. Engineering a Compiler. (ISBN: 978-0-12-818926-9)
- U. Khedker, A. Sanyal, and B. Sathe. Data Flow Analysis: Theory and Practice. (ISBN: 978-0-8493-3251-7)
- F. Sihler. Constructing a Static Program Slicer for R Programs.
- A. Ko and B. Myers. Finding causes of program output with the Java Whyline.
- SonarQube, Sonar.
- M. Hind. Pointer Analysis: Haven’t We Solved This Problem Yet?
If you want to, you can have a first look at flowR for yourself: https://github.com/Code-Inspect/flowr.
M.Sc. Florian Sihler
Institute of Software Engineering and Programming Languages
Albert-Einstein-Allee 11