Prof. Dr. Mathias Klier
+49 (0) 7 31 50-3 23 12
mathias.klier(at)uni-ulm.de
Text-based user-generated content (UGC), such as customer reviews, wiki entries or social media posts, now forms a key foundation for data-driven applications. The widespread use of generative AI systems such as ChatGPT and other large language models has made it clear just how much the performance of modern AI depends on the quality of textual data. Inadequate data quality can not only compromise the quality of analysis results, but also lead to bias, instability and decisions that are difficult to understand.
The DQUGC project, funded by the German Research Foundation (DFG), is a follow-up to the DQNGI project and is being carried out as part of a DFG continuation grant. In the predecessor project, a key conceptual contribution to the measurement of data quality was made with a publication in MIS Quarterly: for the first time, it was demonstrated how events can be explicitly modelled as causes of data quality problems and identified via characteristic patterns in the data. Using duplicates as an example, an event-driven approach was developed that makes data quality measurable probabilistically via event-specific data patterns, rather than purely syntactically.
DQUGC adopts this event-based approach and develops it further. The aim of the project is to extend event-driven data quality measurement to other types of textual data and to data quality dimensions beyond duplicates. DQUGC focuses particularly on unstructured, textual content, such as that found in UGC and in the training data of modern AI systems.
A central focus is on investigating how event relationships and the data quality information derived from them can be systematically integrated into machine learning methods. This includes, among other things, the use of quality information for weighting, selecting or pre-processing training data, as well as for interpreting model results. In doing so, the project addresses fundamental challenges facing current GenAI systems.
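As a rough illustration of what weighting and selecting training data by quality information could look like, the following minimal Python sketch filters out low-quality records and turns the remaining quality scores into normalised training weights. It is not the project's method: the `Record` type, the `quality` score in [0, 1], the `min_quality` threshold and the example values are all hypothetical assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    quality: float  # hypothetical data quality score in [0, 1]


def select_and_weight(records, min_quality=0.5):
    """Drop records below min_quality and return (record, weight) pairs.

    Weights are proportional to the quality score and sum to 1.
    The threshold of 0.5 is an arbitrary assumption for illustration.
    """
    kept = [r for r in records if r.quality >= min_quality]
    total = sum(r.quality for r in kept)
    if total == 0:
        return []
    return [(r, r.quality / total) for r in kept]


# Illustrative, made-up records; the second one stands for a probable duplicate.
reviews = [
    Record("Great product, works as described.", 0.9),
    Record("Great product, works as described!!", 0.3),
    Record("Arrived late but fine.", 0.6),
]

weighted = select_and_weight(reviews)
for record, weight in weighted:
    print(f"{weight:.2f}  {record.text}")
```

In this sketch, selection (the threshold) and weighting (the normalised scores) are kept separate so that either step could be used on its own, for example by passing the weights as per-sample weights to a learning method that supports them.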
The approaches developed in DQUGC are relevant not only for scientific research in the fields of data quality, text analysis and machine learning, but also for industry partners who use large volumes of textual data or AI-based systems. At the same time, the project offers students the opportunity to engage with current issues at the interface of data quality, events and modern AI as part of their dissertations and research projects.
In cooperation with the University of Regensburg, the University of Ulm is pursuing the following research questions:
- How can data quality in textual user-generated content be measured and improved automatically in an event-driven manner?
- How can data quality information be integrated into machine learning methods and GenAI models in a methodologically sound manner?
Cooperation partner: University of Regensburg
Funding body: German Research Foundation (DFG)
Project duration: until 2027