Data quality in user-generated content

In an increasingly digital world, the amount of user-generated content (UGC), such as customer reviews on rating platforms, articles in wikis, or posts on social media, is growing rapidly. Because this treasure trove of data holds enormous economic potential, machine learning methods for analyzing large amounts of unstructured data have become highly relevant in research and practice in recent years. However, such analyses and their results can only be valid and add value if the quality of the underlying input data is assured.

Yet, in contrast to structured data, no comparable approaches to the automated measurement and improvement of data quality exist to date for unstructured, textual UGC. In addition, the machine learning methods used for analysis currently take only very limited account of the fact that textual UGC can exhibit poor data quality. This is where the planned project comes in: it seeks solutions for measuring the data quality of UGC and for taking that data quality into account in machine learning methods. The University of Ulm, in cooperation with the University of Regensburg, is pursuing the following research questions:

  1. How can data quality (DQ) be measured and improved in textual UGC in an automated way?
  2. How can DQ-annotated textual UGC be processed methodically in machine learning methods?

Cooperation partners: University of Regensburg

Funding body: German Research Foundation (DFG)

Project period: until 2024