Data quality in user-generated content

In an increasingly digital world, the amount of user-generated content (UGC), such as customer reviews on review sites, articles in wikis, or posts on social media, is growing rapidly. Because unstructured data holds enormous economic potential, machine learning methods for analyzing it are gaining popularity in research and practice alike. However, such analyses and their results can only be valid and add value if the quality of the underlying input data is assured.

Yet, unlike for structured data, no comparable approaches to automatically measuring and improving data quality (DQ) exist to date for unstructured, textual UGC. In addition, the machine learning methods currently used to analyze UGC take only very limited account of the fact that textual UGC can be of poor quality. The planned project addresses this gap by seeking ways to measure the data quality of UGC and to incorporate that quality information into machine learning methods. The University of Ulm, in cooperation with the University of Regensburg, is pursuing the following research questions:

1. How can DQ be measured and improved in textual UGC in an automated way?
2. How can DQ-annotated textual UGC be processed in machine learning methods?

Cooperation partners: University of Regensburg

Funding body: German Research Foundation (DFG)

Project period: until 2024