Measuring and improving data quality in unstructured data

Because the volume of unstructured data is growing rapidly (often discussed under the label "Big Data"), data quality (DQ) is a highly relevant topic. Large amounts of unstructured data in diverse formats are collected from different, distributed sources and analysed, often in real time, to derive business-relevant insights, support business decisions and develop data-driven services. The derived results are valid and add value only if the quality of the underlying data is ensured. Insufficient DQ produces erroneous findings that lead to incorrect decisions and can cause more harm than good ("garbage in, garbage out").

In the project "Data Quality Measurement and Measures for Unstructured Data (DQMM)", we developed efficient quantitative methods for measuring, controlling and improving DQ and evaluated them in concrete application scenarios. The measurement methods address data value-driven quality dimensions (e.g., consistency, timeliness, accuracy and uniqueness). We also developed and extended methods for analysing unstructured data (e.g., data mining and text mining methods) so that they directly take the measured DQ level into account. The results have far-reaching economic and scientific implications: taking DQ into account in data analysis methods leads to more reliable results and improved decision quality. Specifically, wrong decisions with high economic damage (e.g., a wrong assessment of competition or credit risk) can be avoided, and novel data-driven services can be developed. Measuring the DQ of unstructured data is an essential prerequisite for this.
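
To make the idea of a quantitative DQ metric concrete, the following Python sketch scores two of the dimensions named above, uniqueness and timeliness, on a scale from 0 to 1. It is an illustration only and not the project's actual method: the function names, the text normalisation, the exponential decay form and the decline rate are all assumptions chosen for this example.

```python
import math
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and keep only word tokens so trivial variants compare equal."""
    return " ".join(re.findall(r"\w+", text.lower()))


def uniqueness(documents: List[str]) -> float:
    """Share of documents that are distinct after normalisation (1.0 = no duplicates).

    Hypothetical metric for illustration; exact-match only, no fuzzy matching.
    """
    if not documents:
        return 1.0
    distinct = {normalize(doc) for doc in documents}
    return len(distinct) / len(documents)


def timeliness(age_in_days: float, decline_rate: float) -> float:
    """Exponential-decay score in (0, 1]; older data scores lower.

    decline_rate is an assumed parameter that would have to be estimated
    for the data at hand.
    """
    return math.exp(-decline_rate * age_in_days)


# Two of the three reviews differ only in case, spacing and punctuation.
reviews = [
    "Great battery life!",
    "great battery  life",
    "Screen cracked after a week.",
]
print(uniqueness(reviews))                      # 2 distinct texts out of 3
print(timeliness(age_in_days=100, decline_rate=0.01))
```

A score like this could then be passed to a downstream analysis step, for example as a weight, which is the sense in which an analysis method can take the measured DQ level into account.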

Cooperation partner: xapio GmbH

Funding body: Bavarian Ministry of Economic Affairs, Regional Development and Energy

Project period: June 2015 - June 2018