Data quality measurement and measures for wikis and knowledge graphs

Organisations today have a wide variety of internal and external data sources at their disposal. Organisations that make intensive use of large amounts of data ("Big Data") to support decisions and that consider themselves "data-driven" can achieve significantly better financial and operational results. This underlines how critical a resource organisational data have become. Currently, however, internal and external data are often held in many separate "data silos" with varying degrees of structure. This makes them difficult to utilise, as employees cannot effectively and efficiently find and access the data relevant to their current situation. Two modern options for mapping and formalising static and dynamic domain knowledge for this purpose are (enterprise) wikis and knowledge graphs.

(Enterprise) wikis can provide data, information and knowledge from many employees in semi-structured but also unstructured (text) form. Wikipedia, for example, became a central data source on the Internet within a short time of its launch in 2001 – both for organisations and for individuals.

While wikis make data and information easily accessible and editable for humans, their semi-structured or unstructured format makes them harder to process algorithmically (e.g., with machine learning and artificial intelligence) and, in particular, to interpret semantically. Algorithms can extract the semantic relations within a wiki's data and information automatically and in a quality-assured manner only to a limited extent. Knowledge graphs offer the possibility of making data and information (e.g., from wikis) available for further services as attributes of and relations between entities, and of deriving new insights from them.
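To make the "attributes and relations of and between entities" concrete, the following is a minimal, illustrative sketch of representing wiki-derived facts as subject-predicate-object triples, the basic building blocks of a knowledge graph. The entity and relation names are our own examples, not taken from the project.

```python
# Illustrative sketch: facts from a wiki page represented as triples.
# Names ("instanceOf", "launchYear") are hypothetical examples.
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Facts that might be extracted from a wiki article about Wikipedia itself
graph = [
    Triple("Wikipedia", "instanceOf", "OnlineEncyclopedia"),
    Triple("Wikipedia", "launchYear", "2001"),
]


def objects_of(graph, subject, predicate):
    """Return all objects matching a given subject/predicate pair."""
    return [t.obj for t in graph
            if t.subject == subject and t.predicate == predicate]
```

In this structured form, a query such as `objects_of(graph, "Wikipedia", "launchYear")` can be answered mechanically, which is exactly what the semi-/unstructured wiki text does not allow.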

The characteristics of wikis and knowledge graphs, for example with regard to the degree of structure, the large number of potential data sources and persons involved (with mostly uncontrolled publication and usage processes), and the difficult-to-access semantic relations, pose new challenges for assuring data quality. Failing to ensure data quality has significant negative consequences, because the results derived from wikis and knowledge graphs, as well as decisions based on them, are only valid and value-creating if the underlying data and information are (semantically) correct, consistent, up to date, etc. In addition, many knowledge graphs (e.g. DBpedia) are built automatically by extracting structured data and information from the unstructured text of wikis. On the one hand, this represents a great opportunity for data management, as the knowledge stored in wikis becomes more usable for algorithms in the structured form of a knowledge graph. On the other hand, data quality problems in the underlying unstructured data can transfer directly to the knowledge graph in this process, and new data quality problems can arise from errors during the extraction process.
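One way extraction errors can be kept out of a knowledge graph is through validation rules applied before extracted facts are stored. The following is a hedged sketch with an assumed rule of our own (not from the project): a plausibility check on an extracted year, which catches a typical extraction mistake such as the letter "O" read in place of a zero.

```python
# Illustrative validation rule (assumed, not from the project):
# accept an extracted value as a year only if it is a plausible
# four-digit number. Garbled extractions like "20O1" are rejected.
import re


def validate_year(value: str) -> bool:
    """Accept only plausible four-digit years between 1000 and 2100."""
    if not re.fullmatch(r"\d{4}", value):
        return False
    return 1000 <= int(value) <= 2100
```

A quality-assured extraction pipeline would apply many such rules (type checks, range checks, cross-source consistency checks) so that errors in the unstructured source do not silently propagate into the graph.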

The project "Data Quality Measurement and Measures for Wikis and Knowledge Graphs (DQMM@Wiki)" thus aims to develop methods and metrics for measuring, as well as measures for improving, the data quality of wikis and knowledge graphs. In addition, approaches for the automated and quality-assured creation of knowledge graphs from unstructured data such as (enterprise) wikis will be developed.
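To illustrate what a data quality metric can look like, here is a minimal sketch of a completeness metric for knowledge graph entities. It is our own simplified example under the assumption that completeness is measured as the share of expected attributes that are actually populated; it does not reproduce the metrics developed in the project.

```python
# Illustrative completeness metric (our own simplification, not the
# project's): fraction of expected attributes that are present and
# non-empty for a given entity.


def completeness(entity: dict, expected_attributes: list) -> float:
    """Share of expected attributes that are present and non-empty."""
    if not expected_attributes:
        return 1.0
    filled = sum(
        1 for attr in expected_attributes
        if entity.get(attr) not in (None, "", [])
    )
    return filled / len(expected_attributes)


# Hypothetical entity with 2 of 4 expected attributes filled
entity = {"label": "Wikipedia", "launchYear": "2001", "founder": ""}
expected = ["label", "launchYear", "founder", "headquarters"]
score = completeness(entity, expected)
```

Other quality dimensions mentioned above, such as correctness, consistency and currency, can be operationalised with analogous metrics, typically normalised to [0, 1] so that scores are comparable across dimensions.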

Cooperation partner: xapio GmbH

Funding body: Bavarian Ministry of Economic Affairs, Regional Development and Energy

Project period: 2020 - 2023