Data quality measurement and measures for wikis and knowledge graphs

Today, companies have a wide range of internal and external data sources at their disposal. Organizations that make intensive use of large volumes of data (keyword: "big data") to support decision-making, and that see themselves as "data-driven", demonstrably achieve better financial and operational results. Against this backdrop, the data used in organizations is a critical resource. Currently, however, internal and external data is often scattered across many separate "data silos" with varying degrees of structure. This makes the data difficult to use, as employees cannot find and access the data relevant to their situation effectively and efficiently. Two modern options for mapping and formalizing static and dynamic domain knowledge are (enterprise) wikis and knowledge graphs.

(Enterprise) wikis can capture the data, information, and knowledge of many employees in semi-structured as well as unstructured (textual) form. Wikipedia, for example, became a central data source on the Internet within a short time of its launch in 2001, both for organizations and for individuals.

While wikis make data and information easily accessible and editable for humans, their semi-structured and unstructured format makes them harder to process algorithmically (keywords: machine learning and artificial intelligence) and, in particular, to interpret semantically. Specifically, algorithms can extract the semantic relations within a wiki's data and information automatically and in a quality-assured manner only to a limited extent. Knowledge graphs address this gap: they make data and information (e.g., from wikis) available as attributes of entities and relations between entities, so that further services can build on them and new insights can be derived.
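To make the "attributes and relations of and between entities" idea concrete, the following minimal sketch represents facts, as they might be extracted from a wiki article, as subject-predicate-object triples. The entity and predicate names are illustrative assumptions, not taken from the project.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in a knowledge graph: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts that might be extracted from a wiki article on "Munich"
triples = [
    Triple("Munich", "isCapitalOf", "Bavaria"),
    Triple("Munich", "hasPopulation", "1488202"),
    Triple("Bavaria", "isPartOf", "Germany"),
]

# In this structured form, algorithms can answer queries that would require
# text understanding on the raw wiki page, e.g. "all relations of Munich":
munich_facts = [t for t in triples if t.subject == "Munich"]
```

In practice, such triples would be expressed in a standard model like RDF, but the principle is the same: unstructured wiki text becomes machine-queryable entity-relation data.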

The characteristics of wikis and knowledge graphs pose new challenges for ensuring data quality, for example their varying degrees of structure, the large number of potential data sources and contributors (with mostly uncontrolled publication and usage processes), and semantic relations that are difficult to access. If data quality cannot be ensured, the consequences are severe: results derived from wikis and knowledge graphs, and decisions based on them, are only valid and value-creating if the underlying data and information are (semantically) correct, consistent, up-to-date, etc. In addition, many knowledge graphs (e.g., DBpedia) are built automatically by extracting structured data and information from the unstructured text of wikis. On the one hand, this is a great opportunity for data management, as the knowledge stored in wikis becomes usable for algorithms in the structured form of a knowledge graph. On the other hand, data quality problems in the underlying unstructured data can transfer directly to the knowledge graph, and errors during the extraction process can introduce new ones.
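As a simple illustration of what measuring data quality on a knowledge graph can mean, the sketch below computes a completeness metric: the share of entities that carry all attributes a schema requires. The triple representation, entity names, and required attributes are illustrative assumptions; the project's actual metrics are not specified here.

```python
# Hypothetical city entities as (subject, predicate, object) triples;
# Nuremberg is missing its population, simulating an extraction gap.
triples = [
    ("Munich", "hasPopulation", "1488202"),
    ("Munich", "isCapitalOf", "Bavaria"),
    ("Nuremberg", "isPartOf", "Bavaria"),
]

entities = {"Munich", "Nuremberg"}
required = {"hasPopulation"}  # attributes every city entity should have


def completeness(triples, entities, required):
    """Fraction of entities that carry all required attributes."""
    predicates = {e: {p for s, p, _ in triples if s == e} for e in entities}
    complete = sum(1 for e in entities if required <= predicates[e])
    return complete / len(entities)


print(completeness(triples, entities, required))  # 0.5: only Munich is complete
```

Analogous metrics could target other quality dimensions named above, such as consistency (no contradictory triples) or currency (age of the most recent revision).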

Therefore, the project "Data Quality Measurement and Measures for Wikis and Knowledge Graphs (DQMM@Wiki)" develops methods and metrics for measuring, as well as measures for improving, the data quality of wikis and knowledge graphs. In addition, it develops approaches for the automated, quality-assured creation of knowledge graphs from unstructured data sources such as (enterprise) wikis.

Cooperation partner: xapio GmbH

Sponsor: Bavarian State Ministry of Economic Affairs, Regional Development and Energy

Project period: 2020 - 2023