Introduction

Metadata is commonly defined as “data that describe other data” (Nadkarni 2011). It provides the information needed to interpret study data correctly and guides data quality (DQ) assessments as well as statistical analyses. Examples include lists of value codes used to examine reasons for incomplete data and value labels that make reports interpretable. Some metadata is specific to certain DQ assessments, while other metadata is used across most DQ implementations.


Storage of metadata

Metadata is commonly stored in data dictionaries (DDs). DDs frequently contain the name of a variable, its data type, and, if applicable, labels for the levels of a categorical variable (Meyer et al. 2012). A DD should be available for the study data of each research study. However, DDs often hold only a subset of the information necessary for data quality assessments and therefore need to be extended to cover aspects related to data quality. If this is not possible, metadata may also be stored in a spreadsheet-type format, such as data frames. dataquieR uses predefined metadata provided as data frames, as described below.
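
To make the idea of a data dictionary concrete, the following is a minimal sketch in plain Python rather than R. It assumes a dictionary with one record per variable holding a name, a label, a data type, and value labels. The column names, the variable identifiers, and the helper `label_values` are hypothetical and chosen for illustration only; they are not presented as the dataquieR schema or API.

```python
# Hypothetical data dictionary: one record per study variable.
# Column names and contents are illustrative assumptions, not an official schema.
data_dictionary = [
    {"VAR_NAMES": "v00001", "LABEL": "SEX", "DATA_TYPE": "integer",
     "VALUE_LABELS": {0: "female", 1: "male"}},
    {"VAR_NAMES": "v00002", "LABEL": "AGE", "DATA_TYPE": "integer",
     "VALUE_LABELS": {}},
]


def label_values(variable, values, dictionary):
    """Replace coded values with their labels where the dictionary defines them."""
    entry = next(row for row in dictionary if row["VAR_NAMES"] == variable)
    labels = entry["VALUE_LABELS"]
    # Values without a label (e.g., continuous measurements) pass through unchanged.
    return [labels.get(value, value) for value in values]


labelled = label_values("v00001", [0, 1, 1, 0], data_dictionary)
```

Keeping the value labels next to the variable name means a report can show readable category names without changing the stored study data.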


How dataquieR uses metadata

The metadata schema used by dataquieR is based on a formal data quality framework for observational studies (Schmidt et al. 2021). dataquieR uses metadata that is organized in a structured form across four tables:

  1. Item level: descriptions and expectations about single data elements (variables/items), e.g., columns in the study data table.
  2. Cross-item level: descriptions and expectations about the joint use of two or more data elements (variables/items) for data quality assessments.
  3. Segment level: descriptions and expectations about the provided segments (e.g., different study examinations).
  4. Data frame level: descriptions and expectations about entire data frames.
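
The four levels listed above can be pictured as four linked tables. The sketch below is written in plain Python under simplifying assumptions: every table name, column name, and value is hypothetical, and the helper `unknown_references` is not part of dataquieR. It only illustrates how cross-item and item-level rows can refer to entries described in other tables, and how such references might be checked.

```python
# Hypothetical, simplified stand-ins for the four metadata tables.
metadata = {
    "item_level": [
        {"VAR_NAMES": "sbp", "DATA_TYPE": "float", "SEGMENT": "exam"},
        {"VAR_NAMES": "dbp", "DATA_TYPE": "float", "SEGMENT": "exam"},
    ],
    "cross_item_level": [
        # A rule that uses two variables jointly.
        {"CHECK_LABEL": "sbp_above_dbp", "VARIABLES": ["sbp", "dbp"]},
    ],
    "segment_level": [
        {"SEGMENT": "exam", "DESCRIPTION": "physical examination"},
    ],
    "dataframe_level": [
        {"DF_NAME": "study_data", "EXPECTED_COLUMNS": 2},
    ],
}


def unknown_references(meta):
    """List variables or segments that are referenced but never described."""
    known_vars = {row["VAR_NAMES"] for row in meta["item_level"]}
    known_segments = {row["SEGMENT"] for row in meta["segment_level"]}
    problems = []
    for row in meta["cross_item_level"]:
        problems += [v for v in row["VARIABLES"] if v not in known_vars]
    for row in meta["item_level"]:
        if row["SEGMENT"] not in known_segments:
            problems.append(row["SEGMENT"])
    return problems


issues = unknown_references(metadata)
```

An empty result means every cross-item rule and every segment assignment points to an entry that is described elsewhere in the metadata.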

The metadata schema also allows users to supply further tables, for example a missing table that defines the missing codes and jump codes for each variable, and reference tables for participant IDs at the segment and data frame levels.
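
As an illustration of what a missing table makes possible, the sketch below classifies coded values as observed, missing, or jump. It is plain Python under stated assumptions: the numeric codes, the variable name, and the function `classify` are hypothetical, and jump codes are assumed here to mark items that were skipped by design.

```python
# Hypothetical missing table: for each variable, the codes that mark
# missing values and the codes that mark jumps (assumed: skipped by design).
missing_table = {
    "v00003": {"missing": {99980, 99983}, "jump": {88880}},
}


def classify(variable, values, table):
    """Tag each value as 'missing', 'jump', or 'observed' using the table."""
    # Variables without an entry are treated as having no special codes.
    codes = table.get(variable, {"missing": set(), "jump": set()})
    result = []
    for value in values:
        if value in codes["missing"]:
            result.append("missing")
        elif value in codes["jump"]:
            result.append("jump")
        else:
            result.append("observed")
    return result


status = classify("v00003", [12, 99980, 88880, 7], missing_table)
```

Separating jumps from other missing values matters for completeness assessments, because a value that was never expected should not count against data quality.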

Each metadata table is arranged as a spreadsheet in a workbook to facilitate user input. Users can provide metadata directly in the spreadsheet or by specifying the source file for a specific item (e.g., another spreadsheet or a URL). Additionally, the tables can contain information to control the report output (e.g., the role or order of variables in the report) and the calculation of the quality indicators.

NOTE: In all metadata tables, the column names are written in upper case letters to distinguish them from the column names in the study data.



Meyer, J., Ostrzinski, S., Fredrich, D., Havemann, C., Krafczyk, J., and Hoffmann, W. (2012). Efficient data management in a large-scale epidemiology research project. Computer Methods and Programs in Biomedicine 107, 425–435.
Nadkarni, P.M. (2011). Metadata-driven software systems in biomedicine: Designing systems that can adapt to changing knowledge (Springer Science & Business Media).
Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology 21, 1–15.