Introduction

The most basic data quality (DQ) assessments target single data elements (variables/items). Data quality assessments therefore require item level metadata. For detailed information on item level metadata, please see Richter et al. 2019. An overview of the item level metadata used by dataquieR is provided below.

Item level metadata for data quality reporting

VARIABLE AND VALUE LABELS

VAR_NAMES

The first column specifies the variable name in the study data to be analysed. The input must be a string without blank spaces.

LABEL

Appropriate labels are a necessary precondition for readable data quality reports. Their absence, however, does not affect the functionality of the statistical implementations.

CAVEAT: A necessary convention for all labels in the current project phase is that they are unique and short. This is necessary because labels that are too long may corrupt reports.

Assigning labels to variables is important because variable names in the study data are rather technical and hard to interpret. Like variable names, each variable label should be unique. In addition, labels should be as short as possible to ensure a readable output. To enhance the presentation and plotting quality, a LABEL should not exceed 20 characters.

VAR_NAMES	LABEL
v00000	CENTER_0
v00001	PSEUDO_ID
v00002	SEX_0
v00003	AGE_0
v00103	AGE_GROUP_0

All implementations of dataquieR support the use of LABELS.
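The two label conventions above (uniqueness and a maximum of 20 characters) are easy to check before a report is generated. The following is a minimal Python sketch, not part of dataquieR; the function name check_labels is hypothetical, and the labels are taken from the excerpt above.

```python
def check_labels(labels, max_len=20):
    """Return (duplicates, too_long) for a list of LABEL entries."""
    seen = set()
    duplicates = []
    for label in labels:
        # Record each duplicated label once, in order of first repetition.
        if label in seen and label not in duplicates:
            duplicates.append(label)
        seen.add(label)
    too_long = [label for label in labels if len(label) > max_len]
    return duplicates, too_long


labels = ["CENTER_0", "PSEUDO_ID", "SEX_0", "AGE_0", "AGE_GROUP_0"]
duplicates, too_long = check_labels(labels)
print(duplicates, too_long)  # both lists are empty for this excerpt
```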

LONG_LABEL

Under some circumstances, a short label or variable name is insufficient to provide all necessary information. The column “LONG_LABEL” can be filled with self-explaining annotations for variables. Long labels are more relevant for tabular output than for graphical output.

Short or long labels can be selected in all implementations of dataquieR by specifying the formal argument label_col.

VALUE_LABELS

Categorical variables in the study data are often coded as integers (e.g. 0, 1). Because the numbers themselves are not informative, labels are essential for understandable reports, e.g.:

The sex of participants can be coded as \(0 = females\) and \(1 = males\).
The presence of a disease can be coded as \(0 = no\) and \(1 = yes\).

To make use of VALUE_LABELS in dataquieR, the following convention applies: all values of a study variable and their respective labels are summarized in a single list, using the pipe character \(|\) as separator. For example:

	VAR_NAMES	LABEL	VALUE_LABELS
3	v00002	SEX_0	0 = females | 1 = males
12	v00007	ASTHMA_0	0 = no | 1 = yes

To enhance presentation and plotting quality the character length of a value label specified in VALUE_LABELS should not exceed 20 characters.
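To illustrate the convention, the sketch below splits a VALUE_LABELS entry into a mapping from code to label. It is a Python illustration under the assumption that all codes are integers; parse_value_labels is a hypothetical helper, not a dataquieR function.

```python
def parse_value_labels(spec):
    """Turn '0 = females | 1 = males' into {0: 'females', 1: 'males'}."""
    mapping = {}
    for part in spec.split("|"):
        # Split only at the first '=' so that labels may contain '=' themselves.
        code, label = part.split("=", 1)
        mapping[int(code.strip())] = label.strip()
    return mapping


sex_labels = parse_value_labels("0 = females | 1 = males")
print(sex_labels[1])  # males
```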

DATA_TYPE

In contrast to LABEL, the definition of the DATA_TYPE is crucial because the applicability of dataquieR functions may depend on the data type.

The following DATA_TYPES are differentiated in dataquieR:

float
integer
datetime
string

The list appears small compared to some electronic data capturing systems (e.g. REDCap, Harris et al. 2009) or Shiny Apps (Chang et al. 2018). However, the data type should not be mixed up with data entry types which could be very different using sliders or radio buttons. Similarly, the data type is not a statistical property such as an ordinal characteristic.

	VAR_NAMES	LABEL	DATA_TYPE
2	v00001	PSEUDO_ID	string
3	v00002	SEX_0	integer
9	v00004	SBP_0	float
10	v00005	DBP_0	float
19	v00013	EXAM_DT_0	datetime

SCALE_LEVEL

The definition of the SCALE_LEVEL is important because it defines what type of mathematical operations and dataquieR functions can be applied to the data.

The following SCALE_LEVEL values are allowed in dataquieR:

nominal
ordinal
interval
ratio
na

The category na is used for variables that do not fit in the other categories (e.g., unstructured texts, json, xml).

MISSINGS

Data often contain values that are not measurements, for example codes for missing values. The figure below shows the use of such codes in the variable V_0101. Both measurement values and missing codes are considered data values.

Categorization of measurements and missing values in dataquieR

Using such codes may complicate the application of standardized routines for DQ assessment since coded missing measurements must be correctly interpreted. For example, it must be ensured that a data value representing a missing code is not treated as a measurement value, to avoid spurious results when addressing data accuracy. Therefore, codes representing non-measurement values must be correctly identified and handled.

dataquieR distinguishes missing codes through the MISSING_LIST metadata column and jump codes through the JUMP_LIST column. However, providing the codes in an additional table, called the missing table, is recommended. For this, a string with the table name can be specified in the column MISSING_LIST_TABLE, referring to a spreadsheet in the same or another workbook or to a URL.

CAVEAT: currently, within a variable, dataquieR only accepts missing codes that match the data type of the variable. For dates, for example, the missing table can only contain missing codes in date time format, such as 1800.01.01 00:00:00 AM; and not numeric codes such as 99981. This must be kept in mind for the metadata columns MISSING_LIST, JUMP_LIST and the MISSING_LIST_TABLE.

MISSING_LIST

Codes specified in the MISSING_LIST indicate unexpected missingness of measurements, for example missing values due to refusals or technical problems.

The MISSING_LIST is a list of pipe \(|\) separated numeric codes: \(99980\: |\: 99983\: |\: 99988\).

JUMP_LIST

Codes in the JUMP_LIST indicate measurements which are missing by design. For example, if a sub-sample of a study population does not participate in a specific examination (by design) then jump-codes should be used to indicate this reason for missingness.

The JUMP_LIST is a list of pipe \(|\) separated numeric codes: \(88880\: |\: 88883\: |\: 88884\).
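The distinction between the two code lists can be sketched in a few lines. This is an illustrative Python sketch, not dataquieR code: the helper names parse_code_list and classify_value are hypothetical, the codes are those from the examples above, and the measurement values are made up.

```python
def parse_code_list(spec):
    """Split a pipe-separated list such as '99980 | 99983' into integers."""
    return [int(code.strip()) for code in spec.split("|") if code.strip()]


def classify_value(value, missing_codes, jump_codes):
    """Label a data value as 'jump', 'missing' or 'measurement'."""
    if value in jump_codes:
        return "jump"      # missing by design
    if value in missing_codes:
        return "missing"   # unexpected missingness
    return "measurement"


missing_codes = parse_code_list("99980 | 99983 | 99988")
jump_codes = parse_code_list("88880 | 88883 | 88884")

data_values = [120, 99980, 88883, 135]
classes = [classify_value(v, missing_codes, jump_codes) for v in data_values]
print(classes)
```

Only the values classified as measurements should enter accuracy-related checks.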

MISSING_LIST_TABLE

The name and location of a table containing the missing assignments for the respective variable. The input must be a string and can refer to a spreadsheet in the same or another workbook.

In the example below, the missing codes for the variables are specified in the sheet called missing_table of the same workbook.

	VAR_NAMES	LABEL	MISSING_LIST_TABLE
9	v00004	SBP_0	missing_table
10	v00005	DBP_0	missing_table
11	v00006	GLOBAL_HEALTH_VAS_0	missing_table

If the codes were defined in another workbook, the path and name of the spreadsheet must be given. For example: "d:/data/questionnaire_data_codes.xlsx | missing_codes".

LIMITS

Limits describe ranges to check the plausibility of measurement values (hard, soft limits) or to identify measurements outside a measurable range (detection limits). Limits may apply to study data of type: float, integer, and date-time. Specifying limits can be content-driven (e.g. based on clinical information) or may depend on properties of the used examination device or the outcome under study. For example, body weight cannot be negative.

Unfortunately, the definition of limits can be ambiguous:

a plausibility limit of “\(\gt10\)” may imply that all values above 10 are plausible.
however, this notation is also frequently used to guide decisions in eCRFs, i.e. if a value is “\(\gt10\)” then alert the user regarding an implausible value.

To avoid this ambiguity, HARD_LIMITS, SOFT_LIMITS, and DETECTION_LIMITS in the metadata are defined using interval notation. Values inside the interval are eligible/plausible/possible. The interval definitions also distinguish between round and square brackets:

\((0;\:10)\): open interval, i.e. values \(>0\) and \(<10\) are inside the interval.
\((0;\:10]\): left-open interval, i.e. values \(>0\) and \(\le10\) are inside the interval.
\([0;\:10)\): right-open interval, i.e. values \(\ge0\) and \(<10\) are inside the interval.
\([0;\:10]\): closed interval, i.e. values \(\ge0\) and \(\le10\) are inside the interval.

Each side of the interval must be defined by a value of the same type as the measurement (including dates and date-times). If one side of the range is unbounded, \(-Inf\) and/or \(Inf\) must be specified there. Please see the examples provided in Metadata in dataquieR.
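The interval notation can be made concrete with a small parser. The following Python sketch handles numeric limits only (dates and date-times are not covered) and is not how dataquieR itself parses limits; parse_interval and contains are hypothetical names.

```python
import re

# Bracket, lower bound, ';', upper bound, bracket, e.g. '(0;10]' or '[50;Inf)'.
INTERVAL = re.compile(r"^\s*([\[(])\s*([^;]+?)\s*;\s*([^;]+?)\s*([\])])\s*$")


def parse_interval(spec):
    """Return (low, high, low_closed, high_closed) for a numeric interval."""
    match = INTERVAL.match(spec)
    if match is None:
        raise ValueError(f"not an interval: {spec!r}")
    left, low, high, right = match.groups()
    # float() understands 'Inf' and '-Inf', so unbounded sides need no special case.
    return float(low), float(high), left == "[", right == "]"


def contains(value, spec):
    """True if value lies inside the interval described by spec."""
    low, high, low_closed, high_closed = parse_interval(spec)
    above = value >= low if low_closed else value > low
    below = value <= high if high_closed else value < high
    return above and below


print(contains(10, "(0;10]"))      # True: the right side is closed
print(contains(10, "[0;10)"))      # False: the right side is open
print(contains(1000, "[50;Inf)"))  # True: no upper bound
```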

Two types of limits may be distinguished depending on whether the range indicates inadmissible or just unlikely values.

HARD_LIMITS

HARD_LIMITS should be specified to identify inadmissible values. Inadmissible does not necessarily mean impossible. For example, while it is known that the heaviest man on Earth weighed more than 600 kg, it may be reasonable to declare values above 250 kg as inadmissible because, under the circumstances of a general-population study in Germany, it is deemed unlikely that a heavier person will arrive at the examination center.

For example, for blood pressure measurements, we may specify the following hard limits.

	VAR_NAMES	LABEL	HARD_LIMITS
9	v00004	SBP_0	[80;180]
10	v00005	DBP_0	[50;Inf)

SOFT_LIMITS

The functionality of SOFT_LIMITS is similar to HARD_LIMITS. However, values outside the limits are not removed, because SOFT_LIMITS indicate improbable but not impossible measurements.

The formal setup of SOFT_LIMITS is identical to HARD_LIMITS, as shown in the metadata excerpt below.

	VAR_NAMES	LABEL	SOFT_LIMITS
9	v00004	SBP_0	(90;170)
10	v00005	DBP_0	(55;100)

DETECTION_LIMITS

The definition of DETECTION_LIMITS can be necessary if measurement devices have predefined limits of sensitivity. It is possible that measurements are indicated as being below or above the DETECTION_LIMITS. Such information should result in a different management of respective data values as they are still informative and can be used in later analysis.

Values outside detection limits are not removed.

The formal setup of DETECTION_LIMITS is identical to HARD_LIMITS and SOFT_LIMITS, as can be seen in the metadata example below.

	VAR_NAMES	LABEL	DETECTION_LIMITS
9	v00004	SBP_0	[0;265]
10	v00005	DBP_0	[0;265]
21	v00014	CRP_0	[0.16;Inf)

ACCURACY RELATED

DISTRIBUTION

If the expected probability distribution of an outcome is known, deviations from this distribution can be examined. For example, systolic blood pressure may be assumed to approximately follow a normal distribution. If an examiner rounds blood pressure values to multiples of ten, this can seriously distort the distribution of the data.

Currently, the following probability distributions are supported by dataquieR with dedicated DQ implementations:

normal
uniform
gamma

This restriction does not impede consideration of count data.

As shown below, we may expect a normal distribution for blood pressure variables, while the expectation may be a gamma distribution for the C-reactive protein measurement.

	VAR_NAMES	LABEL	DISTRIBUTION
9	v00004	SBP_0	normal
10	v00005	DBP_0	normal
21	v00014	CRP_0	gamma

DECIMALS

In the column DECIMALS, the accuracy of measurements in terms of the expected number of decimal places can be specified. This may be necessary, for example, if rounding is expected when addressing end digit preferences.

In the example below, no decimals are expected for the blood pressure variables, whereas three decimals are expected for the C-reactive protein measurement.

	VAR_NAMES	LABEL	DECIMALS
9	v00004	SBP_0	0
10	v00005	DBP_0	0
21	v00014	CRP_0	3
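As an illustration of what such a specification allows, the sketch below counts the decimal places of values as they were recorded and compares them with the expected number. It is a Python sketch under the assumption that values are available as finite numeric strings; decimal_places is a hypothetical helper and does not reflect how dataquieR evaluates DECIMALS.

```python
from decimal import Decimal


def decimal_places(text):
    """Number of digits after the decimal point in a numeric string."""
    exponent = Decimal(text).as_tuple().exponent
    # Whole numbers have exponent 0 (or positive for forms like '1E+2').
    return max(0, -exponent)


expected = {"SBP_0": 0, "CRP_0": 3}
recorded = {"SBP_0": "120.5", "CRP_0": "2.345"}

for label, text in recorded.items():
    ok = decimal_places(text) <= expected[label]
    print(label, text, "ok" if ok else "more decimals than expected")
```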

UNIVARIATE_OUTLIER_CHECKTYPE

Sets the type of check for the assessment of univariate outliers. The Tukey, 3SD, Hubert and SigmaGap approaches are currently supported as input.

For instance, in the metadata excerpt below, SigmaGap and Tukey are specified.

	VAR_NAMES	LABEL	UNIVARIATE_OUTLIER_CHECKTYPE
37	v00028	INCOME_GROUP_0	Sigma_gap | Tukey

N_RULES

Specifies the number of rules that must be violated for an observation to be flagged as an outlier. It applies to all potential assessment rules for univariate outliers.

The metadata example below specifies the default testing of the four outlier criteria (Tukey, 3SD, Hubert and SigmaGap) for the smoking variable. Since N_RULES is four, an observation will be flagged as an outlier when all criteria classify it as an outlier. In contrast, for the income variable, only SigmaGap and Tukey are specified; hence the maximum number of rules that can be violated is two, and an observation will be marked as an outlier when identified as such by both tests.

	VAR_NAMES	LABEL	UNIVARIATE_OUTLIER_CHECKTYPE	N_RULES
33	v00024	SMOKING_0	NA	4
37	v00028	INCOME_GROUP_0	Sigma_gap | Tukey	2
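The interplay of several outlier rules and N_RULES can be sketched as follows. This Python illustration implements only the textbook Tukey fences (1.5 times the interquartile range) and the 3SD rule; Hubert and SigmaGap are omitted, the data are made up, and it is assumed that an observation is flagged when at least N_RULES rules classify it as an outlier. It is not the dataquieR implementation, and the quartile method may differ from the one used there.

```python
import statistics


def tukey_flags(values):
    """Flag values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < low or v > high for v in values]


def three_sd_flags(values):
    """Flag values more than three standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) > 3 * sd for v in values]


def flag_outliers(values, rules, n_rules):
    """Flag an observation if at least n_rules of the rules flag it."""
    per_rule = [rule(values) for rule in rules]
    return [sum(flags) >= n_rules for flags in zip(*per_rule)]


values = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 10, 12, 11, 100]
flags = flag_outliers(values, [tukey_flags, three_sd_flags], n_rules=2)
print([v for v, flagged in zip(values, flags) if flagged])  # [100]
```

With two rules and N_RULES set to two, an observation is flagged only if both rules agree, which mirrors the income example above.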

LOCATION_METRIC

Defines the location metric to be used. In the example below, the mean is specified for both blood pressure variables.

	VAR_NAMES	LABEL	LOCATION_METRIC
9	v00004	SBP_0	Mean
10	v00005	DBP_0	Mean

LOCATION_RANGE

Defines the range for the expected location parameter specified in LOCATION_METRIC. For the blood pressure variables, the example below shows the expected location of the mean value.

	VAR_NAMES	LABEL	LOCATION_METRIC	LOCATION_RANGE
9	v00004	SBP_0	Mean	(100;140)
10	v00005	DBP_0	Mean	(60;100)

PROPORTION_RANGE

Defines the range of the expected proportions. For instance, the metadata excerpt below shows the specification of the expected sex proportions in the data.

	VAR_NAMES	LABEL	PROPORTION_RANGE
3	v00002	SEX_0	(48;52)
7	v01002	SEX_1	(48;52)
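How such a range might be checked is sketched below in Python. The sketch assumes that the bounds are percentages, that the interval (48;52) is open and has already been parsed into its two bounds, and that the range is applied to every category; these are assumptions for illustration, and proportions_in_range is a hypothetical helper rather than a dataquieR function.

```python
from collections import Counter


def proportions_in_range(values, low, high):
    """Map each category to (percentage, inside the open interval (low; high))."""
    counts = Counter(values)
    total = len(values)
    result = {}
    for category, count in counts.items():
        percent = 100 * count / total
        result[category] = (percent, low < percent < high)
    return result


sex = [0] * 49 + [1] * 51  # made-up data: 49 females, 51 males
print(proportions_in_range(sex, 48, 52))
```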

CO_VARS

Defines the variables to be used as covariates in regression analyses. These are used, for example, to adjust in analyses related to observer effects. For the blood pressure variables, both age and sex are taken as covariates in the metadata excerpt below.

	VAR_NAMES	LABEL	CO_VARS
9	v00004	SBP_0	AGE_0 | SEX_0
10	v00005	DBP_0	AGE_0 | SEX_0

PROCESS RELATED

Process related metadata provide information on the circumstances under which a measurement has been collected. Such information may be the ID of an examiner or device, the time when a measurement was conducted, or the way of data entry. This information is an essential precondition for the computation of several data quality indicators.

DATA_ENTRY_TYPE

Some measurements may be transferred manually from a measurement device to the case report form. Manual data entry can be prone to typos and errors, so the respective measurements should be checked for abnormalities. DATA_ENTRY_TYPE acts as an indicator for manual data transfer \((0=none, 1=yes)\).

For instance, the data entry type may be automated from a blood pressure measurement device but it could be manual for the blood cell sedimentation rate (BSG).

	VAR_NAMES	LABEL	DATA_ENTRY_TYPE
9	v00004	SBP_0	0
10	v00005	DBP_0	0
22	v00015	BSG_0	1

GROUP-COLUMNS

Group columns in the metadata specify relations between different (study data) variables. For example, a link should be established between variables related to measurements (e.g., systolic blood pressure or SBP), and the variables that indicate which examiners conducted these examinations. This information is necessary to calculate observer effects.

There are different types of process variables, which may be assigned through group columns. These are

GROUP_VAR_OBSERVER for the examiners
GROUP_VAR_DEVICE for devices
TIME_VAR for variables indicating the date or time at which measurements were conducted
STUDY_SEGMENT to denote the study segment (e.g., examination) to which a variable belongs

Specifying group columns is essential to enable a semi-automatic analysis of observer or device effects and to assess effects over time. It is also a prerequisite for pipeline-based generation of DQ reports.

An example of the definition of such metadata can be seen below.

	VAR_NAMES	LABEL	GROUP_VAR_OBSERVER	GROUP_VAR_DEVICE	TIME_VAR	STUDY_SEGMENT
9	v00004	SBP_0	USR_BP_0	NA	EXAM_DT_0	PHYS_EXAM
10	v00005	DBP_0	USR_BP_0	NA	EXAM_DT_0	PHYS_EXAM
14	v00009	ARM_CIRC_0	NA	ARM_CUFF_0	EXAM_DT_0	PHYS_EXAM
16	v00010	ARM_CUFF_0	NA	NA	NA	PHYS_EXAM
18	v00012	USR_BP_0	NA	NA	NA	PHYS_EXAM
19	v00013	EXAM_DT_0	NA	NA	NA	PHYS_EXAM
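The table above can be read as a set of links between variables. The following Python sketch resolves such a link for a measurement variable; it uses a hand-copied subset of the rows above, represents NA as None, and the helper name resolve_group_variable is hypothetical rather than part of dataquieR.

```python
metadata = [
    {"VAR_NAMES": "v00004", "LABEL": "SBP_0",
     "GROUP_VAR_OBSERVER": "USR_BP_0", "GROUP_VAR_DEVICE": None, "TIME_VAR": "EXAM_DT_0"},
    {"VAR_NAMES": "v00009", "LABEL": "ARM_CIRC_0",
     "GROUP_VAR_OBSERVER": None, "GROUP_VAR_DEVICE": "ARM_CUFF_0", "TIME_VAR": "EXAM_DT_0"},
    {"VAR_NAMES": "v00010", "LABEL": "ARM_CUFF_0",
     "GROUP_VAR_OBSERVER": None, "GROUP_VAR_DEVICE": None, "TIME_VAR": None},
    {"VAR_NAMES": "v00012", "LABEL": "USR_BP_0",
     "GROUP_VAR_OBSERVER": None, "GROUP_VAR_DEVICE": None, "TIME_VAR": None},
    {"VAR_NAMES": "v00013", "LABEL": "EXAM_DT_0",
     "GROUP_VAR_OBSERVER": None, "GROUP_VAR_DEVICE": None, "TIME_VAR": None},
]


def resolve_group_variable(metadata, label, group_column):
    """Return the VAR_NAMES entry of the variable a group column points to."""
    by_label = {row["LABEL"]: row for row in metadata}
    target = by_label[label].get(group_column)
    if target is None:
        return None  # no related variable defined (NA in the table)
    return by_label[target]["VAR_NAMES"]


print(resolve_group_variable(metadata, "SBP_0", "GROUP_VAR_OBSERVER"))  # v00012
print(resolve_group_variable(metadata, "SBP_0", "GROUP_VAR_DEVICE"))    # None
```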

REPORT DESIGN

VARIABLE_ROLE

This attribute is currently used only in internal functions.

VARIABLE_ORDER

In this column the order of the variables in a data quality report can be defined. For example, this column may be as follows.

VAR_NAMES	LABEL	VARIABLE_ORDER
v00000	CENTER_0	1
v00001	PSEUDO_ID	2
v00002	SEX_0	3
v00003	AGE_0	4

How dataquieR uses item level metadata

dataquieR employs the predefined item level metadata in two ways:

For each variable of the study data named in a function call of a DQ implementation, the respective metadata are interpreted from a data frame of metadata.
Some implementations also search for relations between variables, such as a date-time-stamp that belongs to a measurement. The section GROUP-COLUMNS explains the definition of such relations.

Therefore, metadata and study data must be defined in a 1:1 correspondence, i.e., each variable of the study data is identifiable in the metadata. The key for this mapping is the variable name, listed in the column VAR_NAMES in the metadata. A necessary convention regarding variable names is their uniqueness, i.e., none should have a duplicate (also implied by the 1:1 correspondence). Further, all metadata columns are defined in upper case letters to distinguish them from the study data.
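The 1:1 correspondence can be verified with a simple comparison of names. The sketch below is a Python illustration with made-up inputs; check_correspondence is a hypothetical helper and not a dataquieR function.

```python
def check_correspondence(study_columns, var_names):
    """Compare study data column names with the VAR_NAMES metadata column."""
    return {
        # VAR_NAMES entries that occur more than once in the metadata
        "duplicated": sorted({n for n in var_names if var_names.count(n) > 1}),
        # study variables without a metadata row
        "missing_in_metadata": sorted(set(study_columns) - set(var_names)),
        # metadata rows without a study variable
        "missing_in_study": sorted(set(var_names) - set(study_columns)),
    }


study_columns = ["v00000", "v00001", "v00002"]
var_names = ["v00000", "v00001", "v00003", "v00003"]
print(check_correspondence(study_columns, var_names))
```

All three lists are empty when metadata and study data match one to one.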

Typical item-level metadata are the variable name (or identifier), measurements (e.g., C-reactive protein or CRP), and process variables (e.g., examiner ID) (left panel in the figure below). In some cases, metadata variables must be added if appropriate metadata varies across observations. The link from the measurement variable (CRP) to the respective metadata variable (RefLimits_v101) is defined via its own column in the metadata. The top right panel of the figure below shows key_ref_limits, which specifies which variable contains the time-varying reference limits for CRP. Similar columns are denoted as group columns and point to the associated metadata variable. Such structural information is required to implement automated data quality assessment procedures. The bottom right panel of the figure below depicts the relationship between study data, metadata, and links between study data.

Overview of metadata usage in dataquieR (Richter et al. 2019)

dataquieR uses the following terms for data structures:

Key terms related to the data structure used in dataquieR (Schmidt et al. 2021)


Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., et al. (2018). Shiny: Web application framework for R, 2015. R Package Version 1, 14.

Harris, P.A., Taylor, R., Thielke, R., Payne, J., Gonzalez, N., and Conde, J.G. (2009). Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42, 377–381.

Richter, A., Schössow, J., Werner, A., Schauer, B., Radke, D., Henke, J., Struckmann, S., and Schmidt, C. (2019). Data quality monitoring in clinical and observational epidemiologic studies: The role of metadata and process information. GMS Med Inform Biom Epidemiol 15.

Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology 21, 1–15.

Definition and use of item level metadata