Description

Segment missingness can be annotated in the study data (mainly) in two ways:

  1. Participation in study segments is not recorded by specific variables. For example, there is no variable to acknowledge that a participant refused or could not take part in a specific examination, even though all the measurements for this participant in this segment are missing.

  2. There are specific variables to record individual participation in each study segment. For instance, a variable may indicate participation in the laboratory examination.

Use case (1) may be common in smaller studies. To calculate segment missingness, this implementation assumes that study variables are nested in respective segments and that the metadata specifies this structure. The function identifies all variables within each study segment, returns TRUE if all variables in a segment are missing, and FALSE otherwise.
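The rule for use case (1) can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's internal code: the toy data, the variable names (crp, gluc, sbp, dbp) and the two-column metadata are hypothetical.

```r
# Hypothetical toy data: three participants, two segments.
study_data <- data.frame(
  crp  = c(1.2, NA, 0.8),
  gluc = c(5.1, NA, NA),
  sbp  = c(120, 118, NA),
  dbp  = c(80, 76, NA)
)

# Hypothetical metadata that nests each variable in a segment.
meta_data <- data.frame(
  VAR_NAMES     = c("crp", "gluc", "sbp", "dbp"),
  STUDY_SEGMENT = c("LAB", "LAB", "PHYS_EXAM", "PHYS_EXAM")
)

# For each segment, a participant is flagged TRUE only if
# none of the segment's variables has an observed value.
segment_missing <- sapply(
  split(meta_data$VAR_NAMES, meta_data$STUDY_SEGMENT),
  function(vars) rowSums(!is.na(study_data[, vars, drop = FALSE])) == 0
)

# Percentage of participants with a completely missing segment.
pct_missing <- 100 * colMeans(segment_missing)
```

In this toy example, the second participant misses the whole LAB segment and the third misses the whole PHYS_EXAM segment; a participant with only some values missing in a segment is not counted.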

Use case (2) assumes a more complex study data and metadata structure, in which the study data include so-called intro-variables (either TRUE/FALSE indicators or codes for non-participation). The column STUDY_SEGMENT (previously KEY_STUDY_SEGMENT) in the metadata contains the name of the segment (usually the label of the respective intro-variable for each measurement variable). The column PART_VAR names the variable that indicates whether a measurement is expected to be present, which typically reflects the hierarchical study structure. In the subsequent calculation, this structure makes it possible to use the correct denominators for the missingness rates.
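Why the denominator matters can be shown with a small base R sketch. It only illustrates the idea of restricting the denominator to participants expected in a segment; the intro-variable part_lab, the measurement variables, and the exact counting rule are assumptions for illustration and may differ from what the package does internally.

```r
# Hypothetical toy data: part_lab is an intro-variable
# (TRUE = participant took part in the laboratory segment).
study_data <- data.frame(
  part_lab = c(TRUE, TRUE, FALSE, TRUE),
  crp      = c(1.2, NA, NA, 0.8),
  gluc     = c(5.1, NA, NA, NA)
)
lab_vars <- c("crp", "gluc")

# Segment is missing if none of its variables has an observed value.
all_missing <- rowSums(!is.na(study_data[, lab_vars, drop = FALSE])) == 0

# Naive rate: every participant enters the denominator.
pct_naive <- round(100 * sum(all_missing) / nrow(study_data), 2)

# Rate with the participation variable: only participants expected
# in the segment enter numerator and denominator.
expected     <- study_data$part_lab
n_expected   <- sum(expected)
n_missing    <- sum(all_missing & expected)
pct_expected <- round(100 * n_missing / n_expected, 2)
```

Here the naive rate counts the non-participant as a missing segment (2 of 4), whereas the restricted rate counts only the participant whose measurements are unexpectedly absent (1 of 3).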

The com_segment_missingness function implements the Missing values indicator, which belongs to the Crude Missingness domain in the Completeness dimension. For more details, see the user’s manual and the source code.

Usage and arguments

com_segment_missingness(
  study_data = sd1,
  meta_data = md1,
  label_col = "LABEL",
  threshold_value = 5,
  color_gradient_direction = "above",
  exclude_roles = c("secondary", "process")
)

The com_segment_missingness function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.
  • meta_data: mandatory, the data frame containing the study data’s metadata.
  • group_vars: optional, the variable used for grouping (e.g., observer or device). Defaults to NULL for output without grouping.
  • strata_vars: optional, the variable used for stratification. Defaults to NULL for no stratification.
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
  • threshold_value: optional with default (10%), a value between 0 and 100 that defines the percentage of missings per segment that should be considered critical. See also Algorithm of the implementation.
  • color_gradient_direction: optional with default above, can be either above or below. It specifies whether critical deviations lie above or below the threshold_value. If values above the threshold are considered critical, above should be selected; otherwise, below should be used. See also Algorithm of the implementation.
  • exclude_roles: optional, a character vector specifying the variable roles to exclude.

Segment missingness can be calculated for stratified data. In this case strata_vars must be specified.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the segment missingness function, the metadata column STUDY_SEGMENT is crucial. In use case (2) (see Description), this column specifies the intro-variable ID for each measurement variable. However, the column may also contain plain strings that name the segment, as in the example metadata:

    VAR_NAMES  LABEL           DATA_TYPE  STUDY_SEGMENT
4   v00003     AGE_0           integer    STUDY
39  v00030     MEDICATION_0    integer    INTERVIEW
1   v00000     CENTER_0        integer    STUDY
34  v00025     SMOKE_SHOP_0    integer    INTERVIEW
23  v00016     DEV_NO_0        integer    LAB
43  v40000     PART_INTERVIEW  integer    INTERVIEW
14  v00009     ARM_CIRC_0      float      PHYS_EXAM
18  v00012     USR_BP_0        string     PHYS_EXAM
33  v00024     SMOKING_0       integer    INTERVIEW
21  v00014     CRP_0           float      LAB


No stratification

The next function call specifies the analyses of missing segments without stratification, setting the threshold to 5%:

seg_miss_1 <- com_segment_missingness(
  study_data = sd1,
  meta_data = md1,
  label_col = "LABEL",
  threshold_value = 5,
  color_gradient_direction = "above",
  exclude_roles = c("secondary", "process")
)

The function returns a list with the elements SummaryData, ReportSummaryTable, and SummaryPlot. The SummaryData data frame expands over all possible combinations of the levels of the grouping variable and the examinations identified in the metadata. The threshold_value and the color_gradient_direction specified by the user are added to the data frame. Since color_gradient_direction = "above", all values above the threshold are considered critical and flagged with GRADING = 1.

Run seg_miss_1$SummaryData to see the output:

Group  Examinations   No. of Participants  No. of missing segments  (%) of missing segments  threshold  direction  GRADING
1      STUDY          2940                 0                        0.00                     5          above      0
1      PHYS_EXAM      2940                 160                      5.44                     5          above      1
1      LAB            2940                 113                      3.84                     5          above      0
1      INTERVIEW      2924                 332                      11.35                    5          above      1
1      QUESTIONNAIRE  2864                 0                        0.00                     5          above      0

A further output, SummaryPlot, is a heatmap-like graphic that highlights critical values depending on the respective threshold_value and color_gradient_direction. Call it with seg_miss_1$SummaryPlot:

Stratification

For some analyses, it is necessary to add new, transformed variables to the study data:

# use the month function of the lubridate package to extract the month of the exam date
library(lubridate)
# apply changes to a copy of the data
sd2 <- sd1
# extract the examination month (1-12) from the exam date variable
sd2$month <- month(sd2$v00013)

In this case, the metadata for the new variable must be added to the study metadata:

md_temp <- prep_add_to_meta(
  VAR_NAMES = "month",
  DATA_TYPE = "integer",
  LABEL = "EXAM_MONTH",
  VALUE_LABELS = "1 = January | 2 = February | 3 = March |
                  4 = April | 5 = May | 6 = June | 7 = July |
                  8 = August | 9 = September | 10 = October |
                  11 = November | 12 = December",
  MISSING_LIST = "",
  PART_VAR = "v20000",
  meta_data = md1
)

A subsequent call of the function may include the new variable:

seg_miss_2 <- com_segment_missingness(
  study_data = sd2,
  meta_data = md_temp,
  group_vars = "EXAM_MONTH",
  label_col = "LABEL",
  threshold_value = 1,
  color_gradient_direction = "above",
  exclude_roles = c("secondary", "process")
)

The output of seg_miss_2$SummaryPlot now uses facets from the ggplot2 package, such that each stratum of the new variable is shown as one facet:

Interpretation

This indicator uses a simple user-defined threshold. By default, the highest deviation from the threshold value is always displayed in dark red, irrespective of the absolute deviation. Classifying a deviation as critical is up to the user and involves qualitative interpretation.

Algorithm of the implementation

This implementation uses one threshold to discriminate critical from non-critical values. For instance, if threshold_value = 9 and color_gradient_direction = "above", then all values lower than the threshold_value are considered normal (displayed in dark blue in the plot and flagged with GRADING = 0 in the data frame), and all values above the threshold_value are considered critical. The displayed color shifts to a darker red the further a value lies above the threshold. All critical values are flagged with GRADING = 1 in the summary data frame. By default, the highest values are always shown in dark red, irrespective of the absolute deviation.

Conversely, if color_gradient_direction = "below" (for the same threshold_value = 9), all values greater than the threshold_value are considered normal (displayed in dark blue and flagged with GRADING = 0), and values lower than the threshold_value are considered critical.
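The grading rule can be sketched as a small base R function. The helper grade_segments is hypothetical and not part of dataquieR; it only mirrors the rule described in this section. The treatment of values exactly equal to the threshold is an assumption (here they count as non-critical). The counts are taken from the example output without stratification.

```r
# Hypothetical helper (not a dataquieR function): returns 1 for critical
# values and 0 otherwise, given a threshold and a direction.
grade_segments <- function(pct_missing, threshold_value = 10,
                           color_gradient_direction = c("above", "below")) {
  color_gradient_direction <- match.arg(color_gradient_direction)
  critical <- if (color_gradient_direction == "above") {
    pct_missing > threshold_value   # assumption: equality is non-critical
  } else {
    pct_missing < threshold_value   # assumption: equality is non-critical
  }
  as.integer(critical)
}

# Counts from the example output: STUDY, PHYS_EXAM, LAB, INTERVIEW, QUESTIONNAIRE.
n_missing      <- c(0, 160, 113, 332, 0)
n_participants <- c(2940, 2940, 2940, 2924, 2864)
pct <- round(100 * n_missing / n_participants, 2)

grading_above <- grade_segments(pct, threshold_value = 5,
                                color_gradient_direction = "above")
grading_below <- grade_segments(pct, threshold_value = 5,
                                color_gradient_direction = "below")
```

With threshold_value = 5 and direction "above", PHYS_EXAM (5.44%) and INTERVIEW (11.35%) are graded 1, in line with the GRADING column of the example output; with "below", the grading is reversed.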