Data source

The data source for this data quality assessment is an ongoing seroprevalence study within the Study of Health in Pomerania (SHIP). For confidentiality reasons, the raw data are not shown here.

Preprocessing of data

SHIP metadata

The source for the metadata has been published in an OPAL repository for COVID-19 studies. As these data follow a standardized annotation, only a few steps were required to transform them into a dataquieR-conforming format.

R code for the respective transformation is made available upon request. The following metadata adhere to the conventions of the dataquieR R package.

Missing codes

In the SHIP-C19 study data, missing codes are used to qualify missing data. These codes were not specified in the public-access data dictionary. However, upon request, the use of the following codes was reported:

  • 99999 = finally missing
  • 99977 = processing
  • 99988 = permissible jump

These codes were manually added to the data dictionary (DD).
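
In dataquieR-style metadata, such codes are usually stored per variable in the columns MISSING_LIST and JUMP_LIST, with several codes separated by " | ". The following is a minimal sketch of that manual step under this assumption. The data frame `dd_toy` and its two rows are hypothetical and stand in for the real data dictionary, which was edited in the xlsx file.

```r
# Hypothetical miniature data dictionary (not the SHIP-C19 dictionary)
dd_toy <- data.frame(
  VAR_NAMES = c("q01", "q02"),
  DATA_TYPE = c("integer", "integer"),
  stringsAsFactors = FALSE
)

# Codes that qualify missing values: finally missing, processing
dd_toy$MISSING_LIST <- paste(c("99999", "99977"), collapse = " | ")

# Code that marks a permissible jump
dd_toy$JUMP_LIST <- "99988"

print(dd_toy)
```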

ship_meta <- openxlsx::read.xlsx(xlsxFile = "C:/Users/richtera/Documents/git_projects/dfg_website/_data/SHIP-C19/shipc19_dd_dataquieR_m.xlsx", 
                                 sheet = 1)

The respective code labels are stored in a data frame (passed below as shipc19_mc).
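
A data frame of code labels could look like the sketch below. The column names CODE_VALUE and CODE_LABEL are an assumption about the layout expected by the `cause_label_df` argument in the dataquieR version used here, and the object name `shipc19_mc_sketch` is chosen to keep it apart from the actual object.

```r
# Assumed layout: one row per code, columns CODE_VALUE and CODE_LABEL
shipc19_mc_sketch <- data.frame(
  CODE_VALUE = c(99999, 99977, 99988),
  CODE_LABEL = c("finally missing", "processing", "permissible jump"),
  stringsAsFactors = FALSE
)

print(shipc19_mc_sketch)
```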

Variable names

None of the data element names in the study data match the VAR_NAMES in the metadata; all 33 study data columns are unmatched.

length(setdiff(names(ship), as.character(ship_meta$VAR_NAMES)))
## [1] 33

Fix: the variable names in the study data carry a simple prefix, so the issue can be resolved by replacing that prefix:

ship_names <- names(ship)
ship_names <- gsub("^saq_covid_loop_", "q", ship_names, perl = TRUE)
names(ship) <- ship_names
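
The effect of the renaming can be illustrated on a small example. The data frames `ship_toy` and `meta_toy` and their columns are hypothetical; only the prefix and its replacement are taken from the code above.

```r
# Hypothetical study data with prefixed column names
ship_toy <- data.frame(
  saq_covid_loop_01  = c(2L, 3L),
  saq_covid_loop_02a = c(0L, 1L)
)

# Hypothetical metadata using the short names
meta_toy <- data.frame(VAR_NAMES = c("q01", "q02a"), stringsAsFactors = FALSE)

# Before renaming, both study data columns are unmatched
unmatched_before <- setdiff(names(ship_toy), meta_toy$VAR_NAMES)

# Replace the prefix at the start of each name with "q"
names(ship_toy) <- sub("^saq_covid_loop_", "q", names(ship_toy))

# After renaming, no column is unmatched
unmatched_after <- setdiff(names(ship_toy), meta_toy$VAR_NAMES)

length(unmatched_before)
length(unmatched_after)
```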

Incomplete metadata

A check on whether the study data are fully documented in the metadata shows that two variables are still not found in the metadata.

setdiff(names(ship), as.character(ship_meta$VAR_NAMES))
## [1] "id"        "intro_beg"

This has been fixed using the dataquieR::prep_add_to_meta() function, which adds the characteristics of a variable to the metadata.

ship_meta_m <- dataquieR::prep_add_to_meta(VAR_NAMES = "id",
                                           DATA_TYPE = "integer",
                                           LABEL = NA,
                                           VALUE_LABELS = NA,
                                           KEY_STUDY_SEGMENT = "Intro",
                                           meta_data = ship_meta)

ship_meta_m <- dataquieR::prep_add_to_meta(VAR_NAMES = "intro_beg",
                                           DATA_TYPE = "string",
                                           LABEL = NA,
                                           VALUE_LABELS = NA,
                                           KEY_STUDY_SEGMENT = "Intro",
                                           meta_data = ship_meta_m)
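
Once both variables are documented, repeating the check should return an empty set. The sketch below shows this on hypothetical objects (`study_toy`, `meta_toy`) rather than on the confidential study data.

```r
# Hypothetical study data and metadata after the two additions
study_toy <- data.frame(
  id        = 1:3,
  intro_beg = c("a", "b", "c"),
  q01       = c(2L, 3L, NA),
  stringsAsFactors = FALSE
)
meta_toy <- data.frame(
  VAR_NAMES = c("q01", "id", "intro_beg"),
  DATA_TYPE = c("integer", "integer", "string"),
  stringsAsFactors = FALSE
)

# Study data columns without a metadata entry
undocumented <- setdiff(names(study_toy), as.character(meta_toy$VAR_NAMES))

# Stop early if anything is still undocumented
stopifnot(length(undocumented) == 0)
```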

Integrity

shipc19_app <- dataquieR::pro_applicability_matrix(study_data = ship,
                                                   meta_data = ship_meta_m)
## Warning: In dataquieR::pro_applicability_matrix: Lost 2.9% of the meta data because of missing/not assignable study-data
## > dataquieR::pro_applicability_matrix(study_data = ship, meta_data = ship_meta_m)
## Found meta data for the following variables not found in the study data: "q12"

One variable (q12) has been specified in the metadata but was not found in the study data; this is 1 of 34 metadata rows, the 2.9% reported in the warning. It is possible that the variable has not been exported or that another error caused this issue. The respective metadata are excluded from further analyses.
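
One way to exclude such metadata is to keep only the rows whose VAR_NAMES occur in the study data. This is a base R sketch on hypothetical objects (`study_toy`, `meta_toy`), not necessarily the step used for this report.

```r
# Hypothetical metadata with one variable that was never exported
meta_toy <- data.frame(
  VAR_NAMES = c("q10", "q11", "q12"),
  DATA_TYPE = c("integer", "integer", "integer"),
  stringsAsFactors = FALSE
)
study_toy <- data.frame(q10 = c(3L, 2L), q11 = c(3L, 4L))

# Metadata entries without a matching study data column
orphaned <- setdiff(meta_toy$VAR_NAMES, names(study_toy))

# Share of metadata rows that cannot be assigned, in percent
pct_lost <- 100 * length(orphaned) / nrow(meta_toy)

# Keep only metadata rows that describe existing columns
meta_used <- meta_toy[meta_toy$VAR_NAMES %in% names(study_toy), , drop = FALSE]

orphaned
nrow(meta_used)
```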

shipc19_app$ApplicabilityPlot

Completeness

UM <- dataquieR::com_unit_missingness(study_data = ship,
                                      meta_data = ship_meta_m,
                                      id_vars = "id")

Unit missingness is found in 34 of 50 records, which corresponds to 68%.
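
As a simplified reading, a record counts as unit missing when every variable except the ID is NA. The sketch below applies this reading to hypothetical data (`study_toy`); it does not reproduce how dataquieR treats missing codes.

```r
# Hypothetical study data: records 2 and 5 have no values besides the ID
study_toy <- data.frame(
  id  = 1:5,
  q01 = c(2L, NA, NA, 3L, NA),
  q02 = c(0L, NA, 1L, NA, NA)
)

# All columns except the ID variable
item_cols <- setdiff(names(study_toy), "id")

# A record is unit missing if none of its items is observed
unit_missing <- rowSums(!is.na(study_toy[item_cols])) == 0

n_unit_missing   <- sum(unit_missing)
pct_unit_missing <- 100 * n_unit_missing / nrow(study_toy)

n_unit_missing
pct_unit_missing

# The percentage reported above follows the same arithmetic
100 * 34 / 50
```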

SM <- dataquieR::com_segment_missingness(study_data = ship,
                                         meta_data = ship_meta_m,
                                         threshold_value = 80,
                                         direction = "high")
## Warning: In dataquieR::com_segment_missingness: Specified VARIABLE_ROLE(s) were not found in metadata. All variables are included here.
## > dataquieR::com_segment_missingness(study_data = ship, meta_data = ship_meta_m, 
##     threshold_value = 80, direction = "high")
SM$SummaryPlot
## $SummaryPlot

IM <- dataquieR::com_item_missingness(study_data = ship,
                                      meta_data = ship_meta_m,
                                      cause_label_df = shipc19_mc,
                                      show_causes = TRUE,
                                      threshold_value = 32,
                                      include_sysmiss = TRUE)
## Warning: In dataquieR::com_item_missingness: Setting suppressWarnings to its default FALSE
## > dataquieR::com_item_missingness(study_data = ship, meta_data = ship_meta_m, 
##     cause_label_df = shipc19_mc, show_causes = TRUE, threshold_value = 32,
IM$SummaryPlot

kable(IM$SummaryTable, "html") %>%
 kable_styling(bootstrap_options = c("striped", "hover"))
Variables | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING
q01       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q02       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q02a      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02b      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02c      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02d      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02e      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02f      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02g      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02h      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02i      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02j      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02k      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02l      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q02m      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 16 (32)     | 0 (0)              | 1
q03       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q03a      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 10 (20)     | 6 (15)             | 1
q03b      | 50             | 34 (68)       | 16 (32)          | 0 (0)               | 16 (32)     | 0 (0)              | 1
q04       | 50             | 3 (6)         | 47 (94)          | 32 (64)             | 0 (0)       | 15 (30)            | 1
q04a      | 50             | 34 (68)       | 16 (32)          | 1 (2)               | 15 (30)     | 0 (0)              | 1
q05       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q06       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q07       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q07a      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 13 (26)     | 3 (8.11)           | 1
q07b      | 50             | 3 (6)         | 47 (94)          | 34 (68)             | 13 (26)     | 0 (0)              | 1
q08a      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q08b      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q08c      | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q09       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q10       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
q11       | 50             | 3 (6)         | 47 (94)          | 31 (62)             | 0 (0)       | 16 (32)            | 0
id        | 50             | 0 (0)         | 50 (100)         | 0 (0)               | 0 (0)       | 50 (100)           | 0
intro_beg | 50             | 34 (68)       | 16 (32)          | 0 (0)               | 0 (0)       | 16 (32)            | 0
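
The counts in each row of the summary table add up: Sysmiss plus Datavalues equals Observations, and Datavalues equals Missing codes plus Jumps plus Measurements (for q03a, 47 = 31 + 10 + 6). The sketch below reproduces this counting logic in base R on a hypothetical vector `x`. It uses the three codes listed earlier and is not the dataquieR implementation.

```r
# Codes as listed in the missing code section
missing_codes <- c(99999, 99977)
jump_codes    <- 99988

# Hypothetical item with measurements, codes and system missings (NA)
x <- c(1, 2, NA, 99999, 99988, 99977, 3, NA)

n_obs           <- length(x)
n_sysmiss       <- sum(is.na(x))           # NA values
n_datavalues    <- sum(!is.na(x))          # anything that is not NA
n_missing_codes <- sum(x %in% missing_codes)
n_jumps         <- sum(x %in% jump_codes)

# What remains after removing coded missings and jumps
n_measurements  <- n_datavalues - n_missing_codes - n_jumps

c(obs = n_obs, sysmiss = n_sysmiss, datavalues = n_datavalues,
  missing_codes = n_missing_codes, jumps = n_jumps,
  measurements = n_measurements)
```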

Consistency

IAC <- dataquieR::con_inadmissible_categorical(study_data = ship,
                                               meta_data = ship_meta_m)
## Warning: In dataquieR::con_inadmissible_categorical: All variables with VALUE_LABELS in the metadata are used.
## > dataquieR::con_inadmissible_categorical(study_data = ship, meta_data = ship_meta_m)
kable(IAC$SummaryTable, "html") %>%
 kable_styling(bootstrap_options = c("striped", "hover"))
Variables | OBSERVED_CATEGORIES | DEFINED_CATEGORIES | NON_MATCHING | NON_MATCHING_N | GRADING
q01       | 2, 3                | 1, 2, 3, 4, 5      |              | 0              | 0
q02       | 0                   | 0, 1               |              | 0              | 0
q02a      |                     | 0, 1               |              | 0              | 0
q02b      |                     | 0, 1               |              | 0              | 0
q02c      |                     | 0, 1               |              | 0              | 0
q02d      |                     | 0, 1               |              | 0              | 0
q02e      |                     | 0, 1               |              | 0              | 0
q02f      |                     | 0, 1               |              | 0              | 0
q02g      |                     | 0, 1               |              | 0              | 0
q02h      |                     | 0, 1               |              | 0              | 0
q02i      |                     | 0, 1               |              | 0              | 0
q02j      |                     | 0, 1               |              | 0              | 0
q02k      |                     | 0, 1               |              | 0              | 0
q02l      |                     | 0, 1               |              | 0              | 0
q02m      |                     | 0, 1               |              | 0              | 0
q03       | 1, 0                | 0, 1               |              | 0              | 0
q03a      | 0                   | 0, 1               |              | 0              | 0
q04       | 0                   | 0, 1               |              | 0              | 0
q05       | 0, 1                | 0, 1, 2            |              | 0              | 0
q06       | 0, 1                | 0, 1, 2            |              | 0              | 0
q07       | 0, 1                | 0, 1               |              | 0              | 0
q07a      | 1                   | 0, 1               |              | 0              | 0
q07b      |                     | 0, 1               |              | 0              | 0
q08a      | 2, 3, 1, 4          | 1, 2, 3, 4, 5      |              | 0              | 0
q08b      | 2, 1                | 1, 2, 3, 4, 5      |              | 0              | 0
q08c      | 1, 2                | 1, 2, 3, 4, 5      |              | 0              | 0
q09       | 3, 2, 1, 4          | 1, 2, 3, 4, 5      |              | 0              | 0
q10       | 3, 2, 4             | 1, 2, 3, 4, 5      |              | 0              | 0
q11       | 3, 4, 2             | 1, 2, 3, 4, 5      |              | 0              | 0
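
The check behind this table compares the categories observed in the data with those defined in the metadata. A minimal base R sketch of that comparison follows. The helper `non_matching()` and the example vectors are hypothetical, and the sketch assumes that missing and jump codes have already been removed.

```r
# Categories defined in the metadata for a five-level item
defined <- c(1, 2, 3, 4, 5)

# Hypothetical observations: one admissible, one with an undefined value 7
observed_ok  <- c(2, 3, 3, NA, 2)
observed_bad <- c(2, 7, 3, NA, 7)

# Observed categories that are not among the defined ones (NA is ignored)
non_matching <- function(values, defined) {
  setdiff(unique(values[!is.na(values)]), defined)
}

res_ok  <- non_matching(observed_ok, defined)
res_bad <- non_matching(observed_bad, defined)

# Number of observations carrying an inadmissible category
n_bad <- sum(observed_bad %in% res_bad)

length(res_ok)
res_bad
n_bad
```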

Accuracy

DIS <- dataquieR::acc_distributions(study_data = ship,
                                    meta_data = ship_meta_m)
## Warning: In dataquieR::acc_distributions: All variables defined to be integer or float in the metadata are used
## > dataquieR::acc_distributions(study_data = ship, meta_data = ship_meta_m)
## Warning: In dataquieR::acc_distributions: Variables q02a, q02b, q02c, q02d, q02e, q02f, q02g, q02h, q02i, q02j, q02k, q02l, q02m, q07b contain NAs only and will be removed from analyses.
## > dataquieR::acc_distributions(study_data = ship, meta_data = ship_meta_m)
## Warning: In dataquieR::acc_distributions: Variables q02, q03a, q04, q07a contain only one value and will be removed from analyses.
## > dataquieR::acc_distributions(study_data = ship, meta_data = ship_meta_m)

Distributional plots for all data including true measurements:

Conclusion

This example of a data quality report for a COVID-19 use case provides limited information for two reasons:

  1. the study is ongoing and the current number of observations is very limited
  2. the associated metadata are not sufficient to apply important data quality assessments.

For example, no short LABELS are defined; only long labels are available, which contain the full wording of each questionnaire item:

Defining shorter labels, as has been done in the following example, increases the interpretability of such reports considerably:

shipc19_app <- dataquieR::pro_applicability_matrix(study_data = ship,
                                                   meta_data = ship_meta_m2,
                                                   label_col = LABEL)
## Warning: In dataquieR::pro_applicability_matrix: Lost 6.1% of the study data because of missing/not assignable meta-data
## > dataquieR::pro_applicability_matrix(study_data = ship, meta_data = ship_meta_m2, 
##     label_col = LABEL)
## Did not find any meta data for the following variables from the study data: "id", "intro_beg"
## Warning: In dataquieR::pro_applicability_matrix: Lost 3.1% of the meta data because of missing/not assignable study-data
## > dataquieR::pro_applicability_matrix(study_data = ship, meta_data = ship_meta_m2, 
##     label_col = LABEL)
## Found meta data for the following variables not found in the study data: "q12"
shipc19_app$ApplicabilityPlot
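
Short labels can be attached to the metadata with a simple lookup. The sketch below is one possible approach on hypothetical metadata (`meta_toy`); the long question texts and the short labels are invented placeholders, not the SHIP-C19 wording, and the column name LONG_LABEL is an assumption.

```r
# Hypothetical metadata with long labels only
meta_toy <- data.frame(
  VAR_NAMES  = c("q01", "q02", "q03"),
  LONG_LABEL = c("Invented long question text A",
                 "Invented long question text B",
                 "Invented long question text C"),
  stringsAsFactors = FALSE
)

# Invented short labels, keyed by variable name
short_labels <- c(q01 = "ITEM_A", q02 = "ITEM_B")

# Look up the short label for each variable
meta_toy$LABEL <- unname(short_labels[meta_toy$VAR_NAMES])

# Fall back to the variable name where no short label was defined
no_label <- is.na(meta_toy$LABEL)
meta_toy$LABEL[no_label] <- meta_toy$VAR_NAMES[no_label]

# Labels should be unique so that each one identifies a single variable
stopifnot(!anyDuplicated(meta_toy$LABEL))

meta_toy[, c("VAR_NAMES", "LABEL")]
```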