APPROACH

This approach flags a contradiction when an impossible combination of data values is observed within one participant. For example, if the age of a participant is recorded repeatedly, its value can (unfortunately) not decline. Most contradictions rest on the comparison of two variables.

Note that each value used in a comparison may, on its own, represent a possible characteristic; only the combination of the two values is considered impossible. The approach does not address implausible or inadmissible values.
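As an illustration only, the following sketch flags such an impossible combination in base R. The data frame `toy` and its values are invented for this example; only the labels AGE_0 and AGE_1 are borrowed from the example data.

```r
# Toy data (invented values): age at baseline and age at follow-up
toy <- data.frame(
  AGE_0 = c(49, 47, 50, 56),
  AGE_1 = c(51, 45, 52, 56)
)

# Each value alone is possible; only the combination AGE_1 < AGE_0 is not
toy$age_contradiction <- toy$AGE_1 < toy$AGE_0

toy$age_contradiction
sum(toy$age_contradiction)  # number of contradictory observations
```

Only the second observation is flagged, although 47 and 45 are both admissible ages.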

ALGORITHM OF THIS IMPLEMENTATION:

  1. Select all variables in the data with defined contradiction rules (static metadata column CONTRADICTIONS)
  2. Remove missing codes from the study data (if defined in the metadata)
  3. Remove measurements deviating from limits defined in the metadata
  4. Assign labels to the levels of categorical variables (if applicable)
  5. Apply contradiction checks on predefined sets of variables
  6. Identify measurements that fulfil contradiction rules. Two output data frames are generated:
    • on the level of observation to flag each contradictory value combination, and
    • a summary table for each contradiction check.
  7. A summary plot illustrating the number of contradictions is generated.
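
Steps 2 and 3 can be sketched in base R as follows. The helpers `remove_missing_codes()` and `apply_hard_limits()` are hypothetical names introduced for this illustration and are not dataquieR functions; the data values are invented, while the missing codes and limits follow the ARM_CUFF_0 entry of the example metadata.

```r
# Hypothetical helpers, not part of dataquieR

# Step 2: replace missing codes by NA
remove_missing_codes <- function(x, missing_codes) {
  x[x %in% missing_codes] <- NA
  x
}

# Step 3: replace values outside the closed interval [lower; upper] by NA
apply_hard_limits <- function(x, lower = -Inf, upper = Inf) {
  x[!is.na(x) & (x < lower | x > upper)] <- NA
  x
}

# Invented values for a variable with missing codes 99980 | 99987
# and hard limits [1;3]
arm_cuff <- c(1, 2, 99980, 3, 5, 99987, NA)

arm_cuff <- remove_missing_codes(arm_cuff, c(99980, 99987))
arm_cuff <- apply_hard_limits(arm_cuff, lower = 1, upper = 3)

arm_cuff
```

Removing such values before the checks are applied keeps missing codes and inadmissible values from being counted as contradictions.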

Example of study data

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

This example of study data has N = 3000 observations. The study data variables have abstract, non-interpretable names; appropriate labels must be mapped from the metadata. The study comprises the following characteristics:

  • Age at baseline + age during follow-up
  • Sex + sex during follow-up
  • Education + education during follow-up
  • Eating preferences
  • Weekly meat consumption
  • Smoking
  • Shopping behavior regarding tobacco products
  • Circumference of upper arm
  • Used arm cuff for blood pressure measurement
  • Pregnancy status of women
  • Some medication

v00000  v00001    v00002  v00003  v00004  v00005  v01003  v01002  v00103  v00006
3       LEIIX715  0       49      127     77      49      0       40-49   3.8
1       QHNKM456  0       47      114     76      47      0       40-49   1.9
1       HTAOB589  0       50      114     71      50      0       50-59   0.8
5       HNHFV585  0       48      120     65      48      0       40-49   3.8
1       UTDLS949  0       56      119     78      56      0       50-59   4.1
5       YQFGE692  1       47      133     81      47      1       40-49   9.5
1       AVAEH932  0       53      114     78      53      0       50-59   5.0
3       QDOPT378  1       48      116     86      48      1       40-49   9.6
3       BMOAK786  0       44      115     71      44      0       40-49   2.0
5       ZDKNF462  0       50      116     74      50      0       50-59   2.4
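
Because the variable names are abstract, labels have to be mapped from the metadata. A minimal sketch of such a mapping in base R is given here; `toy_sd` and `toy_md` are invented stand-ins for the study data and metadata, and the assignment of v00002 to SEX_0 and v00003 to AGE_0 follows the example metadata.

```r
# Toy study data with two of the variables shown above
toy_sd <- data.frame(
  v00002 = c(0, 0, 0, 0, 0, 1),
  v00003 = c(49, 47, 50, 48, 56, 47)
)

# Minimal metadata with the columns VAR_NAMES and LABEL
toy_md <- data.frame(
  VAR_NAMES = c("v00002", "v00003"),
  LABEL     = c("SEX_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# Look up the label for every column name of the study data
labels <- toy_md$LABEL[match(names(toy_sd), toy_md$VAR_NAMES)]

# Rename only those columns for which a label was found
names(toy_sd)[!is.na(labels)] <- labels[!is.na(labels)]

names(toy_sd)
```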

Example of metadata

Data from the package dataquieR are loaded as shown below:

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

Information corresponding to the study data is kept in the table of static metadata, which also attaches an interpretable label to each variable. Besides the data type and label of each variable, further expected characteristics are stored in the metadata.

For the following implementation, the metadata columns CONTRADICTIONS, MISSING_LIST, VALUE_LABELS, and HARD_LIMITS are particularly relevant.

The column CONTRADICTIONS contains only the IDs of explicit contradiction checks. The respective definitions can be written directly in the metadata, but we recommend using the associated ShinyApp (Chang et al. 2018, Potter et al. 2016). See also [Definition of contradictions].
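
The example metadata store missing codes and value labels as strings whose entries are separated by `|`. The following sketch shows one way to turn such strings into R vectors. The helpers `parse_missing_list()` and `parse_value_labels()` are hypothetical and not part of dataquieR. The implementation refers to a constant SPLIT_CHAR whose value is not shown in this document; the sketch assumes `"|"`.

```r
# Assumption: entries are separated by "|"
split_char <- "|"

# Hypothetical helper: "99980 | 99987" -> c(99980, 99987)
parse_missing_list <- function(x) {
  if (is.na(x)) return(numeric(0))
  as.numeric(trimws(unlist(strsplit(x, split_char, fixed = TRUE))))
}

# Hypothetical helper:
# "0 = females | 1 = males" -> c("0" = "females", "1" = "males")
parse_value_labels <- function(x) {
  if (is.na(x)) return(character(0))
  parts  <- trimws(unlist(strsplit(x, split_char, fixed = TRUE)))
  codes  <- trimws(sub("=.*$", "", parts))     # text before the first "="
  labels <- trimws(sub("^[^=]*=", "", parts))  # text after the first "="
  stats::setNames(labels, codes)
}

parse_missing_list("99980 | 99987")
parse_value_labels("0 = females | 1 = males")
```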

    VAR_NAMES  LABEL       MISSING_LIST   VALUE_LABELS                                 HARD_LIMITS                 CONTRADICTIONS
3   v00002     SEX_0       NA             0 = females | 1 = males                      NA                          1002
4   v00003     AGE_0       NA             NA                                           [18;Inf)                    [...]
16  v00010     ARM_CUFF_0  99980 | 99987  1 = (-Inf,20] | 2 = (20,30] | 3 = (30, Inf]  [1;3]                       1009
19  v00013     EXAM_DT_0   NA             NA                                           [2018-01-01 00:00:00 CET;)  [...]

[...]

  a_lev <- unlist(strsplit(a_lev, SPLIT_CHAR, fixed = TRUE))
  a_lev <- trimws(a_lev)

  b_lev <- gsub("'", "", ct$B_levels[ct$ID == cl[i]])
  b_lev <- unlist(strsplit(b_lev, SPLIT_CHAR, fixed = TRUE))
  b_lev <- trimws(b_lev)
  # apply check
  summary_df1[i + 1] <-
    contradiction_functions[[check]](study_data = ds1_ll,
    A = paste(ct$A[ct$ID == cl[i]]),
    A_levels = a_lev,
    A_value = ct$A_value[ct$ID == cl[i]],
    B = paste(ct$B[ct$ID == cl[i]]),
    B_levels = b_lev,
    B_value = ct$B_value[ct$ID == cl[i]]
  )

  # summarize checks
  summary_df2[i, 1] <- cl[i]
  summary_df2[i, 2] <- check
  summary_df2[i, 3] <- paste0(
    "A is: ", ct$A[ct$ID == cl[i]], "; ",
    "B is: ", ct$B[ct$ID == cl[i]]
  )
  summary_df2[i, 4] <- paste(a_lev, collapse = SPLIT_CHAR)
  summary_df2[i, 5] <- paste(b_lev, collapse = SPLIT_CHAR)
  summary_df2[i, 6] <- sum(summary_df1[, i + 1], na.rm = TRUE)
  summary_df2[i, 7] <- sum(summary_df1[, i + 1], na.rm = TRUE) /
    dim(ds1)[1] * 100
  summary_df2[i, 8] <- ifelse(summary_df2[i, 7] > threshold_value, 1, 0)
  summary_df2[i, 9] <- ct$Label[ct$ID == cl[i]]
}

summary_df2$Percent <- round(summary_df2$Percent, digits = 2)

names(summary_df2) <- c(
  "Check ID", "Check type", "Variables A and B", "A Levels",
  "B Levels", "Contradictions (N)", "Contradictions (%)",
  "GRADING", "Label"
)

summary_df2$GRADING <- ordered(summary_df2$GRADING)

x <- util_as_numeric(reorder(summary_df2[, 1], -summary_df2[, 1]))
lbs <- as.character(reorder(summary_df2[, 9], -summary_df2[, 1]))
# plot summary_df2
p <- ggplot(summary_df2, aes_(x = ~x, y = ~ summary_df2[, 7], fill =
                                ~ as.ordered(GRADING))) +
  geom_bar(stat = "identity") +
  geom_text(
    y = round(summary_df2[, 7], 1) + 0.5,
    label = paste0(round(summary_df2[, 7], digits = 2), "%")
  ) +
  scale_fill_manual(values = cols, name = " ", guide = "none") +
  theme_minimal() +
  xlab("IDs of applied checks") +
  scale_y_continuous(name = "(%)",
                     limits = (c(0, max(summary_df2[, 7]) + 1))) +
  scale_x_continuous(breaks = x, sec.axis =
                       sec_axis(~., breaks = x, labels = lbs)) +
  geom_hline(yintercept = threshold_value, color = "red", linetype = 2) +
  coord_flip() +
  theme(text = element_text(size = 20))

# create SummaryTable object
st1 <- summary_df2
st1$`Variables A and B` <- gsub("A is: ", "", st1$`Variables A and B`)
st1$`Variables A and B` <- gsub("B is: ", "", st1$`Variables A and B`)
st1$Variables <- unlist(lapply(st1$`Variables A and B`,
                               function(x) unlist(strsplit(x, ";",
                                                           fixed =
                                                             TRUE))[1]))
st1$`Reference variable` <- unlist(lapply(st1$`Variables A and B`,
                                          function(x) unlist(
                                            strsplit(x, ";", fixed =
                                                       TRUE))[2]))
st1$`Variables A and B` <- NULL
st1 <- st1[, c(9, 10, 1:8)]
#st1 <- dplyr::rename(st1, c("GRADING" = "Grading"))

suppressWarnings({
  # suppress wrong warnings: https://github.com/tidyverse/ggplot2/pull/4439/commits
  # find out size of the plot https://stackoverflow.com/a/51795017
  bp <- ggplot_build(p)
  w <- 2 * length(bp$layout$panel_params[[1]]$x$get_labels())
  if (w == 0) {
    w <- 10
  }
  w <- w + 2 +
    max(nchar(bp$layout$panel_params[[1]]$y$get_labels()),
        na.rm = TRUE)
  w <- w +
    max(nchar(bp$layout$panel_params[[1]]$y.sec$get_labels()),
        na.rm = TRUE)
  h <- 2 * length(bp$layout$panel_params[[1]]$y$get_labels())
  if (h == 0) {
    h <- 10
  }
  h <- h + 15
  p <- util_set_size(p, width_em = w, height_em = h)
})

# Output
return(list(
  FlaggedStudyData = summary_df1,
  SummaryTable = st1,
  SummaryData = summary_df2,
  SummaryPlot = p
))

}




## Implementation and use of thresholds

The implementation above uses a threshold based on percentages (0-100). Specification of the *threshold_value* is mandatory.
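
The grading rule can be reproduced with a few lines of base R. This is a sketch of the arithmetic only: the counts are taken from the summary table for checks 1001, 1003 and 1005, and the threshold of 1 percent is the one used in the call in this document.

```r
n_obs <- 3000          # number of observations in the example study data
threshold_value <- 1   # threshold in percent

# Numbers of contradictions for three of the checks
n_contradictions <- c("1001" = 150, "1003" = 7, "1005" = 19)

# Percentage of observations with a contradiction
percent <- n_contradictions / n_obs * 100

# GRADING is 1 if the percentage exceeds the threshold, otherwise 0
grading <- ifelse(percent > threshold_value, 1, 0)

round(percent, digits = 2)
grading
```

With this threshold only check 1001 (5 percent) is graded 1; checks 1003 and 1005 stay below 1 percent and are graded 0.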

## Call of the R-function


```r
AnyContradictions <- con_contradictions(study_data      = sd1,
                                        meta_data       = md1,
                                        label_col       = "LABEL",
                                        check_table     = checks,
                                        threshold_value = 1)
## Labels of variables from "LABEL" will be used. In this case columns A and B in check_tables must refer to labels.
## Warning: In con_contradictions: All variables with CONTRADICTIONS in the metadata are used.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
## Warning: In con_contradictions: N = 3 values in EDUCATION_1 have been above HARD_LIMITS and were removed.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
## Warning: In con_contradictions: N = 24 values in SMOKE_SHOP_0 have been above HARD_LIMITS and were removed.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
## Warning: In con_contradictions: Variables: AGE_0, AGE_1, EXAM_DT_0, LAB_DT_0 have no assigned labels and levels.
## > con_contradictions(study_data = sd1, meta_data = md1, label_col = "LABEL", 
##     check_table = checks, threshold_value = 1)
```

OUTPUT

Output 1: FlaggedStudyData

This implementation returns four objects. The data frame FlaggedStudyData flags each observation in the study data that has one or more contradictions between different variables. For each applied check, an additional column (named with the ID of the check) is added. The object can be accessed via AnyContradictions$FlaggedStudyData.
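
The flag columns can be combined to count contradictions per observation. The sketch below uses an invented stand-in for AnyContradictions$FlaggedStudyData. It assumes one row per observation, one 0/1 column per check ID, and an identifier column `ID`; the exact layout of the real object may differ.

```r
# Toy stand-in for AnyContradictions$FlaggedStudyData (invented values)
flagged <- data.frame(
  ID     = c("LEIIX715", "QHNKM456", "HTAOB589", "HNHFV585"),
  `1001` = c(0, 1, 0, 0),
  `1002` = c(0, 1, 0, 1),
  `1003` = c(0, 0, 0, NA),
  check.names = FALSE
)

# All columns except the identifier are treated as check flags
check_cols <- setdiff(names(flagged), "ID")

# Contradictions per observation; checks that could not be evaluated (NA)
# are ignored
flagged$n_contradictions <- rowSums(flagged[, check_cols], na.rm = TRUE)

# Observations with at least one contradiction
flagged[flagged$n_contradictions > 0, c("ID", "n_contradictions")]
```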

Output 2: Summary table 1

The second output of the contradiction function is a data frame that summarizes the number of contradictions for each examined variable. This object is primarily used by the dataquieR function dq_report to summarize information on all examined variables.

Variables       Reference variable  Check ID  Check type                A Levels    B Levels                                   Contradictions (N)  Contradictions (%)  GRADING  Label
AGE_1           AGE_0               1001      A_less_than_B_vv          NA          NA                                         150                 5.00                1        Age follow-up
SEX_1           SEX_0               1002      A_not_equal_B_vv          NA          NA                                         150                 5.00                1        Sex follow-up
EDUCATION_1     EDUCATION_0         1003      A_less_than_B_vv          NA          NA                                         7                   0.23                0        Education follow-up
EATING_PREFS_0  MEAT_CONS_0         1004      A_levels_and_B_levels_ll  vegetarian  1-2d a week|3-4d a week|5-6d a week|daily  54                  1.80                1        Nutrition inconsistency vegetarian
EATING_PREFS_0  MEAT_CONS_0         1005      A_levels_and_B_levels_ll  vegan       1-2d a week|3-4d a week|5-6d a week|daily  19                  0.63                0        Nutrition inconsistency vegan
EATING_PREFS_0  MEAT_CONS_0         1006      A_levels_and_B_levels_ll  none        never                                      64                  2.13                1        Nutrition inconsistency


Output 3: Summary table 2

The third output summarizes this information in a similar way but also names the applied checks. It can be used to provide an executive overview of the number of contradictions.

Check ID  Check type                  Variables A and B                        A Levels    B Levels                                   Contradictions (N)  Contradictions (%)  GRADING  Label
1001      A_less_than_B_vv            A is: AGE_1; B is: AGE_0                 NA          NA                                         150                 5.00                1        Age follow-up
1002      A_not_equal_B_vv            A is: SEX_1; B is: SEX_0                 NA          NA                                         150                 5.00                1        Sex follow-up
1003      A_less_than_B_vv            A is: EDUCATION_1; B is: EDUCATION_0     NA          NA                                         7                   0.23                0        Education follow-up
1004      A_levels_and_B_levels_ll    A is: EATING_PREFS_0; B is: MEAT_CONS_0  vegetarian  1-2d a week|3-4d a week|5-6d a week|daily  54                  1.80                1        Nutrition inconsistency vegetarian
1005      A_levels_and_B_levels_ll    A is: EATING_PREFS_0; B is: MEAT_CONS_0  vegan       1-2d a week|3-4d a week|5-6d a week|daily  19                  0.63                0        Nutrition inconsistency vegan
1006      A_levels_and_B_levels_ll    A is: EATING_PREFS_0; B is: MEAT_CONS_0  none        never                                      64                  2.13                1        Nutrition inconsistency
1007      A_levels_and_B_levels_ll    A is: SMOKING_0; B is: SMOKE_SHOP_0      no          1-2d a week|3-4d a week|5-6d a week|daily  91                  3.03                1        Non-smokers inconsistency
1008      A_levels_and_B_levels_ll    A is: SMOKING_0; B is: SMOKE_SHOP_0      yes         never                                      118                 3.93                1        Smokers inconsistency
1009      A_not_equal_B_vv            A is: ARM_CIRC_DISC_0; B is: ARM_CUFF_0  NA          NA                                         173                 5.77                1        Blood pressure false cuff
1010      A_levels_and_B_gt_value_lc  A is: PREGNANT_0; B is: AGE_0            yes         NA                                         5                   0.17                0        Pregnancy high age
1011      A_less_than_B_vv            A is: LAB_DT_0; B is: EXAM_DT_0          NA          NA                                         116                 3.87                1        LAB before MEX
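
The check types named in the table can be read as simple comparisons of two variables. The functions below are hypothetical re-implementations, inferred only from the names of the check types and the levels listed in the table; they are not the code used by dataquieR. All data values are invented.

```r
# Hypothetical re-implementations, inferred from the check type names

# A_less_than_B_vv: the value of A is smaller than the value of B
a_less_than_b <- function(a, b) as.integer(a < b)

# A_not_equal_B_vv: the values of A and B differ
a_not_equal_b <- function(a, b) as.integer(a != b)

# A_levels_and_B_levels_ll: A takes one of a_levels while B takes one of b_levels
a_levels_and_b_levels <- function(a, a_levels, b, b_levels) {
  as.integer(a %in% a_levels & b %in% b_levels)
}

# Invented toy values
age_1  <- c(51, 45, 52)
age_0  <- c(49, 47, 50)
eating <- c("vegetarian", "none", "vegan")
meat   <- c("daily", "never", "never")

# Like check 1001: age at follow-up below age at baseline
a_less_than_b(age_1, age_0)

# Like check 1004: vegetarians who report eating meat
a_levels_and_b_levels(eating, "vegetarian", meat,
                      c("1-2d a week", "3-4d a week", "5-6d a week", "daily"))

# Like check 1006: no eating preference, but meat is never consumed
a_levels_and_b_levels(eating, "none", meat, "never")
```

Each function returns one 0/1 flag per observation, which corresponds to the per-check columns that are summed into Contradictions (N).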

Output 4: Summary plot

The fourth output visualizes the summarized information of outputs 2 and 3.

AnyContradictions$SummaryPlot

INTERPRETATION

Any contradiction in the study data should be resolved by appropriate data curation steps.

Concept relations

Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., et al. (2018). Shiny: Web application framework for R, 2015. R Package Version 1, 14.
Potter, G., Wong, J., Alcaraz, I., Chi, P., et al. (2016). Web application teaching tools for statistics using R and Shiny. Technology Innovations in Statistics Education 9.