1 University Medicine of Greifswald

Correspondence: Stephan Struckmann <>

INTRODUCTION

Code developed in teams, especially if intended to be used by a larger community, needs to be comprehensible for all team members and, ideally, for anyone trying to use that code. Here we provide some conventions that resulted from the first project phase.

The document is organized as follows:

  1. Prerequisites: summarizes major results of the project that affect the use and functionality of R code
  2. Conventions: states the major requirements for R code
  3. Appendix: contains complete lists of the required tables

Since the data quality project that produced these conventions uses R as the main programming language for statistical computations, all examples and wording refer to R standards (see the R documentation).

PREREQUISITES

Data sources

The concept distinguishes two types of data sources:

  1. Study data:

    1. Clinical data: Measurements (organized within variables) intended to be subject to data quality assessments

    2. Process information: all data providing information on the measurement process such as time, ambient variables, the respective device or examiner.

  2. Meta data:

    1. The expected characteristics of the study data at the level of each variable, for example labels, limits, or missing codes, as well as the assignment of the respective process information organized in variables.

    2. Further tables with referenced descriptions, such as the labels of missing codes.

For further information see Richter et al.

In this concept, study data and meta data have a 1:1 correspondence, i.e. each column in the study data is identifiable in the meta data (see the example structures below).

Table: Example study and meta data structures
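The example tables are not reproduced here; a minimal sketch of such matching structures (all names illustrative) could look as follows:

# each column of study_data has exactly one describing row in meta_data
study_data <- data.frame(
  v00001 = c(120, 135, NA),   # e.g. systolic blood pressure
  v00002 = c( 80,  85, 90)    # e.g. diastolic blood pressure
)
meta_data <- data.frame(
  VAR_NAMES = c("v00001", "v00002"),
  LABEL     = c("SBP_0", "DBP_0"),
  stringsAsFactors = FALSE
)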

R-concept

Linking data sources

Using process information

So-called process variables store meta data about the measurement process. The content of these variables represents measurements and is therefore stored with the study data. The names of the variables to use are passed in a function argument. Process variable names can also be stored in the attributes of a study variable. Such variable attributes referring to other variables are usually prefixed by KEY_. Some such key attributes are listed in the table below. The wrapper function pipeline_vectorized automatically extracts this information from the meta data and provisions parallel function calls with the respective function arguments. It is primarily used for calling functions of the dimension Accuracy for a set of variables at once, because this dimension’s functions frequently

  • need process variables
  • are univariate implementations
  • are computationally demanding

For brevity, we here present pseudo-code:

my_function_4 <- function(  resp_var,
                            group_vars,
                            study_data,
                            meta_data
                          ) {
  s_data     <- study_data[ , resp_var ]
  group_data <- study_data[ , group_vars ]
  # ...
}

Calling this function using pipeline_vectorized would work as follows:

named_list_of_results <-
  pipeline_vectorized(fct            = my_function_4,
                      resp_vars      = c("SBP_2", "DBP_2", "HF_2"),
                      study_data     = study_data,
                      meta_data      = meta_data,
                      label_col      = LABEL,
                      args_from_meta = c(group_vars = KEY_OBSERVER),
                      mc.cores       = 4)
# results are a named list of the univariate results:
named_list_of_results$SBP_2
named_list_of_results$HF_2
named_list_of_results$DBP_2

Later, this may also be extended by using classes for the variable-based function arguments:

my_function_5 <- function(  resp_var,
                            group_vars,
                            study_data,
                            meta_data
                          ) {
  s_data     <- study_data[ , resp_var ]
  if (inherits(group_vars, 'process_var_att')) {
    group_data <- study_data[ , subset(meta_data, VAR_NAMES == resp_var, group_vars, drop = TRUE) ]
  } else {
    group_data <- study_data[ , group_vars ]
  }
# ...
}
proc_var <- function(x) {
  class(x) <- 'process_var_att'
  x
}

my_function_5( 
  'SBP_0',
  proc_var('KEY_OBSERVER'),
  study_data,
  meta_data
)

Usability of functions

All implementations of the project were developed to be applied alone or in a vectorized reporting pipeline. …

R-functions

Data quality (DQ) implementations

Functions addressing variables are divided into two sub-types: those addressing one variable only (univariate) and those addressing many variables (multivariate). For the univariate functions, vectorisation can be performed using the pipeline_vectorized function, provided that they follow these conventions. Such functions perform calculations on the study data to detect quality issues.

Reporting functions

In large studies there are thousands of variables, so quality assurance officers need some guidance to find problematic variables quickly without going through a huge per-variable QA report. Therefore, so-called aggregation functions are introduced that work on the output of the functions that address variables (the indicator functions).

Technically, such aggregation functions differ from the indicator functions in that their input is the output of other functions. They therefore depend on a sound definition of the primary functions’ output; in particular, the indicator functions have to provide output that rates a specific data quality aspect of one or more study variables. To give an example, such a function could calculate the percentage of a study section’s variables displaying a relevant number of missing values, as sketched below.
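A minimal sketch of such an aggregation function (the input format, a data frame with the columns Variables and Missings_pct, is an assumption for illustration):

# a sketch only, not a dataquieR function: percentage of variables
# whose percentage of missing values exceeds a threshold
agg_missings <- function(indicator_output, threshold = 10) {
  100 * sum(indicator_output$Missings_pct > threshold) /
    nrow(indicator_output)
}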

Helper functions

Functions that do not directly address QA issues but perform consistency checks, data preparation, pipelining, and other auxiliary tasks are called helper functions and are described in the section Use of Helper Functions.

Small example

Example Study Data

Example Meta Data

# function calls:
my_function_1(resp_vars = colnames(study_data), co_vars = character(0), group_vars = NA,
              label_col = 'LABEL', study_data = study_data, meta_data = meta_data)
try(
  my_function_2(resp_vars = colnames(study_data), co_vars = character(0), group_vars = NA,
              label_col = 'LABEL', study_data = study_data, meta_data = meta_data) # expect to stop
)
## Error in my_function_2(resp_vars = colnames(study_data), co_vars = character(0),  :
##   my_function_2 cannot handle more than one variable at once.

CONVENTIONS

R code

Style

R code should be structured as follows (derived from http://style.tidyverse.org/ (Hadley Wickham), and inspired by https://google.github.io/styleguide/Rguide.xml):

# required packages/code should be specified prior to user-defined functions -----------------------
library(ggplot2)

# source required functions prior to the function --------------------------------------------------
# such code will be later embedded in the R-package
source("some_other_function.R")

my_function <- function(x, formal_1, formal_n) {

   # start with all checks that safeguard applicability of the function ----------------
   if (missing(x) || length(x) == 0L || mode(x) != "numeric")
      stop("'x' must be a non-empty numeric vector")
   if (missing(formal_1) || missing(formal_n))
      stop("'formal_1' and 'formal_n' must be specified")
  
   # main body of the function ---------------------------------------------------------
   x_mod <- ... # …   
   
   
   
   # call of nested function -----------------------------------------------------------
   result <- some_other_function(x_mod)

   # the output ------------------------------------------------------------------------
   return(result)
   
}

Since the targeted output is an R library (namely dataquieR), library and source should only be used during the internal drafting of code. In the R package, external libraries must be listed in the DESCRIPTION file of the package (generally in its Imports section) and can be imported into the package namespace using roxygen2 comments.
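For example, a package function using ggplot2 would have ggplot2 listed in the Imports section of DESCRIPTION and could import the needed symbols via a roxygen2 comment (a sketch; the function is illustrative):

#' @importFrom ggplot2 ggplot aes geom_bar
my_plot_function <- function(df) {
  # the imported ggplot2 symbols are available via the package
  # namespace; no library(ggplot2) call is needed here
  ggplot(df, aes(x = x, y = y)) +
    geom_bar(stat = "identity")
}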

Function definitions

To ensure a generic usability of R-scripts, they should be organised in functions whose input arguments must not be handled in a static fashion. This is necessary, because the names and the number of variables, meta data attributes, process variables as well as the names of the data frames are not known a priori. All functions must be able to handle whatever variables and data sets are used, as long as these meet some structural preconditions as outlined above.

This comprises:

  • No hard coded variable names

  • No hard coded expected lengths of variable lists

  • No hard coded data frame names

  • No function embedded meta data attributes

  • No hard coded thresholds for decision making of quality assessments

To avoid misunderstandings: hard coded names of meta data attribute fields must be used to properly retrieve related information; see the sketch below. All information necessary to run the scripts is transferred via an appropriate function call.
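A sketch contrasting the two cases:

# not acceptable: a hard coded variable name
sbp <- study_data[["v00001"]]

# acceptable: the variable name comes from the function argument
# resp_var; only the meta data attribute field name LABEL is hard coded
x     <- study_data[[resp_var]]
label <- meta_data[meta_data$VAR_NAMES == resp_var, "LABEL"]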

Formals and arguments

We intentionally do not use the synonymous term “function parameters” to avoid ambiguities regarding the statistical term parameter related to probability distributions.

In the following, we give a table listing standardised function argument names. Functions can have additional arguments, but for the ones listed below, conventions exist. Two of them are mandatory.

In the table above, two arguments (study_data and meta_data) are mandatory for all indicator functions. They are declared to be data frames, which are explained in the following. The table of function arguments mostly lists arguments referring to study or process variables.

There may be additional arguments, such as certain threshold values or arguments affecting the format of the generated output, like specific colours or fonts. The latter should not be part of the functions in the future, because the output, including ggplot2 plots, can be formatted later. The types of additional arguments depend on the specific use cases. Thresholds may also be generalised later, so using threshold arguments is not recommended in favour of returning filterable results.

All function arguments are user input, so they have to be verified carefully.

Checks of formals and arguments

For arguments referring to study variables, there is a family of utility functions: util_correct_variable_use and util_correct_variable_use2. These can check input arguments that refer to variable names. Some examples:

util_correct_variable_use("resp_vars",                              # check function argument resp_vars
                           allow_null          = TRUE,              # allow resp_vars being NULL
                           allow_more_than_one = TRUE,              # allow more than one entry in resp_vars
                           allow_any_obs_na    = TRUE,              # allow resp_vars in study_data contain NAs (see stats::na.fail)
                           need_type           = "integer | float") # allow variabes of metadata-declared types integer or float

util_correct_variable_use("group_vars",                   # check function argument group_vars
                          allow_null          = TRUE,     # allow group_vars being NULL
                          allow_more_than_one = TRUE,     # allow more than one entry in group_vars
                          allow_any_obs_na    = TRUE,     # allow group_vars in study_data contain NAs (see stats::na.fail)
                          need_type           = "!float") # allow variabes of all possible  metadata-declared types except float

Please refer to the full documentation of util_correct_variable_use / util_correct_variable_use2 for an exhaustive reference.

Note that util_correct_variable_use* are utility functions and hence intended for package-internal use only. The package dataquieR does not export these functions; they will only be found if called from within that package or if called explicitly with the disadvised :::-operator during development. When drafting functions, we recommend importing all used utility functions into the global environment as follows:

util_correct_variable_use <- dataquieR:::util_correct_variable_use

Robustness checks

Checks for arguments not referring to variables can be performed using standard R functions such as is.numeric, na.fail, missing, is.null, length, stopifnot, and inherits. Be careful with is.integer: this function checks the declared storage type but not the actual values of a vector:

a <- 12
is.integer(a)
## [1] FALSE
b <- as.integer(12)
is.integer(b)
## [1] TRUE
a == b
## [1] TRUE
identical(a, b)
## [1] FALSE
str(a)
##  num 12
str(b)
##  int 12

Therefore, we have included a utility function as proposed in the manual page of is.integer, called util_is_integer. This function behaves as expected and returns TRUE also for the variable a from the example above. As with all utility functions, util_is_integer is not exported by the dataquieR package but can be accessed from functions in the package. Again, we recommend copying the function to the global environment when drafting a function without compiling the package.
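A minimal sketch of such a check, following the example in the is.integer manual page (the actual dataquieR implementation may differ):

util_is_integer <- function(x, tol = .Machine$double.eps^0.5) {
  # check the values, not the declared storage type
  abs(x - round(x)) < tol
}
util_is_integer(a)   # TRUE for a <- 12 from the example above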

my_function_1 <- function(resp_vars,     # vector of response variables, i.e. each of 
                                         # these variables is analysed 
                          co_vars,       # vector of additional variables used for 
                                         # adjustment or similar
                          group_vars,    # CAVE: currently only one grouping variable
                          label_col,     # meta data variable attribute to use for naming variables 
                                         # in the output
                          study_data,    # data frame of study records
                          meta_data      # data frame of meta data attributes
) {
  
  # Replace the column names of the data in "study_data" by the corresponding short variable
  # labels. This step ensures comprehensive output. Convention: not more than 20 characters.
  
  # "meta_data" must provide a row for each column in "study_data", a unique and alphanumeric
  # label must be contained.
  
  translations <- setNames(meta_data[[label_col]], nm = meta_data$VAR_NAMES) # generate a named 
                                                                             # vector translating 
                                                                             # names to labels
  translationEnv <- as.environment(as.list(translations))  # convert it to an environment 
                                                           # for use with mget
  translated <- mget(colnames(study_data), translationEnv) # use mget to get translated 
                                                           # column labels
  ds1 <- study_data                                        # do not modify the original data frame
  colnames(ds1) <- unlist(translated)                      # use the translated labels as new column names
  
  r <- lapply(seq_along(ds1), function(v) {
    sum(meta_data[v, "INCL_SOFT_LIMIT_UP"] < ds1[, v])
  })
  
  names(r) <- colnames(ds1)
  r <- simplify2array(r)
  return(r)
}

The mapping between variable names and meta data variable labels is performed by the utility function util_prepare_dataframes, which can be used like a C macro. After util_prepare_dataframes has been called without arguments from a function that follows the conventions listed here, a new object named ds1 is created in the function’s local environment. Using this, the function above looks as follows:

my_function_1b <- function(resp_vars,     # vector of response variables, i.e. each of 
                                         # these variables is analysed 
                          co_vars,       # vector of additional variables used for 
                                         # adjustment or similar
                          group_vars,    # CAVE: currently only one grouping variable
                          label_col,     # meta data variable attribute to use for naming variables 
                                         # in the output
                          study_data,    # data frame of study records
                          meta_data      # data frame of meta data attributes
) {
  
  util_prepare_dataframes()
  
  r <- lapply(seq_along(ds1), function(v) {
    sum(meta_data[v, "INCL_SOFT_LIMIT_UP"] < ds1[, v])
  })
  
  names(r) <- colnames(ds1)
  r <- simplify2array(r)
  return(r)
}

Note that util_prepare_dataframes is a utility function and hence intended for package-internal use only. The package dataquieR does not export that function; it will only be found if called from within that package or if called explicitly with the disadvised :::-operator during development. When drafting functions, we recommend importing all used functions into the global environment as follows:

util_prepare_dataframes <- dataquieR:::util_prepare_dataframes

Once a function has been integrated into dataquieR, it will find the package-internal functions without any tweaks.

Please refer to the full documentation of util_prepare_dataframes for an exhaustive reference.

my_function_2 <- function(resp_vars,     # vector of response variables, i.e. each of 
                                         # these variables is analysed 
                          co_vars,       # vector of additional variables used for 
                                         # adjustment or similar
                          group_vars,    # CAVE: currently only one grouping variable
                          label_col,     # meta data variable attribute to use for naming variables 
                                         # in the output
                          study_data,    # data frame of study records
                          meta_data      # data frame of meta data attributes
) {
  ## in case of a function that handles one variable at once:
  if (length(resp_vars) > 1)
    stop("my_function_2 cannot handle more than one variable at once.")
  # ...
}

All functions should carefully check all their input and abort execution with understandable error messages if some preconditions are not met. To cover the most common cases, some utility functions have been implemented (util_prepare_dataframes and util_correct_variable_use). util_prepare_dataframes checks, for the function it has been called by, whether its mandatory standard function arguments study_data and meta_data provide the expected valid data and whether these two data frames match. util_correct_variable_use can be called for each argument referring to one or more variables by their names. It can be parameterised to check for the most common mistakes, e.g. too few / too many variable names, or referred variables of unsuitable data types.

Helper Functions

There are more helper functions besides the two mentioned in the sections Checks of formals and arguments and Robustness checks. All internal helper functions should be prefixed by util_. The util_ functions will not be exported by the R package, because they are not intended to be used by end users directly. Because users of the functions will also need some helper functions for processing data and generating quality reports, there are two more prefixes, namely prep_ for general data processing and pipe_ for functionality related to automated report generation.

Documentation

Documentation in this project is function specific, depending on whether the user is expected to edit the code.

Please refer to the roxygen2 package documentation, the R documentation about packages, and the corresponding vignettes.

Data quality implementation

This type of function will mostly be used by end users and therefore has two routes of documentation:

a. For handling and meaning of the code, use RMarkdown (https://rmarkdown.rstudio.com/) for all documentation, i.e. function help pages (http://r-pkgs.had.co.nz/vignettes.html#markdown) and package vignettes (https://roxygen2.r-lib.org/articles/markdown.html).

b. For integration into the R-package, use roxygen2 comments (https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html); a sketch follows below.
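A sketch of a function skeleton documented with roxygen2 comments (names and tags illustrative):

#' Count values above the soft upper limit
#'
#' @param resp_vars  the variables to analyse
#' @param study_data data frame of study records
#' @param meta_data  data frame of meta data attributes
#'
#' @return a named list containing a SummaryTable
my_documented_function <- function(resp_vars, study_data, meta_data) {
  # ...
}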

Reporting functions

For handling and meaning of code use RMarkdown for all documentation, i.e. function help pages and package vignettes.

Helper functions

All functions should be documented using comments as above and also using Roxygen2 comments.

Input

Study data

The structure of study data has to comply with the following conventions to be applicable in our framework:

  • Study data is usually stored in tables (in R we use instances of the class data.frame, data frames).

  • Study data frames have one sample/patient per row and one variable per column. This corresponds to a “wide format”. Conversion from long/narrow format to wide format can be performed in R using several packages.

  • The column headers of study data frames are variable names.

  • Variable names must be unique.

  • Variable names do not contain blanks or other non-alphanumeric characters except for dots and underscores. They do not start with non-alphanumeric characters.

  • In case of repeated measurements, the names of variables measured repeatedly should receive a suffix indicating the measurement order (e.g. blood_01, blood_02, blood_03).

Meta data

Meta data are arguments to the indicator functions. They are provided to these functions as meta data frames via the function argument meta_data. For functions that handle only one variable at once, the structure of the meta data is identical to that for multivariate functions. All functions extract the relevant columns from the full meta data frame provided to them.1 For further details see the specific examples below.

Output

Elements

The output of a data quality function must contain the following elements:

  • The data quality related results as text, graph or table.

  • If possible, machine-readable output of the data underlying the results (particularly for graphs), preferably in the form of a data frame

Output should be usable in RMarkdown files.

It is desirable not to implement a new function for each output option. Returned data frames as well as ggplot2-based graphics can be modified and laid out later, as sketched below.
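For example, a ggplot2 graph returned by an indicator function can be re-themed by the caller rather than via dedicated function arguments (a sketch; some_indicator_function is hypothetical, SummaryPlot follows the output conventions described below):

result <- some_indicator_function(resp_vars  = "SBP_0",
                                  study_data = study_data,
                                  meta_data  = meta_data)
# modify the returned graph after the fact
result$SummaryPlot + ggplot2::theme_minimal()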

If unavoidable, we accept function arguments to control the output.

Overview of output elements

The output of the functions is given as a named R list. The following names have been agreed upon:

  • SummaryTable
    • a data frame with values about the data quality (e.g. the percentage of missings per variable)
  • SummaryPlot
    • a ggplot2 graph visualising the results

These will be amended by:

  • DQvalue
    • a categorical value rating the data quality output (critical, undecided, good, …)

If a function provides specific output for a set of response variables (resp_vars missing or a vector), these specific outputs should be elements of the list, named by their VAR_NAMES. Additionally, such functions can still provide a SummaryTable and/or a SummaryPlot for all variables. A summary DQvalue should also be available.

As an example, a function may generate the data frame below as its primary result:

df1

This data frame can then be used for a respective graph and both results are returned:

# COMMENT: call ggplot
  p1 <- ggplot(df1, aes(x = x1, y = y_prob)) +
        theme_bw() +
        geom_bar(aes(fill = cave), stat = "identity") +
        scale_fill_manual(values = c("#2166AC", "#B2182B"), guide = "none") +
        geom_errorbar(aes(ymin = lcl, ymax = ucl), width = 0.1) +
        geom_line(data = df2, aes(x = x2, y = y_line, color = "#E69F00"),
            linewidth = 2) +
        scale_color_manual(values = c("#E69F00"), guide = "none")
return(list(SummaryTable = df1, SummaryPlot = p1))

Axes in plots

Functions are written without precise knowledge of the application context. Therefore it must be safeguarded that the information remains readable even if, for example, the number of variables or clusters grows very large. To support this:

  • Categories of known dimension should be placed on the horizontal axis

  • Categories of unknown dimension (e.g. the number of variables) should be placed on the vertical axis

This applies primarily to printed text document formats (pdf, docx). For an HTML display of results, the limitations on the handling of axes apply to a lesser degree.

These conventions should later be controllable by a function argument, to facilitate compliance with external restrictions.
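For example, a bar chart over an unknown number of variables can place the variables on the vertical axis (a sketch; summary_df with columns Variables and Missings_pct is hypothetical):

library(ggplot2)
ggplot(summary_df, aes(x = Variables, y = Missings_pct)) +
  geom_bar(stat = "identity") +
  coord_flip()   # the open-ended variable list ends up on the vertical axis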

Colours

The relevant data quality information should be conveyed not only by colours but also by additional elements, e.g. the magnitude of an effect size, or a line indicating a range or variance.

Colours in plots

Using ggplot2 allows the colours to be manipulated later. Nevertheless, we recommend the following colour conventions.

Colours for discrete scales

For discrete scales, such as interviewers or centres, we recommend generating colour-blind friendly figures as described here: http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/

We have augmented the list by two colours: grey and brown.

QS_Name     hex_code   red  green  blue
qs_black    #000000      0      0     0
qs_gray     #B0B0B0    176    176   176
qs_orange   #E69F00    230    159     0
qs_skyblue  #56B4E9     86    180   233
qs_green    #009E73      0    158   115
qs_yellow   #F0E442    240    228    66
qs_blue     #0072B2      0    114   178
qs_red      #D55E00    213     94     0
qs_purple   #CC79A7    204    121   167
qs_brown    #8C510A    140     81    10

These colours are not recommended for the representation of data quality issues.
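Expressed as an R vector for use with ggplot2’s manual scales (a sketch):

qs_colours <- c(qs_black  = "#000000", qs_gray    = "#B0B0B0",
                qs_orange = "#E69F00", qs_skyblue = "#56B4E9",
                qs_green  = "#009E73", qs_yellow  = "#F0E442",
                qs_blue   = "#0072B2", qs_red     = "#D55E00",
                qs_purple = "#CC79A7", qs_brown   = "#8C510A")
# e.g. ggplot2::scale_fill_manual(values = qs_colours)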

Colours for continuous scales

For continuous scales, such as magnitudes of effects, we recommend generating colour-blind friendly figures as described here: http://colorbrewer2.org/#type=sequential&scheme=PuBu&n=9. Alternatively, we recommend the use of Viridis:

https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html

These colours are not recommended for the representation of data quality classifications.
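Both recommendations are directly available in ggplot2 (a sketch; p is an existing ggplot object):

# ColorBrewer sequential palette as linked above
p + ggplot2::scale_fill_distiller(palette = "PuBu")
# or the Viridis scale
p + ggplot2::scale_fill_viridis_c()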

Colours indicating the magnitude of data quality issues

To graph data quality issues, we recommend generating colour-blind friendly figures as described here: http://colorbrewer2.org/#type=sequential&scheme=PuBu&n=9

The red pole should always be used to identify problems.

Colours in tables

We assume a function produces a small summary data frame with data quality results. Using the R-package “formattable”, such a data frame can be formatted as follows (crude example):
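The original example data are not reproduced here; a minimal stand-in sketch (variable names illustrative):

library(formattable)
# a stand-in for the data frame assumed above
tab <- data.frame(Variables    = c("SBP_0", "DBP_0", "HF_0"),
                  Missings_pct = c(12.3, 2.1, 0.5))
# shade the percentage column from white to qs_red
formattable(tab, list(Missings_pct = color_tile("white", "#D55E00")))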

Other important options besides formattable are DT::datatable and knitr::kable combined with the kableExtra package.

Colours for tables should be based on the colours mentioned above for graphical output.

Functionality of output

Data quality related output should:

  • allow for an overview of all checked data structures (e.g. variables)

  • allow for an overview of all checked data structures with a data quality finding

  • allow for an overview of all checked data structures with a data quality finding that crosses a defined threshold

  • use space as efficiently as possible

  • allow for an understanding of tables or graphs without using other information sources

APPENDIX

List of variable attributes

The outline of attributes is only defined as far as it is needed to run a single data quality assessment routine. Variable attributes comprise static meta data attributes (e.g. limits for a metric study variable) and process variable assignments (e.g. the study variable that stores the ID of the device used to measure some outcome variable).

A list of attributes is provided in the table below with suggested naming conventions. Attributes starting with the prefix KEY_ contain, for each single study variable, its references to other study variables, identified by their respective VAR_NAMES entry. The meta data attribute VARIABLE_ROLE categorises the variables. An automated analysis of the properties of a study variable related to the dimension Accuracy considers all KEY_ attributes of that study variable that refer to study variables of the category PROCESS.

Table: List of variable attributes (rendered from media/variable_attributes.xlsx; not reproduced here)

The prefix INCL_ is used for generated variable attributes added automatically for internal use. The formats mentioned in the table above are:

Name                  Description                                        Examples
String                character data                                     labels such as BPSYST_01
Numeric               numeric data                                       variable order numbers
Enumeration(A, B, C)  categorical data with the listed categories        data types
Assignment            assignments expressed using = and separated        0 = females | 1 = males
                      using |
CSV                   comma separated values                             missing code lists like 99999, 88888, 12345
Interval              interval notation using [ ]/( ) for including/     [50;Inf), [0;10], (-Inf;2]
                      excluding limits and Inf/-Inf for open intervals
Variable Reference    reference to another variable storing meta         variable names as given in the meta data
                      information about each value of the current        attribute VAR_NAMES, e.g. v00019 or SBP_0
                      variable

The recommended attributes will be expanded according to the needs of the project and can be extended to the needs of the users of the generated R routines.

To facilitate editing/creating meta data attributes, a Shiny App has been implemented.

Lists in variable attributes

Some meta data vary in length across study variables. Lists of different lengths cannot be represented in a rectangular data frame. This is the case, e.g., for missing code lists.

Missing codes for two study variables:

Study variable “v33247”: 99980, 99974, 99976, 99982, 99975, 99992, 99990, 99989, 99995

Study variable “v33259”: 99998, 99999, 99984, 99990, 99975, 99982, 99992, 99977, 99986, 99989, 99981, 99987, 99983

Such structures are given as comma separated strings within the meta data data frames. A function has to know about the exact meaning of such a list and can then handle it accordingly.

# Example use in a function
my_function_3 <- function(study_data, meta_data) {

  ml_var1 <- subset(meta_data, VAR_NAMES == "var_1", select = "MISSING_LIST", drop = TRUE)
  # split the comma separated string and convert it to a numeric vector
  ml_var1_vector <- as.numeric(trimws(strsplit(ml_var1, ",", fixed = TRUE)[[1]]))

  value1 <- 99980
  value2 <- 75.35

  if (value1 %in% ml_var1_vector)
    print("For var_1, value1 is a missing code")

  if (value2 %in% ml_var1_vector)
    print("For var_1, value2 is a missing code")

}

my_function_3(study_data = study_data, meta_data = meta_data)
## [1] "For var_1, value1 is a missing code"

Missing codes

If missing codes are used consistently for all variables of one analysis, the corresponding functions from the dimension Completeness can generate output with labels, if a table translating missing codes to labels is given. This table should be in CSV format, using ; as field separator and containing a header line. The two columns should be CODE_VALUE and CODE_LABEL. Such files can be read by readr::read_csv2 or utils::read.csv2, as sketched after the example below.

An example is given below:

  CODE_VALUE;CODE_LABEL
  99980;Missing - other reason
  99981;Missing - exclusion criteria
  99982;Missing - refusal
  99983;Missing - not assessable 
  99984;Missing - technical problem   
  [...]
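Such a file can be read with base R (a sketch; the file name is illustrative):

# read.csv2 uses ";" as field separator by default
code_labels <- utils::read.csv2("missing-codes.csv",
                                stringsAsFactors = FALSE)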

Contradictions

Contradiction checks are widely used in all kinds of studies. A classical example is a non-zero number of pregnancies for male study participants. Such checks are available in the dimension Consistency. They are provided as a separate table, which can be referred to by the meta data attribute CONTRADICTIONS and which, more importantly, in turn refers to variables by their labels (variable attribute LABEL).

Each rule refers to two items, which can be variable names, values, or lists of levels/categories (at least one item must refer to a variable). The rules also refer to a function parameterised by these items. The rule then expresses a contradiction.

The available functions are given below (A and B refer to the referred items):

  • A_greater_equal_B_vv, A_greater_than_B_vv: Values of variable A should not be greater than or equal to those of variable B, e.g. age at baseline should not be greater than age at follow-up.
  • A_less_equal_B_vv, A_less_than_B_vv: see A_greater_equal_B_vv, but here A should not be less than B, e.g. age at follow-up should not be less than age at baseline.
  • A_levels_and_B_gt_value_lc: If A has certain levels/categories, B should not be greater than a value, e.g. the value of pregnant is yes but age is greater than 70.
  • A_levels_and_B_levels_ll: If A has certain levels/categories, B should not have certain other levels, e.g. if the value of smoking is no, the value of cigarette consumption should not be heavy, frequently, or rarely.
  • A_levels_and_B_lt_value_lc: If A has certain levels/categories, B should not be less than a value – see A_levels_and_B_gt_value_lc.
  • A_not_equal_B_vv: A and B should not differ (gender at baseline and gender at follow-up should usually be the same).
  • A_present_and_B_vv, A_present_not_B_vv: A has a value and B has too / has not, e.g. the expected birth date is available but the presumed date of fertilisation is not.

The rules are stored in a CSV file using # as field separator and containing a header line. Each rule has an ID, which can be amended in the meta data to make it easier to find the variables covered by contradiction rules. The rules may also have labels for improved output formatting, and tags/categories for stratifying and aggregating contradictions. The names of the columns are:

  • ID
  • Function_name
  • A
  • A_levels
  • A_value
  • B
  • B_levels
  • B_value
  • Label
  • tag (optional, comma separated values allowed)

An example is given below:

  ID#Function_name#A#A_levels#A_value#B#B_levels#B_value#Label
  1001#A_less_than_B_vv#AGE_1#NA#NA#AGE_0#NA#NA#Age follow-up
  1002#A_not_equal_B_vv#SEX_1#NA#NA#SEX_0#NA#NA#Sex follow-up
  1003#A_less_than_B_vv#EDUCATION_1#NA#NA#EDUCATION_0#NA#NA#Education follow-up
  1004#A_levels_and_B_levels_ll#EATING_PREFS_0#vegetarian#NA#MEAT_CONS_0#1-2d a week | 3-4d a week | 5-6d a week | daily#NA#Nutrition inconsistency vegetarian
  1005#A_levels_and_B_levels_ll#EATING_PREFS_0#vegan#NA#MEAT_CONS_0#1-2d a week | 3-4d a week | 5-6d a week | daily#NA#Nutrition inconsistency vegan   
  [...]
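Such a rule file can be read with a standard reader (a sketch; the file name is illustrative). Note that read.table’s default comment character must be disabled, because # serves as the field separator here:

rules <- utils::read.table("contradictions.csv", sep = "#",
                           header = TRUE, comment.char = "",
                           stringsAsFactors = FALSE)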

To facilitate editing/creating contradiction rules, a Shiny App has been implemented.


  1. Function arguments in R: Technically, R passes arguments by reference but employs copy-on-write, making arguments look as if they were passed by value. Therefore, passing around large constant data frames is usually not a performance problem, except for specific forms of parallel computing.↩︎