Description

The acc_varcomp function examines the impact of so-called process variables on the measurement variables through variance-based models and intraclass correlations (ICC). This implementation is model-based. The function can be applied to variables of type float.

Note: The term ICC is more frequently used to describe the agreement between different observers, examiners, or even devices. In such settings, good agreement is the goal: ICC values lie in the interval \([-1; \: 1]\), and an ICC close to \(1\) is desired (Koo and Li 2016, Müller and Büttner 1994).

In multilevel analysis, the ICC is interpreted differently; see Snijders and Bosker (1999). In this context, the proportion of variance explained by the group levels indicates an influence of at least one level of the respective group_vars.

Irrespective of the terminology used, with regard to data quality it is desired that process variables do not systematically explain components of variance. Therefore, ICC values close to \(0\) are desired.
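For a model with a single random intercept, the ICC is commonly written as a ratio of variance components. The formula below is a sketch of that standard definition, not necessarily the exact expression used inside acc_varcomp; the symbols \(\sigma^2_{\text{group}}\) and \(\sigma^2_{\text{residual}}\) are notation introduced here for illustration.

```latex
\[
\mathrm{ICC} = \frac{\sigma^2_{\text{group}}}{\sigma^2_{\text{group}} + \sigma^2_{\text{residual}}}
\]
```

Here \(\sigma^2_{\text{group}}\) is the variance between the levels of the grouping variable (e.g., observers) and \(\sigma^2_{\text{residual}}\) is the remaining variance after adjustment. In this form the ICC lies between \(0\) and \(1\), and a value near \(0\) means that the grouping variable explains almost none of the variance.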

acc_varcomp is an implementation of the Unexpected location indicator, which belongs to the Unexpected distributions domain in the Accuracy dimension.

For more details, see the user’s manual and source code.

Usage and arguments

acc_varcomp(
  resp_vars = NULL,
  group_vars = NULL,
  co_vars = NULL,
  min_obs_in_subgroup = 30,
  min_subgroups = 5,
  label_col = NULL,
  threshold_value = 0.05,
  study_data = sd1,
  meta_data = md1
)

The function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.
  • meta_data: mandatory, the data frame containing the study data’s metadata.
  • resp_vars: mandatory, a character vector specifying the measurement variable(s) of interest. The variables must be of type float.
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
  • group_vars: the variable used for grouping (e.g., observer, device, reader). Defaults to NULL for output without grouping.
  • co_vars: optional, a vector of covariables, e.g. age and sex for adjustment.
  • min_obs_in_subgroup: optional, only relevant if group_vars is used. Specifies the minimum number of observations required to include a subgroup (level) of the group_vars in the analysis. Subgroups with fewer observations are excluded. The default is 30.
  • min_subgroups: optional, only relevant if group_vars is used. Specifies the minimum number of subgroups (levels) that the group_vars must contain. If the variable defined in group_vars has fewer subgroups, it is not used for analysis. The default is 5.
  • threshold_value: optional, a numerical value ranging from 0 to 1. If no value is specified, the default value of 0.05 will be used.
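
The interplay of min_obs_in_subgroup and min_subgroups can be illustrated with a small base R sketch. The helper filter_subgroups and the toy data are hypothetical and not part of dataquieR. The sketch counts rows per level, whereas the package may count non-missing values of the response variable instead.

```r
# Hypothetical helper (not part of dataquieR): keep only subgroups that are
# large enough, then check whether enough subgroups remain.
filter_subgroups <- function(data, group_var,
                             min_obs_in_subgroup = 30,
                             min_subgroups = 5) {
  # Count rows per level of the grouping variable (NA levels are ignored)
  counts <- table(data[[group_var]])
  keep_levels <- names(counts)[counts >= min_obs_in_subgroup]

  # Drop rows that belong to subgroups with too few observations
  kept <- data[data[[group_var]] %in% keep_levels, , drop = FALSE]

  list(
    data = kept,
    n_subgroups = length(keep_levels),
    usable = length(keep_levels) >= min_subgroups
  )
}

# Toy data: observers "A" and "B" have 40 rows each, "C" has only 10
toy <- data.frame(
  observer = rep(c("A", "B", "C"), times = c(40, 40, 10)),
  value = seq_len(90)
)

res <- filter_subgroups(toy, "observer",
                        min_obs_in_subgroup = 30, min_subgroups = 2)
res$n_subgroups  # 2: observer "C" was dropped
res$usable       # TRUE: two subgroups satisfy min_subgroups = 2
nrow(res$data)   # 80
```

With the default min_subgroups = 5, the same toy data would be flagged as not usable, because only two subgroups remain after filtering.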

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

Similar to the approach of the acc_margins function, we assume that at least one examiner does not adhere to the SOP and may influence the measurement process:

v00000  v00001    v00002  v00003  v00004  v00005  v01003  v01002  v00103  v00006
3       LEIIX715  0       49      127     77      49      0       40-49   3.8
1       QHNKM456  0       47      114     76      47      0       40-49   1.9
1       HTAOB589  0       50      114     71      50      0       50-59   0.8
5       HNHFV585  0       48      120     65      48      0       40-49   3.8
1       UTDLS949  0       56      119     78      56      0       50-59   4.1
5       YQFGE692  1       47      133     81      47      1       40-49   9.5
1       AVAEH932  0       53      114     78      53      0       50-59   5.0
3       QDOPT378  1       48      116     86      48      1       40-49   9.6
3       BMOAK786  0       44      115     71      44      0       40-49   2.0
5       ZDKNF462  0       50      116     74      50      0       50-59   2.4


For the acc_varcomp function, the columns DATA_TYPE, MISSING_LIST and HARD_LIMITS in the metadata are relevant:

    VAR_NAMES  LABEL                MISSING_LIST                                                                                                                   DATA_TYPE  HARD_LIMITS
9   v00004     SBP_0                99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995  float      [80;180]
10  v00005     DBP_0                99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995  float      [50;Inf)
11  v00006     GLOBAL_HEALTH_VAS_0  99980 | 99983 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995                                          float      [0;10]
14  v00009     ARM_CIRC_0           99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995  float      [0;Inf)
21  v00014     CRP_0                99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995                  float      [0;Inf)
22  v00015     BSG_0                99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995                  float      [0;100]


Here, the function is applied to examine the agreement between observers (USR_BP_0) for the systolic and diastolic blood pressure variables (SBP_0 and DBP_0, respectively):

varcomp_1 <- acc_varcomp(resp_vars = c("SBP_0", "DBP_0"),
                group_vars = c("USR_BP_0"),
                co_vars = c("AGE_0", "SEX_0"),
                label_col = "LABEL",
                min_obs_in_subgroup = 20,
                min_subgroups = 3,
                study_data = sd1,
                meta_data = md1)
## Did not find any 'SCALE_LEVEL' column in item-level meta_data. Predicting it from the data -- please verify these predictions, they may be wrong and lead to functions claiming not to be reasonably applicable to a variable.
## using the same group var "USR_BP_0" for all resp_vars
names(varcomp_1)
## [1] "SummaryTable"           "SummaryData"            "ScalarValue_max_icc"   
## [4] "ScalarValue_argmax_icc"

Output: Summary table

The summary data frame is accessed using varcomp_1$SummaryTable:

Variables  Object    Model.Call                              ICC_acc_ud_loc  Class.Number  Mean.Class.Size  Median.Class.Size  Min.Class.Size  Max.Class.Size  convergence.problem  GRADING
SBP_0      USR_BP_0  SBP_0 ~ AGE_0 + SEX_0 + (1 | USR_BP_0)  0.153           15            165.8            160                29              413             FALSE                1
DBP_0      USR_BP_0  DBP_0 ~ AGE_0 + SEX_0 + (1 | USR_BP_0)  0.172           15            165.0            162                28              413             FALSE                1

In addition to this table, two scalar values are returned: “ScalarValue_max_icc”, the highest ICC (proportion of variance explained by the grouping variable) across all response variables, and “ScalarValue_argmax_icc”, the response variable for which this highest ICC was observed.
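
The relation between the two scalar values and the summary table can be reproduced by hand. The following base R sketch copies the ICC values from the table above into a small data frame. It mirrors the description of the scalars and is not the package's internal code; the exact format of the returned scalars may differ.

```r
# ICC values copied from the SummaryTable shown above
summary_table <- data.frame(
  Variables = c("SBP_0", "DBP_0"),
  ICC_acc_ud_loc = c(0.153, 0.172),
  stringsAsFactors = FALSE
)

# Highest ICC across the response variables
max_icc <- max(summary_table$ICC_acc_ud_loc)

# Response variable at which the highest ICC occurs
argmax_icc <- summary_table$Variables[which.max(summary_table$ICC_acc_ud_loc)]

max_icc     # 0.172
argmax_icc  # "DBP_0"
```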

Interpretation

The ICC, or the analysis of variance components, should be applied in combination with acc_margins. Extended tests showed that the ICC is less susceptible to false-positive indications of data quality issues than margins.

Algorithm of the implementation

  1. Missing codes are removed from resp_vars (if defined in the metadata).
  2. Values outside the limits defined in the metadata are removed.
  3. A linear mixed-effects model is estimated for each of the resp_vars, with co_vars as fixed effects for adjustment and group_vars as a random intercept.
  4. An output data frame is generated for group_vars indicating the ICC.
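
The four steps above can be sketched on simulated data. This is a minimal illustration, assuming the lme4 package is installed; it is not the dataquieR implementation. All variable names, missing codes, limits, and simulated effect sizes are made up for the example. The model formula follows the pattern shown in the Model.Call column of the summary table.

```r
library(lme4)

set.seed(42)

# Simulated stand-in for study data (all names and values are hypothetical)
n_obs <- 15    # number of observers
n_per <- 100   # measurements per observer
observer <- factor(rep(seq_len(n_obs), each = n_per))
age <- round(runif(n_obs * n_per, 20, 80))
sex <- rbinom(n_obs * n_per, 1, 0.5)
observer_effect <- rnorm(n_obs, mean = 0, sd = 3)
sbp <- 100 + 0.4 * age + 3 * sex + observer_effect[observer] +
  rnorm(n_obs * n_per, mean = 0, sd = 10)
toy <- data.frame(sbp, age, sex, observer)

# Inject one missing code and one implausible value for demonstration
toy$sbp[1] <- 99980
toy$sbp[2] <- 500

# Step 1: replace missing codes by NA
missing_codes <- c(99980, 99981)
toy$sbp[toy$sbp %in% missing_codes] <- NA

# Step 2: remove values outside the assumed hard limits [80; 180]
out_of_range <- !is.na(toy$sbp) & (toy$sbp < 80 | toy$sbp > 180)
toy$sbp[out_of_range] <- NA

# Step 3: linear mixed-effects model with covariates as fixed effects
# and the observer as a random intercept (rows with NA are dropped)
fit <- lmer(sbp ~ age + sex + (1 | observer), data = toy)

# Step 4: ICC as the share of the observer variance in the total variance
vc <- as.data.frame(VarCorr(fit))
var_group <- vc$vcov[vc$grp == "observer"]
var_resid <- vc$vcov[vc$grp == "Residual"]
icc <- var_group / (var_group + var_resid)
icc
```

In the simulation, the observer standard deviation is 3 and the residual standard deviation is 10, so the true ICC is 9 / (9 + 100), roughly 0.08. With only 15 observers, the estimate can deviate noticeably from this value.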

Limitations

A sufficient number of observations is required within each level of the group_vars. The minimum can be specified with the argument min_obs_in_subgroup. Nevertheless, the estimation of the linear mixed-effects model may fail to converge if the numbers of observations are low or imbalanced across levels.

Concept relations

Koo, T.K., and Li, M.Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine 15, 155–163.
Müller, R., and Büttner, P. (1994). A critical discussion of intraclass correlation coefficients. Statistics in Medicine 13, 2465–2476.
Snijders, T., and Bosker, R. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. (Sage Publications).