dataquieR
This tutorial introduces the creation of data quality reports in R with dataquieR.
Creating reports requires the appropriate setup of study data and metadata, as shown in the figure below:
We can load the synthetic example data from dataquieR via the following:
load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data
This example study data has 3000 observations and 53 variables:
sd1
v00000 | v00001 | v00002 | v00003 | v00004 | v00005 | v01003 | v01002 | v00103 | v00006 |
---|---|---|---|---|---|---|---|---|---|
3 | LEIIX715 | 0 | 49 | 127 | 77 | 49 | 0 | 40-49 | 3.8 |
1 | QHNKM456 | 0 | 47 | 114 | 76 | 47 | 0 | 40-49 | 1.9 |
1 | HTAOB589 | 0 | 50 | 114 | 71 | 50 | 0 | 50-59 | 0.8 |
5 | HNHFV585 | 0 | 48 | 120 | 65 | 48 | 0 | 40-49 | 3.8 |
1 | UTDLS949 | 0 | 56 | 119 | 78 | 56 | 0 | 50-59 | 4.1 |
5 | YQFGE692 | 1 | 47 | 133 | 81 | 47 | 1 | 40-49 | 9.5 |
1 | AVAEH932 | 0 | 53 | 114 | 78 | 53 | 0 | 50-59 | 5.0 |
3 | QDOPT378 | 1 | 48 | 116 | 86 | 48 | 1 | 40-49 | 9.6 |
3 | BMOAK786 | 0 | 44 | 115 | 71 | 44 | 0 | 40-49 | 2.0 |
5 | ZDKNF462 | 0 | 50 | 116 | 74 | 50 | 0 | 50-59 | 2.4 |
We can see that the study data variables have abstract names (e.g. v00001, v00002). Hence, the appropriate labels must be mapped from the metadata. Besides all variables' data types and labels, the metadata stores further expected characteristics and static information about the study data.
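To make the idea of mapping labels from metadata concrete, here is a minimal base R sketch. It uses hypothetical toy objects (`sd_toy`, `md_toy`) rather than the package data and assumes only that the metadata has `VAR_NAMES` and `LABEL` columns, as the example metadata does. It illustrates the lookup idea and is not dataquieR's internal implementation.

```r
# Hypothetical toy study data with abstract variable names
sd_toy <- data.frame(
  v00002 = c(0L, 0L, 1L),
  v00003 = c(49L, 47L, 48L)
)

# Hypothetical toy metadata: one row per study variable
md_toy <- data.frame(
  VAR_NAMES = c("v00002", "v00003"),
  LABEL     = c("SEX_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# Named lookup vector: variable name -> label
label_lookup <- setNames(md_toy$LABEL, md_toy$VAR_NAMES)

# Rename a copy of the study data. Names without a metadata row would
# become NA, so we stop if that happens.
sd_labelled <- sd_toy
names(sd_labelled) <- unname(label_lookup[names(sd_toy)])
stopifnot(!anyNA(names(sd_labelled)))

names(sd_labelled)  # "SEX_0" "AGE_0"
```

Working on a copy keeps the original variable names available, which is useful because the metadata is keyed by them.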
We can read in the example metadata via the following:
load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data
md1
VAR_NAMES | LABEL | DATA_TYPE | VALUE_LABELS | MISSING_LIST | JUMP_LIST | HARD_LIMITS | DETECTION_LIMITS |
---|---|---|---|---|---|---|---|
v00000 | CENTER_0 | integer | 1 = Berlin \| 2 = Hamburg \| 3 = Leipzig \| 4 = Cologne \| 5 = Munich | NA | NA | NA | NA |
v00001 | PSEUDO_ID | string | NA | NA | NA | NA | NA |
v00002 | SEX_0 | integer | 0 = females \| 1 = males | NA | NA | NA | NA |
v00003 | AGE_0 | integer | NA | NA | NA | [18;Inf) | NA |
v00103 | AGE_GROUP_0 | string | NA | NA | NA | NA | NA |
v01003 | AGE_1 | integer | NA | NA | NA | [18;Inf) | NA |
v01002 | SEX_1 | integer | 0 = females \| 1 = males | NA | NA | NA | NA |
v10000 | PART_STUDY | integer | 0 = no \| 1 = yes | NA | NA | NA | NA |
v00004 | SBP_0 | float | NA | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | NA | [80;180] | [0;265] |
v00005 | DBP_0 | float | NA | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | NA | [50;Inf) | [0;265] |
For more information on the synthetic example data and metadata, see here.
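The metadata cells above follow compact text conventions: several entries in one cell are separated by " | ", and limits are written as intervals. The following base R sketch shows how such cells could be interpreted by hand. The values are hypothetical toy inputs written in the style of the table. The sketch assumes that MISSING_LIST holds codes denoting missing values and that HARD_LIMITS gives the admissible range. dataquieR handles these columns itself, so this is only an illustration, not the package's implementation.

```r
# Hypothetical toy cells in the style of the example metadata
value_labels <- "0 = females | 1 = males"
missing_list <- "99980 | 99981 | 99982"
sbp_raw      <- c(127, 114, 99981, 190)  # made-up measurements

# Split a cell on the "|" separator (fixed string, not a regex) and trim
split_cell <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# VALUE_LABELS: "code = label" pairs become factor levels and labels
pairs  <- strsplit(split_cell(value_labels), "=", fixed = TRUE)
codes  <- as.integer(trimws(vapply(pairs, `[`, character(1), 1)))
labels <- trimws(vapply(pairs, `[`, character(1), 2))
sex    <- factor(c(0L, 1L, 0L), levels = codes, labels = labels)

# MISSING_LIST: replace the listed codes by NA
missing_codes <- as.numeric(split_cell(missing_list))
sbp <- ifelse(sbp_raw %in% missing_codes, NA, sbp_raw)

# HARD_LIMITS "[80;180]": flag values outside the closed interval
# (bounds typed in by hand here, not parsed from the string)
outside_limits <- !is.na(sbp) & (sbp < 80 | sbp > 180)

as.character(sex)  # "females" "males" "females"
sbp                # 127 114 NA 190
outside_limits     # FALSE FALSE FALSE TRUE
```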
We can create a default report using the dq_report() function, which requires only the study data and metadata as input:
dq_report(study_data = sd1,
          meta_data  = md1)
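Because the report relies on study data and metadata fitting together, it can help to confirm beforehand that every study data column has a metadata row. The helper below, `check_metadata_coverage()`, is a hypothetical base R function written for this sketch and is not part of dataquieR. It assumes only that the metadata has a `VAR_NAMES` column. The toy inputs are made up, and `v99999` is an invented variable name.

```r
# Hypothetical helper (not a dataquieR function): compare the columns of
# the study data with the VAR_NAMES column of the metadata
check_metadata_coverage <- function(study_data, meta_data) {
  list(
    missing_in_metadata   = setdiff(names(study_data), meta_data$VAR_NAMES),
    missing_in_study_data = setdiff(meta_data$VAR_NAMES, names(study_data))
  )
}

# Toy inputs: v99999 has no metadata row, v10000 has no data column
sd_toy <- data.frame(
  v00000 = c(3L, 1L),
  v00003 = c(49L, 47L),
  v99999 = c(1, 2)
)
md_toy <- data.frame(
  VAR_NAMES = c("v00000", "v00003", "v10000"),
  stringsAsFactors = FALSE
)

coverage <- check_metadata_coverage(sd_toy, md_toy)
coverage$missing_in_metadata    # "v99999"
coverage$missing_in_study_data  # "v10000"
```

With the objects loaded earlier, the same check would read `check_metadata_coverage(sd1, md1)`.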
The animation below shows a quick workflow for reporting data quality with dataquieR:
This example uses data from the Study of Health in Pomerania (SHIP) project, which is also included in dataquieR. You can see the example report generated by dq_report() here.
The full code shown in the animation to produce a report is given here:
# --------------------------------------------------------------------------------------------------
# D A T A   Q U A L I T Y   I N   E P I D E M I O L O G I C A L   R E S E A R C H
#
# == dataquieR
#
# dq_report() eases the generation of data quality reports as it automatically calls dataquieR functions
#
#
# Installation/Further Information -----------------------------------------------------------------
#
# Please see our website:
# https://dataquality.ship-med.uni-greifswald.de/
#
# install dataquieR from CRAN using
install.packages("dataquieR")
# Alternatively, you may install the development version as described
# on https://dataquality.ship-med.uni-greifswald.de/DownloadR.html
# load the package
library(dataquieR)
# data ---------------------------------------------------------------------------------------------
# Study of Health in Pomerania example data
sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))
summary(sd1)
# metadata
md1 <- readRDS(system.file("extdata", "ship_meta.RDS", package = "dataquieR"))
# dq_report() - a crude approach -------------------------------------------------------------------
my_dq_report <- dq_report(study_data = sd1,
                          meta_data  = md1,
                          label_col  = LABEL)
# view the results
my_dq_report
The function dq_report() accepts further arguments and settings. However, this minimal call is a good starting point for gaining insight into the data and may serve as a basis for more tailored reports.
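Which further arguments are available depends on the installed package version, so rather than listing them here, the sketch below reads them from the function itself with base R tools. It assumes that dataquieR is installed and still exports `dq_report()`. If either assumption fails, it falls back to an empty result instead of raising an error. The help page `?dq_report` remains the authoritative description.

```r
# List the argument names of dq_report() as defined in the installed version.
# Falls back to an empty vector if the package or the function is unavailable.
if (requireNamespace("dataquieR", quietly = TRUE) &&
    "dq_report" %in% getNamespaceExports("dataquieR")) {
  report_args <- names(formals(dataquieR::dq_report))
} else {
  report_args <- character(0)
  message("dataquieR::dq_report() is not available in this R installation.")
}

print(report_args)
```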