Introduction

In large epidemiological studies, data may be provided in separate data frames. In this case, data frame level metadata needs to be specified to assess the data quality of the different data frames.

Data frame (DF) level metadata for data quality reporting

Currently, the following attributes can be used by dataquieR functions (with the exception of dq_report2, which is intended for a single study data frame):

DF_NAME

This column defines the names of the data frames to be assessed. The input must be a string, referring to a data frame in the data frame cache (prep_list_dataframes).

CAVEAT: dq_report2 will only find and use data frames that are in the data frame cache.

DF_ELEMENT_COUNT

Specifies the number of expected data elements (columns) in each study data frame. The value must be an integer. The check will only be conducted if a number is entered.

As an example, the metadata for the data frame element count may contain the following information:

DF_NAME	DF_ELEMENT_COUNT
study_data	53
lab_data	6
questionnaire_data	10

DF_RECORD_COUNT

Specifies the number of expected data records (rows) in each study data frame. The value must be an integer. The check will only be conducted if a number is entered.

For instance, the data frame level count metadata may be:

DF_NAME	DF_RECORD_COUNT
study_data	3000
lab_data	2500
questionnaire_data	2900
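The two count attributes amount to comparing the shape of a data frame with the expected numbers. The following Python sketch only illustrates that idea; it is not how dataquieR implements the check. The function name check_counts and the representation of a data frame as a list of row dictionaries are assumptions made for this example.

```python
def check_counts(rows, expected_elements=None, expected_records=None):
    """Compare observed column and row counts with the expected ones.

    `rows` is a list of dicts, one dict per record (an assumption of this
    sketch). A check is skipped and reported as None when no expected
    number is given, mirroring the rule that the check is only conducted
    if a number is entered.
    """
    columns = set()
    for row in rows:
        columns.update(row.keys())

    result = {"element_count_ok": None, "record_count_ok": None}
    if expected_elements is not None:
        result["element_count_ok"] = len(columns) == expected_elements
    if expected_records is not None:
        result["record_count_ok"] = len(rows) == expected_records
    return result


# Toy stand-in for lab_data with 2 columns and 3 records (made-up values).
lab_data = [
    {"PSEUDO_ID": "a", "value": 1.2},
    {"PSEUDO_ID": "b", "value": 0.7},
    {"PSEUDO_ID": "c", "value": 3.1},
]

# Only the element count is specified, so the record count is not checked.
counts = check_counts(lab_data, expected_elements=2)
```

In this toy case the element count matches, while the record count check is skipped because no expected number was supplied.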

DF_ID_REF_TABLE

The name of the table containing the reference IDs to be compared with the IDs in the targeted data frame. The input must be a string and can refer to a sheet in the same workbook, a sheet in another workbook, or a URL.

In the example below, the IDs for the data frames study_data and lab_data are specified in the sheet called expected_id of the same workbook. In contrast, the IDs for questionnaire_data are provided in the pseudo_id sheet of the questionnaire_data.xlsx workbook. Since this is a different workbook, its path must be specified, followed by a pipe character (|) and the sheet name.

DF_NAME	DF_ID_REF_TABLE
study_data	expected_id
lab_data	expected_id
questionnaire_data	d:/data/questionnaire_data.xlsx | pseudo_id
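The two reference forms in the table above (a bare sheet name, and a workbook path followed by a pipe and a sheet name) can be told apart with a simple split. The Python sketch below is an illustration only: split_table_reference is a hypothetical helper, not a dataquieR function, and it does not cover the URL case.

```python
def split_table_reference(reference):
    """Split 'workbook | sheet' into a (workbook, sheet) pair.

    Returns (None, sheet) when no workbook is given; this sketch assumes
    that the sheet is then looked up in the same workbook as the metadata.
    """
    if "|" in reference:
        # Split on the first pipe only and drop surrounding whitespace.
        workbook, sheet = (part.strip() for part in reference.split("|", 1))
        return workbook, sheet
    return None, reference.strip()


same_workbook = split_table_reference("expected_id")
other_workbook = split_table_reference("d:/data/questionnaire_data.xlsx | pseudo_id")
```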

DF_RECORD_CHECK

A string that sets the type of check to be conducted when comparing the reference ID table with the IDs in a data frame. Two assessments are possible:

exact: tests for an exact match between DF_ID_REF_TABLE and the IDs in DF_NAME, or
subset: expects that the IDs in DF_NAME are a subset of DF_ID_REF_TABLE.

For instance, the study_data may comprise all participants from a study, while particular sections, such as lab_data or questionnaire_data, may have only been collected from a smaller participant sample:

DF_NAME	DF_RECORD_CHECK
study_data	exact
lab_data	subset
questionnaire_data	subset
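The difference between the two check types can be expressed with set comparisons. This is a minimal Python sketch of the logic described above, not dataquieR's implementation; the function name check_record_ids and the participant IDs are made up for illustration.

```python
def check_record_ids(observed_ids, reference_ids, check="exact"):
    """Compare the IDs found in a data frame with the reference IDs.

    Sets are used, so repeated IDs are ignored here; repetitions are the
    concern of a separate attribute (DF_UNIQUE_ID).
    """
    observed, reference = set(observed_ids), set(reference_ids)
    if check == "exact":
        # Every reference ID must be present, and no others.
        return observed == reference
    if check == "subset":
        # Observed IDs may be fewer, but none may be unknown.
        return observed <= reference
    raise ValueError(f"unknown check type: {check!r}")


reference_ids = ["p1", "p2", "p3", "p4"]

study_ok = check_record_ids(["p1", "p2", "p3", "p4"], reference_ids, "exact")
lab_ok = check_record_ids(["p2", "p4"], reference_ids, "subset")
lab_as_exact = check_record_ids(["p2", "p4"], reference_ids, "exact")
```

Here the smaller lab sample passes the subset check but would fail an exact check, which is why the check type has to be chosen per data frame.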

DF_UNIQUE_ID

Defines expectations on the uniqueness of the IDs across the rows of a data frame, that is, the number of times an ID may be repeated. The input must be an integer defining the number of permissible repetitions (e.g., 1 means that each ID must be unique). Enter “-1” if the number of repetitions is unknown.

In many cases, we would not expect IDs to appear more than once, for example, if study_data contains each study participant only once. In other cases, values may be measured multiple times; for instance, lab_data may hold three measurements per participant. Lastly, it may not be known whether repetitions are expected in the data, as in questionnaire_data, which is indicated by “-1”.

DF_NAME	DF_UNIQUE_ID
study_data	1
lab_data	3
questionnaire_data	-1
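One way to read this attribute is as an upper limit on how often each ID may occur. The Python sketch below follows that reading, which is an assumption of this example rather than a statement about dataquieR's internals; ids_exceeding_limit is a hypothetical helper and the IDs are made up.

```python
from collections import Counter


def ids_exceeding_limit(ids, max_repetitions):
    """Return the IDs that occur more often than `max_repetitions`.

    A value of -1 means the number of repetitions is unknown, so nothing
    is flagged. Treating the value as a maximum is an assumption of this
    sketch.
    """
    if max_repetitions == -1:
        return []
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > max_repetitions)


# Toy lab_data IDs: "a" occurs three times, "b" four times.
lab_ids = ["a", "a", "a", "b", "b", "b", "b"]

too_many_in_lab = ids_exceeding_limit(lab_ids, 3)
unknown_expectation = ids_exceeding_limit(lab_ids, -1)
```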

DF_ID_VARS

Defines all variables to be used as one single ID variable (a combined key) in a data frame. The list of variables must be a string in which the variables are separated by a pipe character (|).

For example, the ID for study_data is specified in the variable “v00001”, while for lab_data it is “PSEUDO_ID”. In some situations, the ID may be defined by a combined key specified by a list of variables, as in questionnaire_data, where the key consists of the “id” and “exdate” variables.

DF_NAME	DF_ID_VARS
study_data	v00001
lab_data	PSEUDO_ID
questionnaire_data	id | exdate
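A combined key can be thought of as the tuple of values of all listed variables in a row. The following Python sketch illustrates this under the assumption that a record is a dictionary; build_combined_key is a hypothetical helper, and the record values are invented.

```python
def build_combined_key(row, id_vars):
    """Build one key from the pipe-separated variable names in `id_vars`."""
    names = [name.strip() for name in id_vars.split("|")]
    return tuple(row[name] for name in names)


# Toy questionnaire record: the same id may occur on several exam dates,
# so only the pair (id, exdate) identifies the row.
record = {"id": "p7", "exdate": "2020-03-01", "score": 12}

combined = build_combined_key(record, "id | exdate")
single = build_combined_key({"v00001": "p1"}, "v00001")
```

A single variable name yields a one-element key, so both cases from the table can be handled the same way.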

DF_UNIQUE_ROWS

Specifies whether identical data is permitted across rows in a data frame (excluding ID variables). The input is a boolean, meaning:

false: allow repeated rows, or
true: rows must be unique.

For instance, row repetitions may be allowed for lab_data but not for study_data and questionnaire_data.

DF_NAME	DF_UNIQUE_ROWS
study_data	true
lab_data	false
questionnaire_data	true
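Because ID variables are excluded from the comparison, two rows count as identical when all their remaining values match. The Python sketch below shows one way to detect such rows; it is illustrative only, find_duplicate_rows is a hypothetical helper, and the lab values are made up.

```python
def find_duplicate_rows(rows, id_vars=()):
    """Return the indices of rows whose non-ID values repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        # Signature of the row without its ID variables, independent of
        # the order in which the columns are stored.
        signature = tuple(
            sorted((name, value) for name, value in row.items() if name not in id_vars)
        )
        if signature in seen:
            duplicates.append(index)
        else:
            seen.add(signature)
    return duplicates


lab_rows = [
    {"PSEUDO_ID": "a", "analyte": "hb", "value": 13.5},
    {"PSEUDO_ID": "b", "analyte": "hb", "value": 13.5},
    {"PSEUDO_ID": "c", "analyte": "hb", "value": 12.0},
]

repeated = find_duplicate_rows(lab_rows, id_vars=("PSEUDO_ID",))
```

With DF_UNIQUE_ROWS set to true, a non-empty result like this one would indicate a violation; with false, such repetitions would be accepted.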

Definition and use of data frame level metadata