
Definition

The degree to which the data conforms to structural and technical requirements.

Explanation

Integrity targets the fulfillment of structural and technical requirements on the data as a necessary precondition for valid analyses. This comprises both data quality assessments and substantive scientific analyses.

Integrity-related analyses are guided by the question: Do all data comply with pre-specified formats and structures?

Integrity assessments make no reference to the correctness of data values, which is addressed within the dimensions consistency and accuracy. Missing data are targeted only insofar as technically expected data structures are absent from the files or a technically inferior way of coding, such as system missing values, is encountered.
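As an illustration, system missing values (e.g., NaN in a pandas data frame) can be flagged along the following lines; this is a minimal sketch in Python, not part of the framework itself:

```python
import pandas as pd

def find_system_missings(study_df: pd.DataFrame) -> dict:
    """Count system missing values (NaN) per variable. Under the
    framework, missings should instead carry explicit, labeled
    missing value codes rather than system missings."""
    counts = study_df.isna().sum()
    return counts[counts > 0].to_dict()
```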

There is some correspondence between “Integrity” and the “Structuredness” dimension in the Weiskopf et al. 2017 data quality framework, as well as “Conformance” in the Kahn et al. 2016 and Lee et al. 2017 frameworks. However, the latter concepts are wider in scope, as they also target “Consistency”-related aspects. The major distinction between “Integrity” and the other concepts rests in its role within a data quality pipeline workflow.

Example

For a data quality assessment, a study data set with 47 study variables is expected according to the provided metadata file. Yet, the study data set comprises only 40 of the 47 specified variables. This discrepancy is targeted by the indicator “Unexpected data elements” within the domain “Structural data set error”.
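A minimal sketch of such a structural check, assuming the study data are held in a pandas data frame and the expected variable names have been read from the metadata file (all names are illustrative):

```python
import pandas as pd

def check_unexpected_data_elements(study_df: pd.DataFrame,
                                   expected_vars: list) -> dict:
    """Compare observed study variables against those specified in metadata."""
    observed = set(study_df.columns)
    expected = set(expected_vars)
    return {
        # specified in metadata but absent from the data (7 in the example)
        "missing_in_data": sorted(expected - observed),
        # present in the data but not specified in metadata
        "unexpected_in_data": sorted(observed - expected),
    }
```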

In addition, for one of the 40 observed variables, the observed data type (float) does not match the expected data type according to the metadata (integer). This deficiency is addressed by the indicator “Data type mismatch” within the domain “Value format error”.
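The corresponding type check could, in simplified form, look as follows; the mapping from metadata type labels to pandas dtype predicates is an assumption made for illustration:

```python
import pandas as pd
from pandas.api import types as ptypes

# Illustrative mapping of metadata type labels to pandas dtype predicates
TYPE_CHECKS = {
    "integer": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
    "string": ptypes.is_string_dtype,
}

def check_data_type_mismatch(study_df: pd.DataFrame,
                             expected_types: dict) -> dict:
    """Return variables whose observed dtype contradicts the expected type;
    expected_types maps variable names to type labels from the metadata."""
    mismatches = {}
    for var, expected in expected_types.items():
        if var in study_df.columns and not TYPE_CHECKS[expected](study_df[var]):
            mismatches[var] = (expected, str(study_df[var].dtype))
    return mismatches
```

In the example above, a variable expected as integer but delivered as float would be reported as, e.g., {"age": ("integer", "float64")}.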

Guidance

In a data quality assessment pipeline, checks within the “Integrity” dimension are to be conducted first. For this purpose, it should be ensured that appropriate study data and metadata files are available.
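As a sketch of this precondition step (file paths are hypothetical):

```python
from pathlib import Path

def preflight(study_data_path: str, metadata_path: str) -> None:
    """Abort the pipeline early if required input files are unavailable."""
    for path in (study_data_path, metadata_path):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Required input file not found: {path}")

# Run before computing any integrity indicator, e.g.:
# preflight("study_data.csv", "metadata.csv")
```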

In case of findings, data quality reports may no longer be trustworthy, and all encountered issues must be remedied, if possible by adding the absent data structures and data values or by correcting data values. Afterwards, the data quality reporting process should be repeated.

Deficits within the “Integrity” dimension do not necessarily imply lower data quality at the “completeness” or “correctness” level. “Integrity”-related findings commonly lead to additional preprocessing steps to improve data sets and may frequently be resolved.

If “Integrity” issues cannot be resolved using the intended data, missing structures should be added. For example, if expected data elements remain absent after attempted remedies, they should be filled with appropriate missing value codes to obtain valid missing value estimates from indicators within the completeness dimension.
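A minimal sketch of such a remediation step, again assuming pandas study data; the missing value code -9999 is a hypothetical project-specific choice:

```python
import pandas as pd

MISSING_CODE = -9999  # hypothetical project-specific missing value code

def add_absent_variables(study_df: pd.DataFrame,
                         expected_vars: list) -> pd.DataFrame:
    """Add expected but absent variables, filled with the missing value
    code, so completeness indicators yield valid missing value estimates."""
    remedied = study_df.copy()
    for var in expected_vars:
        if var not in remedied.columns:
            remedied[var] = MISSING_CODE
    return remedied
```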

Literature

  • Kahn MG, Callahan TJ, Barnard J, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC). 2016;4(1):1244.

  • Kalton G, Kasprzyk D. The treatment of missing survey data. Survey Methodology. 1986;12(1):1-16.

  • Lee K, Weiskopf N, Pathak J. A framework for data quality assessment in clinical research datasets. AMIA Annu Symp Proc 2017;2017:1080-9.

  • Weiskopf NG, Bakken S, Hripcsak G, Weng C. A Data Quality Assessment Guideline for Electronic Health Record Data Reuse. EGEMS (Wash DC). 2017;5(1):14.