Structural data set error comprises the following indicators: Unexpected data element count, Unexpected data element set, Unexpected data record count, Unexpected data record set, and Duplicates. These data quality indicators can be applied at the data frame level or at the segment level. They are implemented in the functions int_all_datastructure_dataframe and int_all_datastructure_segment, respectively.

Data frame level

Structural data set error at the data frame level can be assessed using:

# Load dataquieR
library(dataquieR)

# Load data
sd1 <- prep_get_data_frame("ship")

# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
meta_data_dataframe <- prep_get_data_frame("dataframe_level") # dataframe_level is another sheet in ship_meta_v2.xlsx

# Apply indicator function
dataframe_structure <- int_all_datastructure_dataframe(
  meta_data_dataframe = meta_data_dataframe,
  meta_data = meta_data_item
)

The function returns a nested list with the elements DataframeTable and DataframeDataList. DataframeTable is intended for reporting, so its results are abbreviated. We therefore focus on the more readable output in DataframeDataList. This list contains six data frames (one or more per indicator), each with a Data frame column that gives the name of the study data frame analyzed. The data frames are:

Unexpected data element count

dataframe_structure$DataframeDataList$`Unexpected data element count`

Check	Data frame	Unexpected elements	Number of elements in data	Number of elements in metadata	Number of mismatches	Percentage of mismatches	GRADING
Elements	ship	TRUE	33	29	4	13.793	1

The columns indicate whether unexpected elements (e.g., variables) were found, the number of elements present in the study data, the number of elements in the metadata, and, if unexpected elements are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided to flag any discrepancy. In this case, GRADING = 1 means that there is a mismatch.
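The mismatch figures in the table above are consistent with a simple calculation: the absolute difference between the two counts, expressed relative to the metadata count (4 / 29 ≈ 13.793%). The base R sketch below reproduces that arithmetic. Note that count_mismatch is a hypothetical helper written for illustration, not a dataquieR function, and that the choice of denominator and the grading rule are assumptions inferred from the numbers shown.

```r
# Hypothetical helper (NOT part of dataquieR): compares an observed count
# with the count expected from the metadata.
count_mismatch <- function(n_data, n_metadata) {
  n_mismatch <- abs(n_data - n_metadata)
  data.frame(
    unexpected   = n_mismatch > 0,
    n_data       = n_data,
    n_metadata   = n_metadata,
    n_mismatch   = n_mismatch,
    # Assumption: the percentage is relative to the metadata count
    pct_mismatch = round(100 * n_mismatch / n_metadata, 3),
    # Assumption: any mismatch is flagged with GRADING = 1
    GRADING      = as.integer(n_mismatch > 0)
  )
}

# Counts taken from the table above: 33 elements in data, 29 in metadata
element_check <- count_mismatch(n_data = 33, n_metadata = 29)
print(element_check)
```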

Unexpected data element set

dataframe_structure$DataframeDataList$`Unexpected data element set`

MISSING	Unexpected data element set: Percentage (0 to 100)	Unexpected data element set: Number	resp_vars	GRADING	Data frame
0	0	0	0	0	ship

If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. Note that the table above shows only zeros because no unexpected elements were identified, so GRADING = 0.
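Conceptually, an element set check compares the variable names in the study data with those declared in the metadata. The following base R sketch illustrates the idea with set differences. The variable names are invented for illustration, and the labels written to the MISSING column are assumptions, not the values dataquieR necessarily uses.

```r
# Toy variable names, invented for illustration only
vars_in_data     <- c("id", "sex", "age", "sbp_0", "dbp_0")
vars_in_metadata <- c("id", "sex", "age", "sbp_0", "height")

# Elements present on one side but not on the other
missing_from_metadata <- setdiff(vars_in_data, vars_in_metadata)
missing_from_data     <- setdiff(vars_in_metadata, vars_in_data)

# One row per affected element; MISSING labels are hypothetical
element_set <- data.frame(
  MISSING = c(
    rep("metadata", length(missing_from_metadata)),
    rep("study data", length(missing_from_data))
  ),
  resp_vars = c(missing_from_metadata, missing_from_data)
)
print(element_set)
```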

Unexpected data record count

dataframe_structure$DataframeDataList$`Unexpected data record count`

Check	Data frame	Unexpected records	Number of records in data	Number of records in metadata	Number of mismatches	Percentage of mismatches	GRADING
Records	ship	FALSE	2154	2154	0	0	0

The columns indicate whether unexpected records were found, the actual number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Here, there is a perfect match, so GRADING = 0.

Unexpected data record set

dataframe_structure$DataframeDataList$`Unexpected data record set`

Check	Data frame	Unexpected records in set	Number of records in data	Number of records in metadata	Number of mismatches	Percentage of mismatches	Expected match type	Actual match type	GRADING
Record set	ship	FALSE	2154	2154	0	0	exact	exact	0

In this data frame, the columns show the actual number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. In this example, GRADING = 0 because no unexpected records were found.

Duplicates

ID duplicates

dataframe_structure$DataframeDataList$Duplicates

Check	Data frame	Any duplicates	Number of duplicates	Percentage of duplicates	GRADING
IDs	ship	FALSE	0	0	0

These results are based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. In this assessment, GRADING = 0 because there are no duplicate IDs.

Row duplicates

dataframe_structure$DataframeDataList$int_sts_dupl_row

Check	Data frame	Any duplicates	Number of duplicates	Percentage of duplicates	GRADING
Duplicates	ship	FALSE	0	0	0

These results are based on row content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. GRADING = 0 in this case because there are no duplicates.
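The difference between ID duplicates and row duplicates can be illustrated in base R with duplicated(). This is only a sketch on an invented toy data frame. How dataquieR counts duplicates internally (for example, whether the first occurrence is included) is not shown here and may differ.

```r
# Toy data, invented for illustration: ids 2 and 4 occur twice,
# but only the two rows with id 4 are identical in every column.
toy <- data.frame(
  id = c(1, 2, 2, 3, 4, 4),
  x  = c("a", "b", "c", "d", "e", "e")
)

# ID duplicates: second and later occurrences of an id
id_dups <- toy$id[duplicated(toy$id)]

# Row duplicates: rows that repeat an earlier row in all columns
row_dups <- toy[duplicated(toy), ]

# Simple summaries in the spirit of the tables above
id_summary <- data.frame(
  any_duplicates = length(id_dups) > 0,
  n_duplicates   = length(id_dups),
  pct_duplicates = round(100 * length(id_dups) / nrow(toy), 3)
)
print(id_dups)
print(row_dups)
print(id_summary)
```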

Segment level

To evaluate Structural data set error at the segment level, we apply the function int_all_datastructure_segment in the following way:

# Load dataquieR
library(dataquieR)

# Load data
sd1 <- prep_get_data_frame("ship")

# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
meta_data_segment <- prep_get_data_frame("segment_level") # segment_level is another sheet in ship_meta_v2.xlsx

# Apply indicator function
segment_structure <- int_all_datastructure_segment(
  study_data = sd1,
  meta_data = meta_data_item,
  meta_data_segment = meta_data_segment
)

The function returns a nested list with the elements SegmentTable, SegmentData and SegmentDataList. SegmentTable is used for reporting purposes, so the results are abbreviated. Hence, here we focus on the readable output from SegmentData and SegmentDataList. SegmentData shows a summary of all the indicators computed per segment:

segment_structure$SegmentData

Segment	Unexpected data record count N (%)	Unexpected data record count (Grading)	Unexpected data record set N (%)	Unexpected data record set (Grading)
INTERVIEW	4 (0.19%)	1	1 (0.05%)	1
INTRO	0 (0%)	0	1 (0.05%)	1
LABORATORY	1640 (328%)	1	1 (0.2%)	0
SOMATOMETRY	1653 (330.6%)	1	1 (0.2%)	1

Note that a Grading is given per indicator and segment to show whether there are data quality issues (Grading = 1) or not (Grading = 0).

SegmentDataList contains a more detailed output with six data frames (one or more per indicator), each with a Segment column, indicating the name of each part of the study. The data frames are the following:

Unexpected data record count

segment_structure$SegmentDataList$`Unexpected data record count`

Segment	Check	Unexpected records	Number of records in data	Number of records in metadata	Number of mismatches	Percentage of mismatches	GRADING
INTRO	Records	FALSE	2154	2154	0	0.000	0
SOMATOMETRY	Records	TRUE	2153	500	1653	330.600	1
INTERVIEW	Records	TRUE	2154	2150	4	0.186	1
LABORATORY	Records	TRUE	2140	500	1640	328.000	1

The table reports the level of the check (in this case only records are relevant), whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. Here, only the INTRO segment agrees with the expectations provided in the metadata and therefore receives GRADING = 0; the other three segments receive GRADING = 1.
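Because the result is an ordinary data frame, the flagged segments can be extracted with standard subsetting. The sketch below rebuilds a reduced copy of the table above by hand so that it runs without dataquieR. With the real result, the same subsetting would presumably be applied to the element of segment_structure$SegmentDataList shown above, assuming its columns are named as displayed.

```r
# Reduced copy of the table above, typed in by hand for illustration
record_count <- data.frame(
  Segment = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  `Number of records in data`     = c(2154, 2153, 2154, 2140),
  `Number of records in metadata` = c(2154, 500, 2150, 500),
  `Number of mismatches`          = c(0, 1653, 4, 1640),
  GRADING = c(0, 1, 1, 1),
  check.names = FALSE  # keep the column names with spaces as displayed
)

# Keep only segments with a data quality issue (GRADING = 1) ...
flagged <- record_count[record_count$GRADING == 1, ]

# ... and sort them by the size of the mismatch, largest first
flagged <- flagged[order(-flagged$`Number of mismatches`), ]
print(flagged$Segment)
```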

ID duplicates

segment_structure$SegmentDataList$Duplicates

Check	Segment	Any duplicates
IDs	INTRO	FALSE
IDs	SOMATOMETRY	FALSE
IDs	INTERVIEW	FALSE
IDs	LABORATORY	FALSE

Row duplicates

segment_structure$SegmentDataList$int_sts_dupl_content

Check	Segment	Any duplicates
Duplicate records	INTRO	FALSE
Duplicate records	SOMATOMETRY	FALSE
Duplicate records	INTERVIEW	FALSE
Duplicate records	LABORATORY	FALSE

Unexpected data element set

segment_structure$SegmentDataList$`Unexpected data element set`

Segment	MISSING	resp_vars
INTRO	NA	NA
SOMATOMETRY	NA	NA
INTERVIEW	NA	NA
LABORATORY	NA	NA

If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. According to the presence of unexpected element sets, a binary GRADING is also provided to flag the discrepancies.
