Description

A classical but still popular approach to detecting univariate outliers is the boxplot method introduced by Tukey (1977). The boxplot is a simple graphical tool for displaying information about continuous univariate data (e.g., median, lower and upper quartile). Outliers are defined as values lying more than \(1.5 \times IQR\) below the 1st quartile (\(Q_{25}\)) or above the 3rd quartile (\(Q_{75}\)), where \(IQR = Q_{75} - Q_{25}\). The strength of Tukey’s method is that it makes no distributional assumptions and is thus also applicable to skewed or non-mound-shaped data (Seo 2006). Nevertheless, this method tends to flag many measurements that are not true outliers.
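As a rough illustration of Tukey's rule, the following base R sketch flags values outside the two fences. The helper name tukey_outliers is hypothetical and not part of dataquieR, and the quartiles are computed with R's default quantile type, which is an assumption here (Tukey's original hinges can differ slightly).

```r
# Minimal sketch of Tukey's rule; tukey_outliers() is a hypothetical helper,
# not a dataquieR function. Quartiles use R's default quantile type (assumption).
tukey_outliers <- function(x, coef = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr   # lower fence: Q25 - 1.5 * IQR
  upper <- q[2] + coef * iqr   # upper fence: Q75 + 1.5 * IQR
  !is.na(x) & (x < lower | x > upper)
}

x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 20)
x[tukey_outliers(x)]   # only the value 20 lies outside the fences
```

Missing values are never flagged in this sketch; whether that matches the package's handling is not claimed here.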

A somewhat more conservative approach for symmetric and/or normal distributions is the 3 standard deviation (SD) method: any measurement outside the interval \(\bar{x} \pm 3 \times SD\) is considered an outlier.
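A minimal base R sketch of the 3SD rule is shown below. The helper name three_sd_outliers and the toy data are made up for illustration; the sketch uses the sample mean and sample SD, which are themselves affected by the outliers they are meant to detect.

```r
# Minimal sketch of the 3SD rule; three_sd_outliers() is a hypothetical helper.
three_sd_outliers <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- stats::sd(x, na.rm = TRUE)
  !is.na(x) & abs(x - m) > k * s
}

# Toy data: 19 values around 50 plus one extreme value at position 20.
x <- c(rep(c(48, 50, 52), 6), 50, 120)
which(three_sd_outliers(x))   # position of the flagged value
```

Note that with 10 or fewer observations no value can lie more than 3 sample SDs from the sample mean, so the rule is only informative for larger samples.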

Neither of the methods above is ideally suited to skewed distributions. As many biomarkers, such as laboratory measurements, follow skewed distributions, these methods may be insufficient. The approach of Hubert and Vandervieren (2008) adjusts the boxplot for the skewness of the distribution. This approach is implemented in several R packages, such as robustbase, which is used in this implementation of dataquieR.
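For reference, the adjusted fences \([L, U]\) proposed by Hubert and Vandervieren (2008) can be written as follows, where \(MC\) denotes the medcouple, a robust measure of skewness, and \(IQR = Q_{75} - Q_{25}\). The symbols \(L\), \(U\), and \(MC\) are notation introduced here for illustration; values outside \([L, U]\) are flagged.

\[
[L, U] =
\begin{cases}
\left[\, Q_{25} - 1.5\, e^{-4\,MC}\, IQR,\;\; Q_{75} + 1.5\, e^{3\,MC}\, IQR \,\right] & \text{if } MC \ge 0, \\
\left[\, Q_{25} - 1.5\, e^{-3\,MC}\, IQR,\;\; Q_{75} + 1.5\, e^{4\,MC}\, IQR \,\right] & \text{if } MC < 0.
\end{cases}
\]

For a symmetric distribution (\(MC = 0\)) these fences reduce to Tukey's \(Q_{25} - 1.5 \times IQR\) and \(Q_{75} + 1.5 \times IQR\).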

A further, completely heuristic approach is also included. It is based on the assumption that the distances between measurements from the same underlying distribution should be homogeneous. To understand this approach: a) consider an ordered sequence of all measurements; b) calculate the distances between all neighboring measurements; c) larger distances between two neighboring measurements may then indicate a distortion of the data. As the heuristic definition of a large distance, \(1 \times \sigma\) has been chosen.
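The gap heuristic can be sketched in base R as follows. This is only an illustration under stated assumptions: sigma_gap_outliers is a hypothetical helper, \(\sigma\) is estimated by the sample SD, and the choice to flag every value that a large gap separates from the median is an assumption of this sketch. The rule actually used inside dataquieR may differ.

```r
# Minimal sketch of the gap heuristic; sigma_gap_outliers() is hypothetical
# and may differ from dataquieR's internal rule.
sigma_gap_outliers <- function(x, k = 1) {
  ok <- !is.na(x)
  v <- sort(x[ok])                       # a) ordered sequence of measurements
  flagged <- rep(FALSE, length(x))
  if (length(v) < 3) return(flagged)
  gaps <- diff(v)                        # b) distances between neighbors
  big <- which(gaps > k * stats::sd(v))  # c) gaps larger than k * sigma
  if (length(big) == 0) return(flagged)
  med <- stats::median(v)
  # Assumption: values cut off from the median by a large gap are flagged.
  low_cut  <- big[v[big + 1] <= med]     # large gaps lying below the median
  high_cut <- big[v[big] >= med]         # large gaps lying above the median
  lower <- if (length(low_cut)) v[max(low_cut) + 1] else -Inf
  upper <- if (length(high_cut)) v[min(high_cut)] else Inf
  flagged[ok] <- x[ok] < lower | x[ok] > upper
  flagged
}

x <- c(10, 11, 12, 12, 13, 14, 15, 40)
x[sigma_gap_outliers(x)]   # 40 is separated from the rest by a gap of 25
```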

In this way, the acc_robust_univariate_outlier function is an implementation of the Univariate outliers indicator, which belongs to the Unexpected distributions domain in the Accuracy dimension.

For more details, see the user’s manual and the source code.

Usage and arguments

acc_robust_univariate_outlier(
  resp_vars = NULL,
  label_col = NULL,
  study_data = sd1,
  meta_data = md1,
  exclude_roles = NULL,
  n_rules = 4,
  max_non_outliers_plot = 10000
)

The function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.
  • meta_data: mandatory, the data frame containing the study data’s metadata.
  • resp_vars: optional, a character (vector) specifying the measurement variable(s) of interest. The variables must be of float or integer type. If set to NULL, as in the examples below, the complete dataset is analyzed.
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
  • exclude_roles: optional, a character (vector) of variable roles not included.
  • n_rules: optional, the number of rules (default = 4) that an observation must violate to be classified as an outlier.
  • max_non_outliers_plot: optional, an integer (default = 10000) specifying the maximum number of observations not classified as outliers that are used in the plots; relevant for large data to reduce plot size. CAVEAT: if this argument takes effect, the ggplot output will contain fewer observations than the original data.

The function is designed for unimodal data only and does not use thresholds other than defined by the applied methods. See Description for details.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the acc_robust_univariate_outlier function, the columns DATA_TYPE and MISSING_LIST in the metadata are relevant.

VAR_NAMES LABEL MISSING_LIST DATA_TYPE
3 v00002 SEX_0 NA integer
4 v00003 AGE_0 NA integer
6 v01003 AGE_1 NA integer
7 v01002 SEX_1 NA integer
15 v00109 ARM_CIRC_DISC_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 integer
16 v00010 ARM_CUFF_0 99980 | 99987 integer
19 v00013 EXAM_DT_0 NA datetime
24 v00017 LAB_DT_0 NA datetime
26 v00018 EDUCATION_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
27 v01018 EDUCATION_1 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
31 v00022 EATING_PREFS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
32 v00023 MEAT_CONS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
33 v00024 SMOKING_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
34 v00025 SMOKE_SHOP_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer
38 v00029 PREGNANT_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 integer


This example specifies the analyses of univariate outliers for the complete dataset:

univ_outlier_1 <- acc_robust_univariate_outlier(
  resp_vars = NULL,
  label_col = "LABEL",
  study_data = sd1,
  meta_data = md1
)

The summary table of this function is called using univ_outlier_1$SummaryTable.

Variables            Mean    No.records  SD     Median  Skewness  Tukey (N)  3SD (N)  Hubert (N)  Sigma-gap (N)  NUM_acc_ud_outlu  Outliers, low (N)  Outliers, high (N)  GRADING  PCT_acc_ud_outlu
AGE_0                49.91   2940        4.42   50.00   0.00      11         2        11          0              0                 0                  0                   0        0.00
AGE_1                49.87   2940        4.43   50.00   0.00      11         1        11          0              0                 0                  0                   0        0.00
SBP_0                126.52  2561        9.61   127.00  0.00      12         5        12          0              0                 0                  0                   0        0.00
DBP_0                81.29   2544        9.21   81.00   0.00      14         3        14          0              0                 0                  0                   0        0.00
GLOBAL_HEALTH_VAS_0  5.03    2618        2.92   5.00    0.02      0          0        0           0              0                 0                  0                   0        0.00
ARM_CIRC_0           25.03   2657        3.96   25.00   0.00      4          9        4           0              0                 0                  0                   0        0.00
CRP_0                2.89    2699        1.81   2.59    0.16      66         27       12          0              0                 0                  0                   0        0.00
BSG_0                14.86   2686        12.13  11.00   0.33      93         42       93          1              1                 0                  1                   1        0.04
DEV_NO_0             2.76    2692        1.35   3.00    0.00      0          0        0           0              0                 0                  0                   0        0.00
N_CHILD_0            2.50    2336        1.53   2.00    0.33      32         8        173         0              0                 0                  0                   0        0.00
N_INJURIES_0         4.59    2199        2.42   4.00    0.20      38         20       30          0              0                 0                  0                   0        0.00
N_BIRTH_0            3.46    1099        1.77   3.00    0.20      27         5        30          1              1                 0                  1                   1        0.09
N_ATC_CODES_0        2.26    2058        2.73   1.00    0.50      121        39       0           2              0                 0                  0                   0        0.00
ITEM_1_0             3.04    2248        1.76   3.00    0.00      34         12       34          0              0                 0                  0                   0        0.00
ITEM_2_0             2.99    2197        1.70   3.00    0.00      24         5        24          0              0                 0                  0                   0        0.00
ITEM_3_0             3.01    2184        1.72   3.00    0.00      26         7        26          0              0                 0                  0                   0        0.00
ITEM_4_0             3.00    2143        1.72   3.00    0.00      32         8        32          0              0                 0                  0                   0        0.00
ITEM_5_0             6.02    2074        2.37   6.00    0.00      0          0        0           0              0                 0                  0                   0        0.00
ITEM_6_0             5.95    2048        2.37   6.00    0.00      0          0        0           0              0                 0                  0                   0        0.00
ITEM_7_0             6.04    2068        2.40   6.00    0.00      0          0        0           0              0                 0                  0                   0        0.00
ITEM_8_0             5.89    2013        2.40   6.00    0.00      0          0        0           0              0                 0                  0                   0        0.00


The respective plot list is obtained by univ_outlier_1$SummaryPlotList:

Only selected output is shown to reduce the size of this file.

Reduced plot size

In this example, the plot size (or, more accurately, the number of plotted observations) is reduced by setting max_non_outliers_plot = 500. The function samples n = 500 observations from those not classified as outliers. This can reduce plotting times and plot size in rendered documents.
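The idea behind this thinning can be sketched in base R as follows: all flagged observations are kept and only the non-outliers are sampled down. The helper thin_for_plot and the toy data frame are hypothetical and serve only as an illustration; the package's internal sampling may be implemented differently.

```r
# Minimal sketch of thinning non-outliers before plotting;
# thin_for_plot() is a hypothetical helper, not a dataquieR function.
thin_for_plot <- function(df, is_outlier, max_non_outliers_plot = 10000) {
  non_out <- which(!is_outlier)
  if (length(non_out) > max_non_outliers_plot) {
    # sample row positions of non-outliers without replacement
    non_out <- non_out[sample.int(length(non_out), max_non_outliers_plot)]
  }
  # keep every outlier plus the (possibly sampled) non-outliers, in row order
  df[sort(c(which(is_outlier), non_out)), , drop = FALSE]
}

set.seed(1)
df <- data.frame(id = 1:1000, value = c(rep(0, 995), rep(99, 5)))
flag <- df$value == 99
thinned <- thin_for_plot(df, flag, max_non_outliers_plot = 500)
nrow(thinned)   # 500 sampled non-outliers plus the 5 flagged rows
```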

univ_outlier_2 <- acc_robust_univariate_outlier(
  resp_vars = NULL,
  label_col = "LABEL",
  study_data = sd1,
  meta_data = md1,
  max_non_outliers_plot = 500
)

Interpretation

Statistical outliers do not necessarily represent implausible measurements. It is up to the user how outliers are handled.

Algorithm of the implementation

  1. Select all variables of type float or integer in the study data
  2. Remove missing codes from the study data (if defined in the metadata)
  3. Remove measurements deviating from limits defined in the metadata
  4. Identify outliers according to the approaches of Tukey (Tukey 1977), the 3SD method (Saleem et al. 2021), Hubert (Hubert and Vandervieren 2008), and SigmaGap (heuristic)
  5. An output data frame is generated which indicates the number of possible outliers, the direction of the deviations (too low, too high) for all methods, and a summary score which sums up the deviations across the different rules
  6. A scatter plot is generated for all examined variables, flagging observations according to the number of violated rules (step 5).
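The way steps 4 and 5 combine several rules into a summary score can be sketched as follows. This is a simplified base R illustration with only two of the four rules (Tukey and 3SD); the helper count_rule_violations, the rule list, and the toy data are hypothetical, and n_rules is set to 2 here because only two rules are defined.

```r
# Minimal sketch of counting rule violations per observation;
# count_rule_violations() is a hypothetical helper, not a dataquieR function.
count_rule_violations <- function(x, rules) {
  # one logical column per rule, one row per observation
  flags <- vapply(rules, function(rule) rule(x), logical(length(x)))
  rowSums(matrix(flags, nrow = length(x)))
}

# Only two of the four rules are sketched here (assumption for brevity).
rules <- list(
  tukey = function(x) {
    q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    fence <- 1.5 * (q[2] - q[1])
    !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
  },
  three_sd = function(x) {
    !is.na(x) & abs(x - mean(x, na.rm = TRUE)) > 3 * stats::sd(x, na.rm = TRUE)
  }
)

# Toy data: 60 violates only Tukey's rule, 120 violates both rules.
x <- c(rep(c(48, 50, 52), 6), 50, 60, 120)
n_violated <- count_rule_violations(x, rules)
n_rules <- 2
is_outlier <- n_violated >= n_rules
x[is_outlier]
```

With a lower threshold such as n_rules = 1, the value 60 would be classified as an outlier as well, which illustrates how the threshold trades sensitivity against agreement between rules.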

Limitations

This implementation uses several ways to identify outliers but is not comprehensive, i.e., further methods of this kind exist.

This function still has some deficits. For example, the argument n_rules currently considers only the number of violated rules. In a future release, this functionality will be replaced by the possibility to select specific outlier rules. Further, this implementation can be applied to discrete data elements. In some cases this will not make sense, i.e., meaningful application depends on user discretion.

Concept relations

Hubert, M., and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis 52, 5186–5201.
Saleem, S., Aslam, M., and Shaukat, M.R. (2021). A review and empirical comparison of univariate outlier detection methods. Pakistan Journal of Statistics 37.
Seo, S. (2006). A review and comparison of methods for detecting outliers in univariate data sets.
Tukey, J.W. (1977). Exploratory data analysis (Addison-Wesley).