Survey of Outlier Detection Methods for Univariate Data

Timmy Chan
14 min read · Nov 26, 2023


When conducting exploratory data analysis, one of the first steps toward a coherent understanding is outlier detection. Deciding which data points are outliers, and how to accommodate them, depends on the analyst’s subjective judgement. Approximately two hundred years ago, researchers began to shift from labeling or removing outliers through ad hoc decisions based on intuition to developing standardized methods.

The process for detecting outliers is an active area of research. The focus of this literature review is to survey common statistical methods for labeling potential outliers in the extant literature. We also restrict the discussion to unimodal data, since detecting outliers in multimodal data requires much more human intervention, and the goal of this review is to inform the automated labeling of data points as outliers. In this article, we first give an operational definition of outlier along with common causes. In section two, we discuss classical methods. In section three we discuss modern robust extensions, and we end our examination of the literature with estimators of skew and methods for handling non-symmetric data.

Because multiple justifiable choices are available to researchers, the question of how to manage outliers is a source of flexibility in the data analysis. To prevent the inflation of Type I error rates, it is essential to specify how to manage outliers following a priori criteria before looking at the data. For this reason, researchers have stressed the importance of specifying how outliers will be dealt with ‘specifically, precisely, and exhaustively’ [14, 22].

1.1 Search Criteria

We begin with a search on scholar.google.com for “Outlier Detection”, focusing on books and journals in computer science, statistics, engineering, and quality control. From the first ten results, we followed the citations on univariate techniques until we reached saturation. Along the way, we expanded our search criteria to include “univariate” and “anomaly detection”. Further exploration led to the inclusion of “skewed data” and “robust estimators” of scale and skew.

1.2 Definition of Outlier

First, we operationally define an outlier, based on definitions from the American Society for Quality Control [13], as an observation (or subset of observations) that “appears to be inconsistent with the remainder of that set of data” [3], or that “deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [10]. Outliers can be thought of as “a collective to refer to either a contaminant or a discordant observation”, where a discordant observation is “any observation that appears surprising or discrepant to the investigator” and a contaminant is “any observation that is not a realization from the target distribution” [4]. These definitions allow for some subjectivity in researcher decisions, which means that explicitly sharing the procedures used for outlier detection is recommended for full replicability.

Sources of outliers:

  1. Gross recording or measurement errors,
  2. Incorrect distributional assumption,
  3. The data might contain more structure than is being used (for example, subsets of the data may come from different time periods),
  4. Unusual observation that may lead to a useful discovery [13].

1.3 Challenges to Detecting Multiple Outliers

Data sets with multiple outliers or clusters of outliers are subject to masking and swamping effects. Although not mathematically rigorous, the following definitions from Acuna and Rodriguez in their 2004 paper give an intuitive
summary [2, 8]:

  • Masking effect: an outlier is said to mask a second, nearby outlier if the latter can be considered an outlier on its own, but not when it is considered together with the first. Equivalently, after the deletion of one outlier, another instance may emerge as an outlier. Masking occurs when a group of outlying points skews the mean and covariance estimates toward itself, so that the resulting distance of the outlying point from the mean is small.
  • Swamping effect: an outlier is said to swamp another instance if the latter can be considered an outlier only in the presence of the first. In other words, after the deletion of one outlier, the other may become a “good” instance. Swamping occurs when a group of outlying instances skews the mean and covariance estimates toward itself and away from other “good” instances, so that the resulting distance from these “good” points to the mean is large, making them look like outliers.

1.3.1 Multimodality
The methods discussed in this article are restricted to unimodal distributions. Multimodality in a sample, especially in quality control or manufacturing, may hint at inspection errors or contamination of the data, and usually calls for a more in-depth root cause analysis with qualitative methods. Methods for multimodal outlier detection do exist, and methods for inter-modal outliers may be an interesting topic of study for another article.

For practical purposes, an automated test that signals a potentially multimodal distribution can trigger further human intervention or investigation. Hartigan and Hartigan developed the dip test of unimodality in 1985 [9]; implementations are available in R and in Python [1].
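As a quick, minimal sketch (assuming the diptest package from [1] and a made-up bimodal sample), such a screen might look like this:

```python
import numpy as np
import diptest  # Hartigan & Hartigan's dip test, via the diptest package [1]

rng = np.random.default_rng(42)
# Hypothetical bimodal sample: a mixture of two well-separated normals.
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)])

# diptest.diptest returns the dip statistic and a p-value;
# a small p-value is evidence against unimodality.
dip, pval = diptest.diptest(x)
if pval < 0.05:
    print(f"dip={dip:.4f}, p={pval:.4f}: possible multimodality; flag for human review")
else:
    print(f"dip={dip:.4f}, p={pval:.4f}: no strong evidence against unimodality")
```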

2 Classical Methods for Outlier Labeling

2.1 Z-scores
The z-score method is commonly used, where data points more than k standard deviations away from the mean are
considered possible outliers. To measure “the number of standard deviations away from the mean”, each data point
is assigned a z-score defined as

z_i = (x_i − μ) / σ,

where μ is the sample mean and σ is the sample standard deviation.
However, this method assumes that the underlying distribution is Gaussian, X ∼ N(μ, σ²). Normality is often simply assumed by data analysts in the exploratory step, which can lead to unsound arguments and faulty analysis.
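As a minimal sketch of the rule (assuming NumPy, a made-up sample, and the conventional cutoff k = 3, which is a choice rather than something the sources fix), z-score labeling might look like this:

```python
import numpy as np

def zscore_outliers(x: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Boolean mask of points more than k standard deviations from the mean."""
    z = (x - x.mean()) / x.std(ddof=1)  # ddof=1: sample standard deviation
    return np.abs(z) > k

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=3.0, scale=0.2, size=20), 30.0)  # one gross error
print(np.flatnonzero(zscore_outliers(x)))  # with n = 21, the extreme point can exceed z = 3
```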

Furthermore, “although the basic idea of using Z-scores is a good one, they are unsatisfactory because the summaries, mean of X and standard deviation of X, are not resistant. A resistant summary or estimator is not unduly affected by a few unusual observations” [13]. Clearly, the mean and standard deviation of X can be greatly affected by even a single outlier. To see how z-scores are impacted by sample size and outliers, we construct the following extreme case.

2.1.1 Maximum Z-Score Depends on n

Consider the case where x_1 = x_2 = · · · = x_{n−1} = k and x_n = k + nd for some real k and d > 0. Then the mean of the sample becomes k + d:

x̄ = [(n − 1)k + (k + nd)] / n = (nk + nd) / n = k + d;

by this logic, one data point can severely impact x̄. Continuing this case, we also see that for the sample standard deviation:

s² = [(n − 1)(k − (k + d))² + ((k + nd) − (k + d))²] / (n − 1) = [(n − 1)d² + (n − 1)²d²] / (n − 1) = nd²,  so  s = d√n.

Finally, we can see that the maximum z-score depends only on n when we consider the worst-case data point in this construction, x_n = k + nd:

z_n = (x_n − x̄) / s = ((k + nd) − (k + d)) / (d√n) = (n − 1)d / (d√n) = (n − 1)/√n.
This also implies that the z-score has an upper bound that depends only on n, making this method unable to detect any outliers with the |z| > 3 rule for n ≤ 10, since the maximum z-score (n − 1)/√n < 3 for all x ∈ X; thus, other methods must be used when dealing with outliers in small samples [13]. Table 1 shows the theoretical maximum z-scores for sample size n:

Table 1: Theoretical Maximums for Z-Scores

Example 2.1. Consider an example where a decimal point was forgotten in a sample of size 10, shown in Table 2:

Table 2: Example with Just 10 Data Points

While 8895 is clearly a mistake (a forgotten decimal point), the z-score method is incapable of detecting it as an outlier: with n = 10, no z-score can exceed (n − 1)/√n = 9/√10 ≈ 2.846, which is below the usual cutoff of 3.
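To make this concrete, here is a small sketch in which the nine in-range readings are made up (the original table is not reproduced above); only the 8895 entry comes from the example:

```python
import numpy as np

# Hypothetical 10-point sample; 88.95 was mis-keyed as 8895 (decimal point dropped).
x = np.array([87.2, 88.1, 86.9, 89.0, 88.4, 87.7, 88.8, 86.5, 87.9, 8895.0])

z = (x - x.mean()) / x.std(ddof=1)
print(np.round(z, 3))                   # the gross error's z-score stays below 3
print((len(x) - 1) / np.sqrt(len(x)))   # theoretical ceiling for n = 10: about 2.846
```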

3 Robust Estimators of Scale and Outlier Labeling

Robustness of an estimator in statistics is, in layman’s terms, the ability of the estimator to perform well in the presence of some outliers. More precisely, there are two notions of robustness in statistics: resistance and efficiency [6, 16].

  • Resistance means that modifying a small part of the observations, even by a very large amount, does not cause the estimate to change by a large amount as well. This property is usually measured by the so-called breakdown point, defined as the fraction of incorrect observations (with arbitrarily large values) that an estimator can handle before producing an arbitrarily large (incorrect) estimate; a small demonstration follows this list.
  • Efficiency means that an efficient statistic provides an estimate very close to its optimal value when the underlying distribution of the observations is known; from this perspective, a robust statistic provides high efficiency under many different conditions and needs fewer observations than a less efficient one to achieve a given performance.
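As a minimal sketch of resistance (with made-up numbers), corrupting a single observation moves the mean and standard deviation substantially while barely moving the median and the MAD introduced in section 3.2:

```python
import numpy as np

clean = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9])
corrupt = clean.copy()
corrupt[0] = 1000.0  # a single gross error

def mad(x: np.ndarray) -> float:
    """Median absolute deviation (unscaled); see section 3.2."""
    return float(np.median(np.abs(x - np.median(x))))

for name, data in (("clean", clean), ("corrupt", corrupt)):
    print(f"{name:>7}: mean={data.mean():8.2f}  std={data.std(ddof=1):8.2f}  "
          f"median={np.median(data):6.2f}  MAD={mad(data):5.2f}")
```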

3.1 Inter-quartile Range and Tukey’s Boxplot

The interquartile range method is attributed to Tukey in 1977 [20]. One calculates IQR = Q_3 − Q_1, the distance from the first to the third quartile, and any data point x_i satisfying x_i < Q_1 − k(IQR) or x_i > Q_3 + k(IQR) is labeled as an outlier. By convention, k = 1.5 is used to label possible outliers, and k = 3 is used to label “far out” data points. While this method is more robust than z-scores (outliers do not influence the quartiles as strongly), it is limited in that it does not perform well on skewed data.
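A minimal sketch of these fences (assuming NumPy’s default quartile interpolation and a made-up sample; the k = 1.5 and k = 3 conventions are from the text):

```python
import numpy as np

def tukey_fences(x: np.ndarray, k: float = 1.5) -> tuple[float, float]:
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

x = np.array([2.1, 2.4, 2.2, 2.5, 2.3, 2.6, 2.4, 9.0])
lo, hi = tukey_fences(x, k=1.5)           # "possible" outliers
far_lo, far_hi = tukey_fences(x, k=3.0)   # "far out" points
print((x < lo) | (x > hi))
print((x < far_lo) | (x > far_hi))
```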

In 1986, Hoaglin, Iglewicz and Tukey conducted an extensive study of this method [11], and Hoaglin and Iglewicz summarize it in their 1993 text as follows:

Because about one in four box-plots of normal data contains at least one false outlier, the box-plot outlier-labeling rule provides only an exploratory tool. Observations flagged as outside require further study to determine their causes. This rule should not be used alone to declare outside observations as defective. The k = 3 rule, on the other hand, ordinarily does provide a conservative test for outliers. Although the boxplot rule is not the most efficient way to label outliers, it has valuable features. The most important aspect of this rule is that it routinely labels outliers as part of a standard univariate data summary. As a consequence, the investigator gets an early warning to deal with the outside values and attempts to explain why they occurred [13].

Another limitation of box-plots is that they assume near symmetric data. “Note that the box-plot assumes symmetry, since we add the same amount to Q_3 as what we subtract from Q_1. At asymmetric distributions the
usual box-plot typically flags many regular data points as outliers. The skewness-adjusted box plot corrects for this by using a robust measure of skewness in determining the fence” [19].

3.2 Median Absolute Deviation

Let x̃ denote the median of X. The construction of this method originally traces back to Gauss. Essentially, the median is a much more robust estimator of central location, with a breakdown point of 50%, whereas the mean has a breakdown point of 1/n → 0%. Thus, we replace the mean with the median as the measure of central tendency.

Furthermore, we replace the standard deviation in the z-score calculation with the median absolute deviation,

MAD = median(|x_i − x̃|),

and use the modified z-score of x_i,

M_i = 0.6745 (x_i − x̃) / MAD.
Note that, for normally distributed data, E(MAD) → 0.6745σ as n → ∞, which is why this constant is chosen.
Remark: some authors construct this method with

MAD = 1.4825 median(|x_i − x̃|)

and M_i = (x_i − x̃)/MAD, which is equivalent because 0.6745 = 1/1.4825 [18]. Furthermore, implementations often take the absolute value of the scores, since this technique uses |M_i| > 3 as the fence for labeling outliers.
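A minimal sketch of the modified z-score rule, reusing the hypothetical sample from example 2.1 (the 0.6745 constant and the |M_i| > 3 fence are as given above):

```python
import numpy as np

def modified_zscores(x: np.ndarray) -> np.ndarray:
    """M_i = 0.6745 * (x_i - median) / MAD, where MAD = median(|x_i - median|)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

x = np.array([87.2, 88.1, 86.9, 89.0, 88.4, 87.7, 88.8, 86.5, 87.9, 8895.0])
print(np.abs(modified_zscores(x)) > 3)  # unlike the plain z-score, the gross error is flagged
```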

3.3 The Estimator Qn

Croux and Rousseeuw define the Q_n estimator as

Q_n = d · {|x_i − x_j| : i < j}_(k),  where k = (h choose 2) ≈ (n choose 2)/4 with h = ⌊n/2⌋ + 1, and d ≈ 2.2219 makes Q_n consistent for σ at Gaussian distributions.
That is, we take the k-th order statistic of the n-choose-2 inter-point distances. Croux and Rousseeuw constructed an O(n)-space, O(n log n)-time algorithm, which significantly outperforms the naive algorithm requiring O(n²) in both space and time. […] we see that Q_n is considerably more efficient than the MAD, but Q_n loses some of its efficiency at small sample sizes [7, 18].

The fast algorithm is implemented in statsmodels.robust.scale.qn_scale [17].
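A minimal sketch (assuming statsmodels ≥ 0.12, which exposes qn_scale [17]; using Q_n as the scale in a modified-z-score-style rule is an illustrative choice here, not a prescription from the sources):

```python
import numpy as np
from statsmodels.robust.scale import qn_scale

x = np.array([87.2, 88.1, 86.9, 89.0, 88.4, 87.7, 88.8, 86.5, 87.9, 8895.0])

scale = qn_scale(x)                   # robust estimate of the standard deviation
scores = (x - np.median(x)) / scale   # robust, z-score-like scores
print(np.abs(scores) > 3)             # the gross error is flagged; the in-range points are not
```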

4 Asymmetry: Robust Estimator of Skew and Adjusted Boxplot

To understand a sample and identify outliers, we must not only have robust estimates of location (median) and scale (MAD); we must also consider whether the distribution is symmetric. Traditionally, the skewness γ_1 of a random variable X is the third standardized moment:

γ_1 = E[((X − μ) / σ)³].
Similar to the classical estimators of location (mean) and scale (standard deviation), the traditional skewness measures are highly sensitive to outliers.

Many textbooks teach a rule of thumb stating that the mean is right of the median under right skew, and left of the median under left skew. This rule fails with surprising frequency. It can fail in multimodal distributions, or in distributions where one tail is long but the other is heavy. Most commonly, though, the rule fails in discrete distributions where the areas to the left and right of the median are not equal. Such distributions not only contradict the textbook relationship between mean, median, and skew, they also contradict the textbook interpretation of the median [21].

In 2004, Brys, Hubert and Struyf introduced the medcouple (MC) as a robust measure of skewness, which led to the adjusted boxplot proposed by Hubert and Vandervieren in 2008 [5, 12].

4.1 Medcouple

Let X = {x_1, x_2, . . . , x_n} be n independently sampled observations from a continuous univariate distribution.
Without loss of generality, we sort the samples so that x_1 ≤ x_2 ≤ · · · ≤ x_n. Then

MC = median{ h(x_i, x_j) : x_i ≤ x̃ ≤ x_j },  where for x_i ≠ x_j the kernel is  h(x_i, x_j) = ((x_j − x̃) − (x̃ − x_i)) / (x_j − x_i);

for the special cases where x_i = x_j = x̃, the kernel is defined as follows: let m_1 < · · · < m_k be the indices that are tied to the median of X, that is, x_{m_1} = · · · = x_{m_k} = x̃; then

h(x_{m_i}, x_{m_j}) = −1 if i + j − 1 < k,  0 if i + j − 1 = k,  +1 if i + j − 1 > k.
Notice here that the number of zeros added equals the number of data values tied to the median, matching with the intuition that many points equal to the median decrease the skewness of a distribution. The first and third cases are added mainly for computational efficiency in the fast algorithms described by the authors [5].

While the fast algorithm is implemented in R in the robustbase package and in a C extension for Python (robustats), at the time of writing the naïve algorithm is what is implemented in Matlab and in the statsmodels package in Python [15].
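A minimal sketch of computing the medcouple with statsmodels (assuming statsmodels.stats.stattools.medcouple is available, per [15, 17]; note that this is the naïve O(n²) implementation mentioned above, so keep the sample modest):

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

rng = np.random.default_rng(1)
symmetric = rng.normal(size=1000)        # MC should be near 0
right_skewed = rng.lognormal(size=1000)  # MC should be clearly positive

print(medcouple(symmetric))
print(medcouple(right_skewed))
```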

4.2 Adjusted Boxplot

While there have been attempts to modify the boxplot using the lower and upper semi-interquartile ranges (Q_2 − Q_1 for the lower SIQR and Q_3 − Q_2 for the upper SIQR), this proved insufficient in simulations and in practice. Hubert and Vandervieren construct the following adjusted boxplot [12]. One first calculates the medcouple MC (hopefully using the fast algorithm), then uses the following new definitions of the fences:

[Q_1 − 1.5 e^(−4 MC) IQR,  Q_3 + 1.5 e^(3 MC) IQR]   if MC ≥ 0,
[Q_1 − 1.5 e^(−3 MC) IQR,  Q_3 + 1.5 e^(4 MC) IQR]   if MC < 0.

Note that when there is no skew, MC = 0, and we arrive again at the traditional boxplot.
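A minimal sketch of these fences (reusing the statsmodels medcouple above; the exponential constants are as stated in the definition):

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

def adjusted_boxplot_fences(x: np.ndarray, k: float = 1.5) -> tuple[float, float]:
    """Skewness-adjusted fences in the style of Hubert and Vandervieren [12]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)
    if mc >= 0:
        return q1 - k * np.exp(-4 * mc) * iqr, q3 + k * np.exp(3 * mc) * iqr
    return q1 - k * np.exp(-3 * mc) * iqr, q3 + k * np.exp(4 * mc) * iqr

rng = np.random.default_rng(2)
x = rng.lognormal(size=500)  # right-skewed sample
lo, hi = adjusted_boxplot_fences(x)
print(f"fences: ({lo:.2f}, {hi:.2f}); flagged: {int(np.sum((x < lo) | (x > hi)))} points")
```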

5 Conclusion

This short literature review summarizes the classical methods of outlier detection, as well as robust methods for univariate data in extant research. The push towards robust statistics is not confined to engineering and quality control; it has significant applications in psychology and other fields that often deal with non-normal data or small sample sizes. We summarize the procedure, based on the limitations of each statistic, in fig. 1.

Figure 1: Recommended Procedure for Outlier Detection using Robust Statistics

The domain of “Outlier Detection” is an area of active research, especially in machine learning and artificial intelligence for analyzing more complex multivariate data. These robust univariate methods are also direct
inspirations for some of the multivariate techniques. “In many cases multivariable observations can not be detected as outliers when each variable is considered independently. Outlier detection [on these types of datasets] is possible only when multivariate analysis is performed, and the interactions among different variables are compared within the class of data” [8]. The techniques used for multi-dimensional data will be explored in a second literature review in the near future.

References

[1] diptest (Python package). url: https://pypi.org/project/diptest/.
[2] Edgar Acuna and Caroline Rodriguez. “A meta analysis study of outlier detection methods in classification.” In: Technical paper, Department of Mathematics, University of Puerto Rico at Mayaguez 1 (2004), p. 25.
[3] Vic Barnett and Toby Lewis. Outliers in Statistical Data. 2nd ed. New York: John Wiley & Sons, 1984.
[4] R. J. Beckman and R. D. Cook. “Outliers.” en. In: Technometrics 25.2 (May 1983), pp. 119–149. issn: 0040–1706, 1537–2723. doi: 10.1080/00401706.1983.10487840. url: http://www.tandfonline.com/doi/abs/10.1080/00401706.1983.10487840.
[5] G Brys, M Hubert, and A Struyf. “A Robust Measure of Skewness.” en. In: Journal of Computational and Graphical Statistics 13.4 (Dec. 2004), pp. 996–1017. issn: 1061–8600, 1537–2715. doi: 10.1198/106186004X12632. url: http://www.tandfonline.com/doi/abs/10.1198/106186004X12632.
[6] Massimo Cafaro et al. “Fast online computation of the Qn estimator with applications to the detection of outliers in data streams.” In: Expert Systems with Applications 164 (2021), p. 113831. issn: 0957–4174. doi: https://doi.org/10.1016/j.eswa.2020.113831. url: https://www.sciencedirect.com/science/article/pii/S0957417420306424.
[7] Christophe Croux and Peter J. Rousseeuw. “Time-Efficient Algorithms for Two Highly Robust Estimators of Scale.” In: Computational Statistics. Ed. by Yadolah Dodge and Joe Whittaker. Heidelberg: Physica-Verlag HD,
1992, pp. 411–428. isbn: 978–3–662–26811–7.
[8] Charu C. Aggarwal. Data mining: the textbook. 1st edition. New York, NY: Springer Science+Business Media, 2015. isbn: 9783319141411.
[9] J. A. Hartigan and P. M. Hartigan. “The Dip Test of Unimodality.” In: The Annals of Statistics 13.1 (1985), pp. 70–84. doi: 10.1214/aos/1176346577. url: https://doi.org/10.1214/aos/1176346577.
[10] D. M. Hawkins. Identification of Outliers. en. Dordrecht: Springer Netherlands, 1980. isbn: 9789401539968. doi: 10.1007/978-94-015-3994-4. url: http://link.springer.com/10.1007/978-94-015-3994-4.
[11] David C. Hoaglin, Boris Iglewicz, and John W. Tukey. “Performance of Some Resistant Rules for Outlier
Labeling.” en. In: Journal of the American Statistical Association 81.396 (Dec. 1986), pp. 991–999. issn: 0162–1459,
1537–274X. doi: 10.1080/01621459.1986.10478363. url: http://www.tandfonline.com/doi/abs/10.1080/
01621459.1986.10478363 (visited on 10/11/2023).
[12] M. Hubert and E. Vandervieren. “An adjusted boxplot for skewed distributions.” en. In: Computational Statistics
& Data Analysis 52.12 (Aug. 2008), pp. 5186–5201. issn: 01679473. doi: 10.1016/j.csda.2007.11.008. url:
https://linkinghub.elsevier.com/retrieve/pii/S0167947307004434 (visited on 10/11/2023).
[13] Boris Iglewicz and David C. Hoaglin. How to detect and handle outliers. ASQC basic references in quality control
v. 16. Milwaukee, Wis: ASQC Quality Press, 1993. isbn: 9780873892476.
[14] Christophe Leys et al. “How to Classify, Detect, and Manage Univariate and Multivariate Outliers, With
Emphasis on Pre-Registration.” en. In: International Review of Social Psychology 32.1 (Apr. 2019), p. 5. issn:
2119–4130. doi: 10.5334/irsp.289. url: http://www.rips-irsp.com/articles/10.5334/irsp.289/ (visited on 10/11/2023).
[15] Medcouple. en. Page Version ID: 1149316915. Apr. 2023. url: https://en.wikipedia.org/w/index.php?title=Medcouple&oldid=1149316915.
[16] Frederick Mosteller and John W. Tukey. Data analysis and regression. A second course in statistics. 1977.
[17] Josef Perktold et al. statsmodels/statsmodels: Release 0.14.0. May 2023. doi: 10.5281/ZENODO.593847. url: https://zenodo.org/record/593847.
[18] Peter J. Rousseeuw and Christophe Croux. “Alternatives to the Median Absolute Deviation.” en. In: Journal of the American Statistical Association 88.424 (Dec. 1993), pp. 1273–1283. issn: 0162–1459, 1537–274X. doi:
10.1080/01621459.1993.10476408. url: http://www.tandfonline.com/doi/abs/10.1080/01621459.1993.10476408 (visited on 10/11/2023).
[19] Peter J. Rousseeuw and Mia Hubert. “Anomaly detection by robust statistics.” en. In: WIREs Data Mining and Knowledge Discovery 8.2 (Mar. 2018), e1236. issn: 1942–4787, 1942–4795. doi: 10.1002/widm.1236. url:
https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.1236 (visited on 10/11/2023).
[20] John Wilder Tukey. Exploratory data analysis. Addison-Wesley series in behavioral science. Reading, Mass: Addison-Wesley Pub. Co, 1977. isbn: 9780201076165.
[21] Paul T. Von Hippel. “Mean, Median, and Skew: Correcting a Textbook Rule.” en. In: Journal of Statistics Education 13.2 (Jan. 2005), p. 3. issn: 1069–1898. doi: 10.1080/10691898.2005.11910556. url: https://www.tandfonline.com/doi/full/10.1080/10691898.2005.11910556.
[22] Jelte M. Wicherts et al. “Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking.” In: Frontiers in Psychology 7 (Nov. 2016). issn: 1664–1078. doi:
10.3389/fpsyg.2016.01832. url: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.01832/full (visited on 10/11/2023).


Written by Timmy Chan

Professional Software Engineer, Master Mathematician interested in learning and implementing multidisciplinary approaches to complex questions
