Boxplots and Outside Values
“The normal distribution, for example, is clearly the template for the selection of fence locations.” Dr. James Thompson, The Age of Tukey, Technometrics, August 2001.
The basic boxplot is a graphical representation of the five-number summary: the minimum, 25th percentile (also called first quartile), 50th percentile (the median, or second quartile), 75th percentile (or third quartile), and maximum values. You don’t need to assume this or that distribution in order to determine these five numbers – just sort the data and determine the appropriate percentiles.
Beyond the basic boxplot, however, is Dr. John Tukey’s exploratory data analysis (EDA) boxplot, which adds the notions of “fences” and “outside values.” An outside value is a value that lies below the lower fence or above the upper fence. Fine, but how are fences defined? First, note that the interquartile range (IQR) is the difference between the 75th and 25th percentiles; that is, IQR = 75th percentile – 25th percentile. Also, note that there are at least two types of fences: inner and outer. Inner fences are defined as: lower inner fence = 25th percentile – 1.5*IQR and upper inner fence = 75th percentile + 1.5*IQR. Outer fences are defined as: lower outer fence = 25th percentile – 3*IQR and upper outer fence = 75th percentile + 3*IQR.
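The fence definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names tukey_fences and outside_values and the sample data are made up for the example, and the "inclusive" quartile method is only one of several percentile conventions, so other software may report slightly different fences on small samples.

```python
from statistics import quantiles


def tukey_fences(data):
    """Return quartiles, IQR, and inner/outer fences for a sample."""
    # Assumption: the "inclusive" quartile convention. Other conventions
    # give slightly different quartiles, and therefore different fences.
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


def outside_values(data, fences):
    """Return the values below the lower fence or above the upper fence."""
    lower, upper = fences
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample with one suspiciously large value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
fences = tukey_fences(sample)
print(fences["inner"])                           # (-3.5, 14.5)
print(outside_values(sample, fences["inner"]))   # [30]
print(outside_values(sample, fences["outer"]))   # [30]
```

Here the value 30 lies beyond the outer fence as well as the inner one, so it would be flagged under either definition.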
If we assume a Gaussian (normal) distribution, how can we interpret these fences? A Gaussian distribution’s 75th percentile corresponds to the mean + 0.6745 standard deviations, and its 25th percentile corresponds to the mean – 0.6745 standard deviations. This means the IQR spans 1.349 standard deviations. The inner fences therefore sit at 0.6745 + 1.5*1.349 = 2.698 standard deviations on either side of the mean and enclose about 99.30% of the distribution, while the outer fences sit at 0.6745 + 3*1.349 = 4.7215 standard deviations on either side of the mean and enclose about 99.9998% of the distribution.
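These Gaussian figures can be checked with Python's standard library. The sketch below works in units of standard deviations from the mean (a standard normal distribution); the variable names and the coverage helper are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q3 = z.inv_cdf(0.75)      # 75th percentile, about 0.6745 SD above the mean
iqr = 2 * q3              # by symmetry, about 1.349 SD
inner = q3 + 1.5 * iqr    # inner fence, about 2.698 SD from the mean
outer = q3 + 3.0 * iqr    # outer fence, about 4.7215 SD from the mean


def coverage(k):
    """Fraction of a Gaussian lying within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)


inner_cov = coverage(inner)   # about 0.9930
outer_cov = coverage(outer)   # about 0.999998

print(f"inner fence: {inner:.4f} SD, coverage {inner_cov:.4%}")
print(f"outer fence: {outer:.4f} SD, coverage {outer_cov:.5%}")
```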
What if the data are Gaussian (normal) and the sample size is 1000? 99.3% of 1000 is 993, which suggests that we might see around 7 outside values. If these values are just outside the inner fences, we shouldn’t be surprised. However, only about 0.0002% of a Gaussian distribution lies beyond the outer fences, so in a sample of 1000 even a single such value would be unexpected, and we need to investigate further.
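The arithmetic for a sample of 1000 can be sketched as follows. One simplifying assumption should be stated loudly: the fences here are computed from the true Gaussian quartiles, whereas a real boxplot uses quartiles estimated from the sample, so actual counts will vary somewhat around these figures. The helper name tail_fraction is illustrative.

```python
from statistics import NormalDist

z = NormalDist()          # standard normal
q3 = z.inv_cdf(0.75)
iqr = 2 * q3


def tail_fraction(multiplier):
    """Fraction of a Gaussian beyond the fences at Q1/Q3 -/+ multiplier*IQR.

    Assumption: population quartiles, not quartiles estimated from a sample.
    """
    fence = q3 + multiplier * iqr
    return 2 * z.cdf(-fence)   # both tails, by symmetry


n = 1000
expected_inner = n * tail_fraction(1.5)   # roughly 7 values
expected_outer = n * tail_fraction(3.0)   # roughly 0.002 values

# Chance that at least one of n independent values falls beyond an outer fence.
p_any_outer = 1 - (1 - tail_fraction(3.0)) ** n

print(f"expected beyond inner fences: {expected_inner:.2f}")
print(f"expected beyond outer fences: {expected_outer:.4f}")
print(f"P(at least one beyond outer fences): {p_any_outer:.4f}")
```

Under these assumptions, a value beyond the outer fences should turn up in only a small fraction of one percent of Gaussian samples of this size, which is why it deserves a closer look.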
What if the underlying distribution is not symmetrical (say, lognormal)? A long right tail places far more probability beyond the upper fences than a Gaussian does, so even with relatively small sample sizes you shouldn’t be surprised to see outside values, and possibly even values outside the outer fences.
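To put rough numbers on the lognormal case, the sketch below repeats the Gaussian calculation for one particular lognormal distribution. The parameters (log-scale mean 0 and log-scale standard deviation 1) are an assumption chosen purely for illustration, as are the helper names and the sample size of 50; a different shape parameter changes the fractions considerably, and the fences again come from population quartiles.

```python
from math import exp, log
from statistics import NormalDist

z = NormalDist()
sigma = 1.0   # assumed log-scale standard deviation (log-scale mean is 0)

# Lognormal quartiles are the exponentials of the Gaussian quartiles.
q1 = exp(sigma * z.inv_cdf(0.25))
q3 = exp(sigma * z.inv_cdf(0.75))
iqr = q3 - q1


def fraction_above_upper(multiplier):
    """Fraction of the lognormal above Q3 + multiplier*IQR."""
    fence = q3 + multiplier * iqr
    return 1 - z.cdf(log(fence) / sigma)


def fraction_below_lower(multiplier):
    """Fraction of the lognormal below Q1 - multiplier*IQR."""
    fence = q1 - multiplier * iqr
    # A lognormal variable is always positive, so a fence at or below
    # zero has nothing beneath it.
    return z.cdf(log(fence) / sigma) if fence > 0 else 0.0


inner_frac = fraction_above_upper(1.5) + fraction_below_lower(1.5)
outer_frac = fraction_above_upper(3.0) + fraction_below_lower(3.0)

n = 50   # a relatively small, hypothetical sample size
print(f"beyond inner fences: {inner_frac:.1%}, about {n * inner_frac:.1f} of {n}")
print(f"beyond outer fences: {outer_frac:.1%}, about {n * outer_frac:.1f} of {n}")
```

For this choice of parameters, about 8% of the distribution lies beyond the inner fences and about 3% beyond the outer fences, all of it in the upper tail, compared with about 0.7% and 0.0002% for a Gaussian.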