We don’t always have the luxury of measuring an entire population, so we take samples. For good or for bad, there is plenty of advice when it comes to sampling. Sir R. A. Fisher said that if all the elements in your population are identical, you need only a sample size of one. John Tukey noted that in general you never improve your estimate of variability more than when you go from a sample size of one to a sample size of two. George Runger has said that he’d rather have a small sample of parts collected over a couple of weeks of production than a couple of thousand parts from one batch produced this afternoon. At the risk of making a sweeping generalization, we see a lot of work done with samples of size 30 or so. Is there something magical about a sample size of 30?

With a lot of determination and a little help from some friends (including Fisher), William Gosset (aka “Student”) developed the properties of the t distribution. Gosset’s work at the Guinness brewery in Dublin led him to look closely at what it means to rely on a small sample to estimate the population mean and the population standard deviation (I’m pretty sure he worked more with yeast than he did with beer, by the way). His key observation was that for small samples, there’s uncertainty not only in our estimate of the mean but often considerable uncertainty in our estimate of the standard deviation as well. The probability density function of a t distribution looks similar to that of a normal distribution, but t distributions have (among other things) heavier tails, which account for the uncertainty in the estimate of the standard deviation. As the sample size increases, the t distribution looks more and more like the normal (Z) distribution. In the limit of an infinite sample size, the t is the Z, which makes sense: an infinite sample means we’re looking at the population.
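To make the heavier tails concrete, here is a small simulation sketch (my own illustration, not something from Gosset’s work). It assumes standard normal data, draws many samples of size n, computes the t statistic for each using the sample standard deviation, and counts how often it lands beyond the normal cutoff of ±1.96. The function names, the seed, and the number of repetitions are arbitrary choices.

```python
import math
import random


def t_statistic(sample, mu):
    """t statistic for one sample, using the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu) / math.sqrt(var / n)


def tail_fraction(n, reps=5000, cutoff=1.96, seed=1):
    """Fraction of simulated t statistics with |t| beyond the Z cutoff."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample, 0.0)) > cutoff:
            hits += 1
    return hits / reps


# If t behaved exactly like Z, every fraction would sit near 0.05.
tail_fractions = {n: tail_fraction(n) for n in (5, 30, 100)}
for n, frac in tail_fractions.items():
    print(f"n = {n:3d}: fraction beyond +/-1.96 = {frac:.3f}")
```

For n = 5 the fraction should come out well above 5%, and it should drift back toward 5% as n grows, which is the heavier-tail effect described above.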

Introductory statistics books and classes often discuss Z-tests and t-tests for comparing two sample means. The t-test is appropriate when the sample standard deviation is the best we’ve got as an estimate of the population sigma (actually, you could argue that t-tests are always appropriate), while the Z-test is appropriate when the population standard deviation is known. For my own part, the last time I knew the population standard deviation was when it was given to me in a homework problem in an introductory statistics class, so it seems like the t-test is the way to go. Yet we know there’s more to the story, including the fact that the normal distribution is just so convenient.

Many texts suggest that when the sample size is around 30 you can safely use the normal distribution instead of the t distribution when computing, for example, a confidence interval on the mean (standard deviation unknown, but estimated from the data). An important point to note is that the distribution of sample averages tends toward the normal distribution as the sample size increases (look to the central limit theorem for guidance here). The key, of course, is that we’re talking about a distribution formed from average values, not individual values. A large sample size doesn’t turn (for example) a lognormal distribution into a normal distribution; it just means that the averages of samples drawn from that lognormal distribution follow something that approaches a normal distribution.
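A quick way to see the difference between individual values and averages is to simulate both. The sketch below is only an illustration under assumed parameters: a lognormal distribution with mu = 0 and sigma = 0.5, samples of size 30, and skewness as a rough measure of how far a distribution is from the symmetric normal shape.

```python
import random


def skewness(values):
    """Moment-based sample skewness (0 for a perfectly symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)  # fixed seed so the run is repeatable

# Individual lognormal values: these stay skewed however many we collect.
individuals = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]

# Averages of samples of size 30 drawn from the same lognormal distribution.
sample_size = 30
means = [
    sum(rng.lognormvariate(0.0, 0.5) for _ in range(sample_size)) / sample_size
    for _ in range(2000)
]

skew_individuals = skewness(individuals)
skew_means = skewness(means)
print(f"skewness of individual values:    {skew_individuals:.2f}")
print(f"skewness of averages of 30 values: {skew_means:.2f}")
```

The individual values should remain clearly right-skewed, while the averages should be much closer to symmetric. Closer is not the same as exactly normal, though, which is the "it depends" part of the story.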

Keith Bower has written: “Regarding n = 30, I’m fairly sure the prevalence is due to one of Egon Sharpe Pearson’s papers … which runs thru many simulations to assess robustness of t-tests with regard to some non-symmetry in the underlying distribution. I’m fairly sure it was Shewhart who had recommended … to investigate it, in some correspondence between the two. Statisticians should always preach the ‘it depends’ mantra though, as I’m sure Pearson would be the first to agree with.”

Again, having a sample size of 30 does not somehow magically turn the underlying distribution into a normal distribution. If your data are uniformly distributed or lognormally distributed or gamma or beta or whatever, then individual values are modeled quite differently as compared with a normal distribution. What does happen is that for many cases, a sample size of 30 gets you to a point where the difference between using Z and t is relatively small for such things as confidence intervals of the mean. However, it doesn’t guarantee that your estimate of the mean is somehow spot on. Even with a sample size of 30, a 95% confidence interval based on a sample mean of 100 and sample standard deviation of 10 is ~ 96.3 to 103.7 (if you use Z instead of t, the interval would be 96.4 to 103.6).
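The two intervals quoted above can be reproduced with a short sketch. It uses only the Python standard library, so the t quantile is found numerically (Simpson’s rule for the CDF plus bisection) rather than taken from a statistics package. The helper names `t_pdf`, `t_cdf`, `t_quantile`, and `mean_ci` are my own, and the numerical settings are assumptions that are adequate for this illustration rather than a general-purpose implementation.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using the symmetry of t."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """Quantile by bisection; intended only for 0.5 < p < 1."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_ci(mean, sd, n, conf=0.95, use_t=True):
    """Two-sided confidence interval for the mean, sd estimated from data."""
    p = 1 - (1 - conf) / 2
    crit = t_quantile(p, n - 1) if use_t else NormalDist().inv_cdf(p)
    half_width = crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


t_lo, t_hi = mean_ci(100, 10, 30, use_t=True)
z_lo, z_hi = mean_ci(100, 10, 30, use_t=False)
print(f"t interval: {t_lo:.1f} to {t_hi:.1f}")
print(f"Z interval: {z_lo:.1f} to {z_hi:.1f}")
```

With n = 30, mean 100, and standard deviation 10, this gives roughly 96.3 to 103.7 with t and 96.4 to 103.6 with Z, matching the figures above. Trying n = 5 or n = 10 in `mean_ci` shows how much more the choice matters for small samples.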

By the way, for that same parameter with estimated mean 100 and estimated standard deviation 10, a 95% CI for the capability index Cpk is around 1.00 to 1.67. From a PPM standpoint, that’s around 2700 to about 0.57. To put it nicely, that’s a difference of a few orders of magnitude. In a related vein, Somerville and Montgomery point out that the assumption of normality would lead to an underestimation of 1428 PPM in the case where Cpk is calculated to be 1.00 with the underlying distribution actually t with 30 degrees of freedom.
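The PPM figures follow from the Cpk endpoints if we assume a normally distributed, centered process with two-sided specification limits, so that the nearest limit sits 3 × Cpk standard deviations from the mean. Those assumptions are mine for the sake of the sketch, as is treating the upper endpoint of 1.67 as 5/3. The function name `ppm_from_cpk` is hypothetical.

```python
from statistics import NormalDist


def ppm_from_cpk(cpk):
    """Defective parts per million for a centered normal process.

    Assumes two-sided limits, each 3 * cpk standard deviations from the mean.
    """
    tail = NormalDist().cdf(-3.0 * cpk)  # area beyond one specification limit
    return 2.0 * tail * 1_000_000


ppm_low_end = ppm_from_cpk(1.0)      # lower end of the Cpk interval
ppm_high_end = ppm_from_cpk(5 / 3)   # upper end, taking 1.67 as 5/3
print(f"Cpk = 1.00 -> {ppm_low_end:.0f} PPM")
print(f"Cpk = 1.67 -> {ppm_high_end:.2f} PPM")
```

Under these assumptions the two ends of the interval work out to about 2700 PPM and about 0.57 PPM, which is the spread of several orders of magnitude mentioned above.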

My wife used to work for a relatively large computer products manufacturer. She was in a meeting one day when the topic of proportion defective came up. That is, someone started a discussion about what sort of sample size would be needed to detect a particular proportion defective. For illustration, let’s say they were concerned about one particular kind of defect, and assume 1 out of 50 widgets had this issue. What would you infer about this problem if you took a random sample of 5 widgets? 50? 500? 5000? In the meeting, someone suggested that if you could take a ‘perfect’ sample of say 50 widgets and you saw that 1 of them had the defect then you knew that your proportion defective was exactly 2% in the population. That’s an interesting sentiment, but it is essentially meaningless. Sampling isn’t about perfection, it’s about practical ways of dealing with uncertainty.
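The widget example is easy to put numbers on. The sketch below assumes each widget is independently defective with probability 0.02, so the number of defects in a random sample follows a binomial distribution. It computes the chance of seeing no defects at all, and the chance that the sample proportion comes out at exactly 2%, for the sample sizes mentioned above. The function name `binom_pmf` is my own.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k defects in n independent draws."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


p_defect = 0.02  # assumed true proportion: 1 widget in 50

# Chance that a random sample shows no defects at all.
prob_zero = {n: binom_pmf(0, n, p_defect) for n in (5, 50, 500, 5000)}

# Chance that the sample proportion is exactly 2% (impossible for n = 5).
prob_exact = {n: binom_pmf(n // 50, n, p_defect) for n in (50, 500, 5000)}

for n, prob in prob_zero.items():
    print(f"n = {n:4d}: P(no defects) = {prob:.4f}")
for n, prob in prob_exact.items():
    print(f"n = {n:4d}: P(exactly {n // 50} defects, i.e. 2%) = {prob:.4f}")
```

With 5 widgets you would see no defects about 90% of the time, and with 50 widgets you would see exactly 1 defect only about 37% of the time even though the true rate really is 2%. Seeing 1 in 50 is consistent with a range of underlying proportions, not just 2%.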