yellow brick road to stats heaven

~ a loose collection of statistical and quantitative research material for fun and enrichment ~

by roland b. stark

the most insidious statistical mistakes:
how significance tests distort our thinking

Insidious Mistakes - Word Cloud

yellowbrickstats home | my statistical and research consulting

"Hypothesis testing is admirably suited for the allaying of insecurities which might have been better left unallayed."

Jerome Cornfield

In my first few years studying statistics, I thought significance tests were fascinating, and remarkably useful. How nifty that one could take a real-life result in practically any field and test it to see how likely such a thing would be to occur by chance alone. And I eagerly learned all manner of methods for conducting these tests -- to compare group averages, to evaluate runs or streaks, to assess the strength of associations between variables, and so on.

In those days I never imagined I would become so disenchanted with the way people use these tests. I find that people overwhelmingly focus on significance/nonsignificance and fail to address the natural, more important next question, of How much? or How strong? or How different? It's remarkable how often people stop at the significance test and either gloss over or ignore completely the effect size, the parameter of interest. All over the mass media, in the vast majority of articles reporting quantitative results, this kind of thinking impoverishes discourse, in ways that might not be apparent at first glance.

Just a few examples:

I'd like to go into one such article in more detail.

"News flash: New research concludes that the sensationalism sweeping local news is bad for ratings"

So began an extensive piece covering nearly 1 1/2 pages of a Boston Sunday Globe.4 I'll grant that this author, after exploring the nuances, implications, and complications involved in the (unquestioned) finding about TV news ratings differences, finally did reveal the magnitude of this difference -- in a sidebar, paragraph 34 of 35. Before I present that difference, I'd like you to form a picture in your mind of what it might look like. Try to imagine two bell curves, one for the ratings of (the % of households watching) quality TV news stories and one for those of sensational stories. One curve will naturally be shifted to the right of the other, reflecting its significantly higher average.


The answer: the "strong correlation between high quality scores and high ratings" translated, at least in one class of results, into a mean difference of 0.25 ratings points, or 1/4 of 1% of viewing households. My graph below shows the size of this difference; the two narrowly separated dotted lines show the group means. The data are simulated.

News Ratings Graph

Yes, the upper and lower halves look the same. If one approach was "bad for ratings" this is barely visible to the naked eye. There is much, much more variability within each group than between the two groups.
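To put rough numbers on "more variability within each group than between," here is a minimal Python sketch. The 0.25-point difference is the figure quoted from the article; the within-group standard deviation of 2.5 ratings points is purely an assumption made for illustration (the article's actual spread is not given here), as are the function names and the sample sizes. The sketch computes a standardized effect size, the overlap of two equal-spread bell curves, and the p-value from a simple two-group z test at several sample sizes.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1

MEAN_DIFF = 0.25   # ratings points, the difference quoted from the article
ASSUMED_SD = 2.5   # hypothetical within-group SD, NOT taken from the article


def cohens_d(mean_diff, sd):
    """Standardized mean difference: the gap measured in within-group SDs."""
    return mean_diff / sd


def overlap(d):
    """Share of area common to two equal-SD normal curves whose means differ by d SDs."""
    return 2.0 * STD_NORMAL.cdf(-abs(d) / 2.0)


def two_sided_p(mean_diff, sd, n_per_group):
    """z statistic and two-sided p-value for two groups of equal size and equal SD."""
    se = sd * sqrt(2.0 / n_per_group)  # standard error of the difference in means
    z = mean_diff / se
    return z, 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    d = cohens_d(MEAN_DIFF, ASSUMED_SD)
    print(f"effect size d = {d:.2f}, overlap of the two curves = {overlap(d):.1%}")
    for n in (50, 500, 5000):  # hypothetical numbers of stories per group
        z, p = two_sided_p(MEAN_DIFF, ASSUMED_SD, n)
        print(f"n per group = {n:>5}: z = {z:.2f}, p = {p:.4f}")
```

Under the assumed spread, the effect size stays at 0.1 and the two curves overlap by about 96% whatever the sample size; only the p-value moves as n grows. A different assumed SD would change the numbers but not that pattern.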

"But why haven't you labeled the numbers on the y-axis?" That gets exactly to my point. The sample size matters for statistical significance, but for the magnitude and thus the meaningfulness of the group difference, it doesn't matter whether the y-scale runs from 0 to 10 or 0 to 10,000. The look of the group difference would be the same.


People tend to overestimate the magnitude of differences or correlations that are reported merely as "significant." The mind's tendency -- need, really -- to create patterns has been extremely well documented by anthropologists and psychologists. You'll see this in the excellent literature on perception of statistics and probability, exemplified by the writings of Tversky, Kahneman, Oakes, and Gigerenzer. If we are given a small tidbit such as "this is greater than that," we naturally create a convincing mental picture involving a substantial difference. If our mind's eye happens to draw paired histograms (as above) in response to such a factoid, well, the distributions don't overlap very much. Similarly, when we hear in shorthand about a correlational finding such as "fish diet linked with higher IQ," we don't imagine some barely visible correlation such as 0.2, but something that would clearly show up on a scatterplot -- a 0.7 or higher.

I surprise myself by how often I fall into the same kind of thinking, even about my own research results. I'll come out with some finding, say, a correlation of .25 between two variables; I'll begin a conversation soberly characterizing it as "a slight but statistically significant correlation"; and within 10 minutes, perhaps to justify the fact that I'm reporting it at all, I'll slide into treating it as a meaningful or even an important connection.
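A small Python sketch may help keep such correlations in proportion. It computes the share of variance two variables have in common (the square of r) and an approximate two-sided p-value using Fisher's z transformation. The sample sizes are hypothetical and the function names are mine; the p-value is a large-sample approximation, not an exact test.

```python
from math import atanh, sqrt
from statistics import NormalDist


def shared_variance(r):
    """Proportion of variance two variables have in common: r squared."""
    return r * r


def fisher_z_p(r, n):
    """Approximate two-sided p-value for H0: rho = 0, via Fisher's z transformation.

    Large-sample approximation; requires n > 3 and -1 < r < 1.
    """
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    for r in (0.2, 0.25, 0.7):
        print(f"r = {r:.2f}: shared variance = {shared_variance(r):.1%}")
    for n in (30, 100, 1000):  # hypothetical sample sizes
        print(f"r = 0.25, n = {n:>4}: approximate p = {fisher_z_p(0.25, n):.4f}")
```

A correlation of .25 corresponds to about 6% shared variance, against 49% for a correlation of .7. With a hypothetical 100 cases the .25 correlation already clears the .05 threshold, which says something about the sample size and very little about the strength of the connection.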


Sometimes, our training in significance testing can lead to a curious contradiction in thinking. Very informally, I asked the following question of five experienced educational or social-science researchers:

What are the approximate odds that Candidate A will beat Candidate B if, in a quintessentially scientific poll, Candidate A leads by 1 percentage point, with a margin of error of +/- 5 points for the lead?

A) The odds are even. The sample lead is not statistically significant and therefore we should discount it.
B) 6 to 5
C) 3 to 2
D) 2 to 1

Most of us are familiar with the kind of reasonable thinking that goes, "without a statistically significant result, we can't put too much faith into there being much of a lead, and we couldn't reliably call it greater than zero." From there, it's unfortunately not such a big slide into the incorrect idea that "without a significant result, we can't say anything about what the lead is." In other words, if our p-value is inconclusive enough (large enough), by this latter distortion we couldn't even say whether Candidate A had anything other than a 50-50 chance of winning. Whereas, if we merely and naively noted the small lead that we observed in the sample, we would strongly tend to believe she has at least somewhat the better chance. Most people untainted with hypothesis testing would affirm this. Who would be willing to give Candidate B even odds in a bet? Or, more telling, to give even odds across a large number of comparable situations? Those who would are under the sway of the same kind of thinking that produces that especially insidious phrase, "statistical dead heat."

But many a competent student of introductory statistics could

Election Poll Graph

Alternatively, one could use a standard introductory textbook's Appendix with Area under the Normal Curve to show that

(It may not be immediately obvious why this should be a one-tailed test. But it wouldn't make much sense to conduct a two-tailed test, hypothesizing that Candidate A's lead is exactly zero.)
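The normal-curve calculation for the poll question can be sketched in a few lines of Python. This is one way to do it under stated assumptions, not necessarily the exact computation intended above: I assume the +/- 5 points is a 95% margin of error, that the sampling distribution of the lead is normal, and that the area of that curve above zero can be read as the chance that Candidate A is truly ahead (a flat-prior reading). It also treats "ahead in the population polled" as "wins," which ignores turnout and changes of mind. The function names are mine.

```python
from statistics import NormalDist


def prob_truly_ahead(lead, margin_of_error, confidence=0.95):
    """Area above zero of a normal curve centered on the observed lead.

    The margin of error is converted to a standard error by dividing by the
    critical z value for the stated confidence level (about 1.96 for 95%).
    """
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    standard_error = margin_of_error / z_crit
    return 1.0 - NormalDist(mu=lead, sigma=standard_error).cdf(0.0)


def odds_in_favor(p):
    """Convert a probability into odds in favor, expressed as x to 1."""
    return p / (1.0 - p)


if __name__ == "__main__":
    p = prob_truly_ahead(lead=1.0, margin_of_error=5.0)
    print(f"P(Candidate A truly ahead) = {p:.3f}")
    print(f"odds in favor = {odds_in_favor(p):.2f} to 1")
```

Under these assumptions the probability comes to roughly 0.65, or odds somewhat under 2 to 1: well short of certainty, but clearly not a coin flip.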

What 4 out of my 5 researcher-subjects said in essence was "I was going to favor Candidate A's odds, but of course I realize the correct answer is 'Call the odds even, we have no information.'" Which seems tantamount to "I can plainly see that A's odds are better than 50-50, but I will yield to the way I was taught the Significance Test." I was struck by how often the words "of course" came up in deference to the "no information" option.

The contradiction I mentioned? People often use a statement about probability in arguing that they can make no statement about probability. Or only the most rigid, artificial kind of statement. They see the nonsignificant p-value and deny the ability to further interpret the probability or odds of their result. Of course, this becomes less and less tenable the closer the p-value comes to their cutoff of, e.g., .05. Few people would say that they have no information when p = .06. And yet they still might be reluctant to assign any odds. Under the distortion I'm talking about, it is only when p slips below .05 that suddenly some odds other than 50-50 seem to spring into being. And at this point many would be willing to specify those odds as exactly 19:1.


Before people take their first course in statistics, certain meaningful questions come naturally. It would seem strange to reduce inherently interesting research topics to artificial, either/or probabilistic statements. Yet in the process of learning significance tests, we have that natural interest beaten down, and what we come to report and discuss about research are the nearly sterile and often misleading results of hypothesis testing, often mixed with misconceptions.

There are prominent voices that for some time have been advocating a change. If you look up the work of Gene Glass, Jacob Cohen, Robert Rosenthal, Laurence Phillips, or Michael Oakes, you'll encounter eloquent arguments for deemphasizing or marginalizing significance testing in favor of better estimating what some quantity is likely to be. This group began making these arguments as early as the 1970s. They have had some effect among those passionate about statistics, as is shown by the recent rise of Bayesian methods among some of the most creative and advanced quantitative thinkers. But among the rank-and-file, and especially among those who draw on statistics only pragmatically, just enough to make their case in psychology or nutrition or what have you, hypothesis testing with its attendant thinking is still considered the essence of quantitative methods.

Do I offer any recommendations? For one thing, when you report research results, try to keep thinking about what sort of information people can best use. Seldom will that be simply a significant/nonsignificant classification. Often it will turn out to be your best estimate of a quantity, accompanied by some sort of interval containing its likely upper and lower limits. That could mean a traditional confidence interval if you're lucky enough to be analyzing random samples or comparing two groups who have been randomly assigned and who are solidly representative of their respective populations. If you're not, hopefully you will use your judgment to create some kind of modified confidence interval -- perhaps using the Bayesian method to arrive at a "credible interval." 5
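As an illustration of reporting an estimate with its likely limits, here is a minimal Python sketch for a difference between two group means. The data are made up, the function name is mine, and the interval uses the large-sample normal approximation; with groups this small a t-based interval would be somewhat wider, and with nonrandom samples the interval is only the rough yardstick described above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def diff_with_interval(group_a, group_b, confidence=0.95):
    """Difference in means (a minus b) with an approximate confidence interval.

    Uses a normal critical value and unpooled sample variances.
    Returns (estimate, lower limit, upper limit).
    """
    diff = mean(group_a) - mean(group_b)
    se = sqrt(stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return diff, diff - z_crit * se, diff + z_crit * se


if __name__ == "__main__":
    # Hypothetical measurements for two groups, for illustration only.
    group_a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
    group_b = [4.9, 5.0, 5.2, 4.6, 5.3, 4.7]
    est, low, high = diff_with_interval(group_a, group_b)
    print(f"estimated difference = {est:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

A report built this way tells the reader both how large the difference appears to be and how uncertain that estimate is, rather than only whether it crossed a threshold.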

Most important, don't let anyone drum your curiosity out of you! If your course in significance testing is not making sense, it's quite possible the instructor has glossed over, or failed to thoroughly consider, some of the implications of the method. Teaching statistics is almost universally difficult. I have met intelligent, dedicated people for whom teaching hypothesis testing is so problematic that they avoid engaging in conceptual discussion with students along the way. If you run into this, stay committed to real understanding, and do what it takes to get your questions answered.

"Seize the moment of excited curiosity on any subject to solve your doubts; for if you let it pass, the desire may never return, and you may remain in ignorance."

-- William Wirt (1772 - 1834)


I welcome any comments you'd like to make privately or publicly about this piece.

Technically-minded people who enjoyed this essay may want to follow up by reading Jacob Cohen's "The Earth Is Round (p < .05)" or the recent position paper of the American Statistical Association.

1. Rob Stein, Washington Post, December 6, 2007.
2. Sharp AL, Choi H, Hayward RA (2013). Don't get sick on the weekend: an evaluation of the weekend effect on mortality for patients visiting US EDs. Am J Emerg Med. 31(5): 835-7.
3. Boston Globe, August 6, 2010.
4. Drake Bennett, Boston Globe, October 14, 2007.
5. For some time when I reported statistical results based on other-than-random samples, I used a disclaimer such as

While, strictly speaking, inferential statistics are only applicable in the context of random sampling, we follow convention in reporting significance levels and/or confidence intervals as convenient yardsticks even for nonrandom samples. See Michael Oakes's Statistical inference: A commentary for the social and behavioural sciences (NY: Wiley, 1986).

copyright 2008-18 by roland b. stark.
