yellowbrickstats home | my statistical and research consulting
"Hypothesis testing is admirably suited for the allaying of insecurities which might have been better left unallayed."
In my first few years studying statistics, I thought significance tests were fascinating, and remarkably useful. How nifty that one could take a real-life result in practically any field and test it to see how likely such a thing would occur by chance alone. And I eagerly learned all manner of methods for conducting these tests -- to compare group averages, to evaluate runs or streaks, to assess the strength of associations between variables, and so on.
In those days I never imagined I would become so disenchanted with the way people use these tests. I find that people overwhelmingly focus on significance/nonsignificance and fail to address the natural, more important next question, of How much? or How strong? or How different? It's remarkable how often people stop at the significance test and either gloss over or ignore completely the effect size, the parameter of interest. All over the mass media, in the vast majority of articles reporting quantitative results, this kind of thinking impoverishes discourse, in ways that might not be apparent at first glance.
Just a few examples:
- "Key study links childhood weight to adult heart disease"1 takes up almost 1/4 page of a major newspaper. It provides all manner of details about the study in question as well as portions of an interview with the investigators, but we never learn how much more likely heart disease might be among those who were obese as children. We are left to guess whether it is 60% more or 2% more, and as I'll explain below, we usually do a poor job of guessing.
- "Don't get sick on the weekend"2 implies a striking disparity between weekdays and weekends in mortality rates following emergency-room visits. Careful reading reveals that when the weekday mortality rate is 1.000%, the weekend rate averages...1.026%! If you round to the nearest tenth of a percentage point, you'll see no difference.
- "Disappointing jobs data weighs on stocks"3 is the headline, followed by the explanation: "A surprisingly poor signal on the jobs market sent stocks lower yesterday as investors remained worried about a lack of hiring." The supposedly "significant" drop in question represented 1/20th of 1% of the index's value.
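The arithmetic behind the second example can be sketched in a few lines of Python. This is only an illustration using the two rates quoted above (1.000% and 1.026%). The variable names are invented here, and the "visits per additional death" figure assumes the whole difference is real and causal, which is an assumption rather than something the study is known to claim.

```python
# Rates quoted above for mortality after emergency-room visits.
weekday_rate = 1.000 / 100   # 1.000%
weekend_rate = 1.026 / 100   # 1.026%

# Absolute difference, expressed in percentage points.
abs_diff_points = (weekend_rate - weekday_rate) * 100

# Relative difference: how much higher the weekend rate is, in percent.
rel_diff_percent = (weekend_rate / weekday_rate - 1) * 100

# Rough number of weekend visits per one additional death.
# ASSUMPTION: the entire difference is real and causal.
visits_per_extra_death = 1 / (weekend_rate - weekday_rate)

print(f"absolute difference: {abs_diff_points:.3f} percentage points")
print(f"relative difference: {rel_diff_percent:.1f}%")
print(f"visits per additional death: about {visits_per_extra_death:,.0f}")
print(f"rounded to one decimal: {round(weekday_rate * 100, 1)}% "
      f"vs {round(weekend_rate * 100, 1)}%")
```

Seen this way, the headline's "striking disparity" amounts to a few hundredths of a percentage point.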
I'd like to go into one such article in more detail.
"News flash: New research concludes that the sensationalism sweeping local news is bad for ratings"
So began an extensive piece covering nearly 1 1/2 pages of a Boston Sunday Globe.4 I'll grant that this author, after exploring the nuances, implications, and complications involved in the (unquestioned) finding about TV news ratings differences, finally did reveal the magnitude of this difference -- in a sidebar, paragraph 34 of 35. Before I present that difference, I'd like you to form a picture in your mind of what it might look like. Try to imagine two bell curves, one for the ratings of (the % of households watching) quality TV news stories and one for those of sensational stories. One curve will naturally be shifted to the right of the other, reflecting its significantly higher average.
The answer: the "strong correlation between high quality scores and high ratings" translated, at least in one class of results, into a mean difference of 0.25 ratings points, or 1/4 of 1% of viewing households. My graph below shows the size of this difference; the two narrowly separated dotted lines show the group means. The data are simulated.
Yes, the upper and lower halves look the same. If one approach was "bad for ratings" this is barely visible to the naked eye. There is much, much more variability within each group than between the two groups.
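To put rough numbers on how small this gap is relative to the within-group spread, here is a minimal Python sketch. The 0.25-point difference comes from the article. The within-group standard deviation of 3 ratings points is purely hypothetical (the article does not give one), as are the equal group sizes and the simple two-sided z-test.

```python
from math import sqrt
from statistics import NormalDist

MEAN_DIFF = 0.25   # difference in ratings points, from the article
ASSUMED_SD = 3.0   # HYPOTHETICAL within-group SD; not from the article

std_normal = NormalDist()

# Standardized difference (Cohen's d): depends on the SD, not on sample size.
cohens_d = MEAN_DIFF / ASSUMED_SD

# Chance that a randomly chosen story from the higher-rated group beats a
# randomly chosen story from the other group (normal model assumed).
prob_higher = std_normal.cdf(cohens_d / sqrt(2))


def two_sided_p(n_per_group: int) -> float:
    """Two-sided p-value of a z-test comparing two equal-sized groups."""
    std_error = ASSUMED_SD * sqrt(2 / n_per_group)
    z = MEAN_DIFF / std_error
    return 2 * (1 - std_normal.cdf(abs(z)))


print(f"Cohen's d = {cohens_d:.3f}, P(higher group wins) = {prob_higher:.3f}")
for n in (100, 1_000, 10_000):
    print(f"n per group = {n:>6}: p = {two_sided_p(n):.2g}")
```

Under these assumptions the standardized difference stays fixed while the p-value shrinks as the groups grow, so the same barely visible gap can be "nonsignificant" or "highly significant" depending only on how much data was collected.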
"But why haven't you labeled the numbers on the y-axis?" That gets exactly to my point. The sample size matters for statistical significance, but for the magnitude and thus the meaningfulness of the group difference, it doesn't matter whether the y-scale runs from 0 to 10 or 0 to 10,000. The look of the group difference would be the same.
People tend to overestimate the magnitude of differences or correlations that are reported merely as "significant." The mind's tendency -- need, really -- to create patterns has been extremely well documented by anthropologists and psychologists. You'll see this in the excellent literature on perception of statistics and probability, exemplified by the writings of Tversky, Kahneman, Oakes, and Gigerenzer. If we are given a small tidbit such as "this is greater than that," we naturally create a convincing mental picture involving a substantial difference. If our mind's eye happens to draw paired histograms (as above) in response to such a factoid, well, the distributions don't overlap very much. Similarly, when we hear in shorthand about a correlational finding such as "fish diet linked with higher IQ," we don't imagine some barely visible correlation such as 0.2, but something that would clearly show up on a scatterplot -- a 0.7 or higher.
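The gap between a correlation of 0.2 and one of 0.7 can be made concrete with a small simulation. This is a sketch only: the data are generated from a bivariate normal model, and the function name, sample size, and seed are choices made for this illustration.

```python
import random
from math import sqrt


def simulate_correlation(target_r: float, n: int = 5000, seed: int = 1) -> float:
    """Draw n (x, y) pairs whose population correlation is target_r and
    return the Pearson correlation observed in the sample."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # y shares a fraction target_r**2 of its variance with x.
        y = target_r * x + sqrt(1 - target_r ** 2) * noise
        xs.append(x)
        ys.append(y)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


for r in (0.2, 0.7):
    sample_r = simulate_correlation(r)
    print(f"target r = {r}: shared variance r^2 = {r * r:.2f}, "
          f"sample r = {sample_r:.2f}")
```

Squaring the two values shows the contrast: at r = 0.2 the variables share about 4% of their variance, at r = 0.7 about 49%.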
I surprise myself by how often I fall into the same kind of thinking, even about my own research results. I'll come out with some finding, say, a correlation of .25 between two variables; I'll begin a conversation soberly characterizing it as "a slight but statistically significant correlation"; and within 10 minutes, perhaps to justify the fact that I'm reporting it at all, I'll slide into treating it as a meaningful or even an important connection.
Sometimes, our training in significance testing can lead to a curious contradiction in thinking. Very informally, I asked the following question of five experienced educational or social-science researchers:
What are the approximate odds that Candidate A will beat Candidate B if, in a quintessentially scientific poll, Candidate A leads by 1 percentage point, with a margin of error of +/- 5 points for the lead?
A) The odds are even. The sample lead is not statistically significant and therefore we should discount it.
B) 6 to 5
C) 3 to 2
D) 2 to 1
Most of us are familiar with the kind of reasonable thinking that goes, "without a statistically significant result, we can't put too much faith into there being much of a lead, and we couldn't reliably call it greater than zero." From there, it's unfortunately not such a big slide into the incorrect idea that "without a significant result, we can't say anything about what the lead is." In other words, if our p-value is inconclusive enough (large enough), by this latter distortion we couldn't even say whether Candidate A had anything other than a 50-50 chance of winning. Whereas, if we merely and naively noted the small lead that we observed in the sample, we would strongly tend to believe she has at least somewhat the better chance. Most people untainted with hypothesis testing would affirm this. Who would be willing to give Candidate B even odds in a bet? Or, more telling, to give even odds across a large number of comparable situations? Those who would are under the sway of the same kind of thinking that produces that especially insidious phrase, "statistical dead heat."
But many a competent student of introductory statistics could
- start with the observed lead of +1;
- draw a bell curve with that as best estimate and 2.5 as the standard deviation for the estimate (half the margin of error, since a typical margin of error spans about two standard errors); and
- show that in the resulting curve about 2/3 of the area is to the right of zero.
Alternatively, one could use a standard introductory textbook's Appendix with Area under the Normal Curve to show that
- Candidate A's lead is equivalent to 1/2.5 = 0.4 standard errors from zero (Z = 0.4);
- being 0.4 Z from a hypothesized mean (which, under the null, is zero) places a result at the 66th percentile of that null distribution;
- the one-tailed p-value is 0.34, and
- .66/(.34) is about 2 to 1, which is a good estimate of Candidate A's odds of winning.
(It may not be immediately obvious why this should be a one-tailed test. But it wouldn't make much sense to conduct a two-tailed test, hypothesizing that Candidate A's lead is exactly zero.)
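Both routes to the answer can be checked with Python's standard library. This is a minimal sketch that follows the shortcut used above, treating the standard error as half the margin of error; the variable names are invented for the illustration.

```python
from statistics import NormalDist

lead = 1.0              # Candidate A's lead in the poll, percentage points
margin_of_error = 5.0   # stated margin of error for the lead
std_error = margin_of_error / 2   # the shortcut above (treats 1.96 as 2)

# Route 1: bell curve centered on the observed lead; area to the right of zero.
p_ahead = 1 - NormalDist(mu=lead, sigma=std_error).cdf(0)

# Route 2: Z-score against a null of zero, read off the standard normal curve.
z = lead / std_error
percentile = NormalDist().cdf(z)
one_tailed_p = 1 - percentile

# Odds in favor of Candidate A actually being ahead.
odds = percentile / one_tailed_p

print(f"area right of zero: {p_ahead:.3f}")
print(f"Z = {z:.1f}, percentile = {percentile:.3f}, "
      f"one-tailed p = {one_tailed_p:.3f}")
print(f"odds: about {odds:.1f} to 1")
```

The two routes give the same probability, and the unrounded odds come out a little under 2 to 1, in line with the rounded figures above.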
What 4 out of my 5 researcher-subjects said in essence was "I was going to favor Candidate A's odds, but of course I realize the correct answer is 'Call the odds even, we have no information.'" Which seems tantamount to "I can plainly see that A's odds are better than 50-50, but I will yield to the way I was taught the Significance Test." I was struck by how often the words "of course" came up in deference to the "no information" option.
The contradiction I mentioned? People often use a statement about probability in arguing that they can make no statement about probability. Or only the most rigid, artificial kind of statement. They see the nonsignificant p-value and deny the ability to further interpret the probability or odds of their result. Of course, this becomes less and less tenable the closer the p-value comes to their cutoff of, e.g., .05. Few people would say that they have no information when p = .06. And yet they still might be reluctant to assign any odds. Under the distortion I'm talking about, it is only when p slips below .05 that suddenly some odds other than 50-50 seem to spring into being. And at this point many would be willing to specify those odds as exactly 19:1.
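The lack of any real threshold at .05 is easy to see numerically. The sketch below uses the same informal conversion as the poll example, reading a one-tailed p-value as odds of (1 - p) to p. That reading is an assumption carried over from the example, not a general rule, and the function name is invented for the illustration.

```python
def odds_from_p(p: float) -> float:
    """Odds implied by informally reading a one-tailed p as (1 - p) : p."""
    return (1 - p) / p


for p in (0.34, 0.10, 0.06, 0.05, 0.04):
    print(f"p = {p:.2f} -> about {odds_from_p(p):.1f} to 1")
```

On this reading the odds climb smoothly as p falls; 19 to 1 at p = .05 is just one point on the curve, and nothing special happens as p crosses it.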
Before people take their first course in statistics, certain meaningful questions come naturally. It would seem strange to reduce inherently interesting research topics to artificial, either/or probabilistic statements. Yet in the process of learning significance tests, we have that natural interest beaten down, and what we come to report and discuss about research are the nearly sterile and often misleading results of hypothesis testing, often mixed with misconceptions.
There are prominent voices that for some time have been advocating a change. If you look up the work of Gene Glass, Jacob Cohen, Robert Rosenthal, Laurence Phillips, or Michael Oakes, you'll encounter eloquent arguments for deemphasizing or marginalizing significance testing in favor of better estimating what some quantity is likely to be. This group began making these arguments as early as the 1970s. They have had some effect among those passionate about statistics, as is shown by the recent rise of Bayesian methods among some of the most creative and advanced quantitative thinkers. But among the rank-and-file, and especially among those who draw on statistics only pragmatically, just enough to make their case in psychology or nutrition or what have you, hypothesis testing with its attendant thinking is still considered the essence of quantitative methods.
Do I offer any recommendations? For one thing, when you report research results, try to keep thinking about what sort of information people can best use. Seldom will that be simply a significant/nonsignificant classification. Often it will turn out to be your best estimate of a quantity, accompanied by some sort of interval containing its likely upper and lower limits. That could mean a traditional confidence interval if you're lucky enough to be analyzing random samples or comparing two groups who have been randomly assigned and who are solidly representative of their respective populations. If you're not, hopefully you will use your judgment to create some kind of modified confidence interval -- perhaps using the Bayesian method to arrive at a "credible interval." 5
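As a rough illustration of reporting an estimate with an interval, here is a Python sketch built on the poll example (a lead of 1 point with a standard error of about 2.5). The prior used for the credible interval, centered on zero with a standard deviation of 5 points, is purely hypothetical, and the normal-normal update is only one simple way such an interval might be computed.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)   # about 1.96


def interval_95(estimate: float, std_error: float) -> tuple[float, float]:
    """Symmetric 95% interval around an estimate, assuming normality."""
    half_width = Z95 * std_error
    return estimate - half_width, estimate + half_width


# Traditional confidence interval for the poll lead.
estimate, std_error = 1.0, 2.5
conf_int = interval_95(estimate, std_error)

# Credible interval from a normal prior combined with a normal likelihood.
# HYPOTHETICAL prior: lead centered on 0 with SD 5 (an assumption).
prior_mean, prior_sd = 0.0, 5.0
prior_precision = 1 / prior_sd ** 2
data_precision = 1 / std_error ** 2
post_var = 1 / (prior_precision + data_precision)
post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
cred_int = interval_95(post_mean, post_var ** 0.5)

print(f"estimate {estimate:.1f}, 95% confidence interval "
      f"({conf_int[0]:.1f}, {conf_int[1]:.1f})")
print(f"posterior mean {post_mean:.1f}, 95% credible interval "
      f"({cred_int[0]:.1f}, {cred_int[1]:.1f})")
```

Either interval tells a reader far more than "not significant": it shows the best estimate and how far in either direction the true lead could plausibly lie.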
Most important, don't let anyone drum your curiosity out of you! If your course in significance testing is not making sense, it's quite possible the instructor has glossed over, or failed to thoroughly consider, some of the implications of the method. Teaching statistics is almost universally difficult. I have met intelligent, dedicated people for whom teaching hypothesis testing is so problematic that they avoid engaging in conceptual discussion with students along the way. If you run into this, stay committed to real understanding, and do what it takes to get your questions answered.
"Seize the moment of excited curiosity on any subject to solve your doubts; for if you let it pass, the desire may never return, and you may remain in ignorance."
-- William Wirt (1772 - 1834)
I welcome any comments you'd like to make privately or publicly about this piece.
Technically-minded people who enjoyed this essay may want to follow up by reading Jacob Cohen's "The Earth is Round (p < .05)" or the recent position paper of the American Statistical Association.
1. Rob Stein, Washington Post, December 6, 2007.
2. Sharp AL, Choi H, Hayward RA (2013). Don't get sick on the weekend: an evaluation of the weekend effect on mortality for patients visiting US EDs. Am J Emerg Med. 31(5): 835-7.
3. Boston Globe, August 6, 2010.
4. Drake Bennett, Boston Globe, October 14, 2007.
5. For some time when I reported statistical results based on other-than-random samples, I used a disclaimer such as
While, strictly speaking, inferential statistics are only applicable in the context of random sampling, we follow convention in reporting significance levels and/or confidence intervals as convenient yardsticks even for nonrandom samples. See Michael Oakes's Statistical inference: A commentary for the social and behavioural sciences (NY: Wiley, 1986).
copyright 2008-18 by roland b. stark.