yellow brick road to stats heaven

~ a loose collection of statistical and quantitative research material for fun and enrichment ~

by roland b. stark

critique: occasional commentary on research methods and analyses

yellowbrickstats home     |     my statistical and research consulting

paradoxical reversals after analysis

striking findings on baseball umpires

key issue missing from reporting on harvard race-conscious admissions

ingenious research linking tree cover with student learning

how (not) to assess the effect of images in warning labels

"of poohsticks and p-values: hypothesis testing in the hundred acre wood"

gun control: the right research evidence makes policy decisions easy

a brilliant look at public protest using a natural experiment

how not to attribute causality from statistical results

readmission rates: 58% of variance explained!?

paradoxical reversals after analysis

May 20, 2019

Does it drive you crazy to see two analyses of the same data reaching opposite conclusions? Just discovered "Simpson's Paradox, Lord's Paradox, and Suppression Effects are the same phenomenon – the reversal paradox," by Yu-Kang Tu, David Gunnell, and Mark S. Gilthorpe (Emerging Themes in Epidemiology 5.1, 2008).

Such contradictory results are all too common. It might seem at first that more of X causes an increase in Y, but when we control (or adjust) for Z, we find the opposite! I’m continually interested in ways to better use analysis to understand cause and effect, and to distinguish causation from mere correlation. So it’s important to get a handle on when and why such contradictions can occur, and what’s the best way to interpret them.

The authors methodically explain what conditions can lead to such reversals. They show how each of three types of reversal effects can occur when statistical control is introduced, and they explain how variables’ level of measurement (categorical or continuous) affects the type of reversal that can occur.
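A toy example makes the reversal concrete. The numbers here are invented purely for illustration and do not come from Tu et al.; the sketch only shows that a pooled slope can be positive while the slope within every level of Z is negative.

```python
def slope(xs, ys):
    """Ordinary least-squares slope from regressing y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def split(pairs):
    """Turn a list of (x, y) pairs into separate x and y lists."""
    return [p[0] for p in pairs], [p[1] for p in pairs]


# Made-up (x, y) pairs for two levels of a third variable Z.
group_z0 = [(1, 6), (2, 5), (3, 4)]
group_z1 = [(6, 11), (7, 10), (8, 9)]

pooled_slope = slope(*split(group_z0 + group_z1))  # ignores Z
slope_z0 = slope(*split(group_z0))                 # within Z = 0
slope_z1 = slope(*split(group_z1))                 # within Z = 1

print(f"pooled slope:     {pooled_slope:+.2f}")
print(f"slope within Z=0: {slope_z0:+.2f}")
print(f"slope within Z=1: {slope_z1:+.2f}")
```

Which of the two answers is the right one cannot be read off the arithmetic; it depends on what Z is and how it relates to X and Y in the real world.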

Most important, Tu et al. stress that when we decide whether to control for some confounder, or nuisance variable lurking in the background, we shouldn’t make this decision purely on statistical grounds. It takes sound knowledge of the subject matter in question, and not merely statistical know-how, to design an analysis that will produce solid and believable cause-and-effect results.

“It's easy to lie with statistics; it's easier to lie without them.” Frederick Mosteller

striking findings on baseball umpires

January 7, 2019

An ingenious FiveThirtyEight article by Michael Lopez, Brian Mills, and Gus Wezerek tries to show that "Everyone Wants To Go Home During Extra Innings — Maybe Even The Umps." They find that in extra innings major league umpires, probably unwittingly, change their patterns of ball and strike calls in ways that tend to end the game quickly.

The authors analyzed a sample of roughly 32,000 pitches thrown between 2008 and 2016. They obtained data using Bill Petti’s baseballr package, scraping pitch locations from

I love the fact that they undertook this work, and their nifty data graphic, but I wish it were clearer what question each result answers.

At one point the main question is presented as a) How much umpires tend to favor calls that would hasten an ending, comparing certain extra-inning scenarios vs. ordinary scenarios.

At another point it's stated as b) Strike rates in certain extra-inning scenarios for "teams that are in a position to win vs. teams that look like they’re about to lose."

A third and more complex comparison is implied by c) How umps "changed their behavior in these situations between 2008 and 2016," but I doubt this is what the authors intended to say.

Comments on the article abound, but until we know for sure what each finding means.... Finally, not that statistical significance is the be-all and end-all, but it wouldn't have hurt to run a significance test or two, to let us know just how unusual the differences cited would be if one supposes they occurred by chance.
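As a rough idea of what such a test could look like, here is a minimal sketch of a two-proportion z-test. The counts are hypothetical, not taken from the FiveThirtyEight article, and the test treats pitches as independent, which real pitch data (clustered by umpire, game, and batter) would not satisfy.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two independent proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: called strikes out of borderline pitches taken,
# in a "game could end soon" scenario vs. an ordinary scenario.
z, p_value = two_proportion_z_test(hits_a=560, n_a=1000, hits_b=500, n_b=1000)

print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

A small p-value would say only that a gap this large is unlikely under chance alone; it would not settle which of the questions a), b), or c) the gap answers.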

key issue missing from reporting on harvard's race-conscious admissions

October 13, 2018

I've looked in vain for a good, in-depth treatment of the Harvard case centering on anti-Asian bias. The Oct. 11 New Yorker column by Harvard Law professor Jeannie Suk Gersen introduces the problem but declines to cite a single number. Elsewhere, reporting commonly cites Asian-Americans' outsized percentage of the Harvard student body vs. their percentage of the US population. What I don't see is any source definitively reporting this group's admission rate as compared with other races'--let alone pinpointing that difference when one controls for other relevant factors. That's the crux of the matter.

The Oct. 12 Nell Gluckman article in the Chronicle of Higher Education suffers from this deficiency. So does Colleen Walsh's Aug 31 Harvard Gazette story. Somewhat more helpful is this passage from Julie J. Park's Sep. 24 Inside Higher Ed column:

"According to an expert report filed in the case on the side of Harvard by David Card of the University of California, Berkeley, the admit rate for the Classes of 2014-2019 was 5.15 percent for Asian Americans and 4.91 percent for white applicants who are not recruited athletes, legacies, on a special dean’s list or children of faculty/staff members. It is problematic that white people are more likely to fall into these special categories [....]"

This leaves me to imagine that an apples-to-apples comparison, one which adds back all such special categories for Whites, could yield racial admit-rates that are sharply different, on the order of 12% vs. 5%, or rather similar, such as 7% vs 5%.

More helpful still is the Economist story from June 23. It describes an intriguing result from the plaintiff's consulting economist, Peter Arcidiacono, using an unspecified "statistical model." Controlling for other (unspecified) factors,

"He estimates that a male, non-poor Asian-American applicant with the qualifications to have a 25% chance of admission to Harvard would have a 36% chance if he were white. If he were Hispanic, that would be 77%; if black, it would rise to 95%."

This summary, of course, describes a special, narrow case. The full analysis would presumably cover students from the entire socio-economic spectrum, from all genders, and so on, and those findings could hardly be as striking as these. We can only hope Arcidiacono's methods are given adequate scrutiny. Models purported to be establishing cause and effect, especially those that rely on statistical control, can go awry in so many ways. And they can lead to bizarre conclusions. The late statistician Elazar Pedhazur used to spoof analyses that in effect answered questions akin to "How tall would this corn plant have grown if it had been a tomato plant?"
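To put the quoted probabilities on a common scale, one can convert them to odds and compare each with the 25% baseline. This is only arithmetic on the figures quoted by the Economist; reading the results as constant multipliers would assume a logistic-type model, which the story does not specify.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


baseline = 0.25  # quoted admission chance for the Asian-American applicant
quoted = {"white": 0.36, "Hispanic": 0.77, "black": 0.95}

# Odds ratio of each quoted probability relative to the baseline.
odds_ratios = {group: odds(p) / odds(baseline) for group, p in quoted.items()}

for group, ratio in odds_ratios.items():
    print(f"{group}: odds multiplied by about {ratio:.1f}")
```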

ingenious research linking tree cover with student learning

October 1, 2018

It's heartening to see the original, high-quality research reflected in "Might School Performance Grow on Trees? Examining the Link Between “Greenness” and Academic Achievement in Urban, High-Poverty Schools," a joint project of the U. of Illinois and the U.S. Forest Service. Ming Kuo, Matthew H. E. M. Browning, Sonya Sachdeva, Kangjae Lee and Lynne Westphal have admirably investigated the connection between amount of tree cover around Chicago schools and the extent of student learning in math and reading, while striving to rule out other factors that could explain the variation in student performance.

How unusual among educational research projects to gather data using "Light Detection and Ranging (LiDAR) collected with a scanning laser instrument mounted onto a low-flying airplane"!

One might be impatient to suggest, as I was, that amount of tree cover at school could be serving as a proxy for level of affluence in the neighborhood generally, which would perhaps be a truer cause of achievement level. The authors thought of this too and controlled for it effectively in their sequential regression analysis:

"School Trees contribute uniquely to the prediction of academic achievement even after Neighborhood Trees are statistically controlled for. Neighborhood Trees, however, showed [little relationship with achievement] once School Trees were statistically controlled for. These findings suggest School Trees are stronger drivers of academic performance than other types of greenness, including grass cover and trees in surrounding neighborhoods."

I also recommend this article for its intelligent Limitations section.
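For readers unfamiliar with the logic of "contributes uniquely after controlling for," here is a small simulation sketch. The data are simulated, not the Chicago data, and the variable names and effect sizes are assumptions chosen so that school trees drive achievement while neighborhood trees are merely correlated with school trees. The unique contribution is computed as the squared semipartial correlation, which equals the gain in R-squared when a predictor is added to a model already containing the control.

```python
import random


def mean(v):
    return sum(v) / len(v)


def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def residuals(y, x):
    """Residuals of y after a simple linear regression on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def unique_r2(outcome, predictor, control):
    """Gain in R-squared from adding `predictor` to a model holding `control`."""
    return corr(outcome, residuals(predictor, control)) ** 2


rng = random.Random(0)
n = 2000
# Simulated, standardized quantities (all hypothetical).
school_trees = [rng.gauss(0, 1) for _ in range(n)]
neighborhood_trees = [0.7 * s + rng.gauss(0, 0.7) for s in school_trees]
achievement = [0.5 * s + rng.gauss(0, 1) for s in school_trees]

unique_school = unique_r2(achievement, school_trees, neighborhood_trees)
unique_neighborhood = unique_r2(achievement, neighborhood_trees, school_trees)

print(f"unique R^2, school trees:       {unique_school:.3f}")
print(f"unique R^2, neighborhood trees: {unique_neighborhood:.3f}")
```

In this setup the asymmetry is built in by construction, so the sketch illustrates the pattern the authors report rather than testing it.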

how (not) to assess the effect of images in warning labels

June 29, 2018

"Ours is the first study to evaluate the effectiveness of sugary drink warning labels," touts Grant Donnelly, a lead author of a joint study by the Harvard Business School and Harvard University Behavioral Insights Group. Kudos for their smart approach to testing the effect of images as part of those warning labels (objective measures showed that images indeed brought about the desired reduction in purchases).

But shame on the researchers for ignoring or missing decades of psychological and behavioral-economics research on the best ways of investigating cause and effect. For the study also incorporated a naive direct question asking participants "how seeing a graphic warning label would influence their drink purchases." An abundant literature, from Nisbett and Wilson (1977) to my own recent article, shows that it would be foolish to trust in such subjective interpretations of the factors behind each person's decision-making process. After acquiring such good, objective information, why would Donnelly et al. water it down with subjective findings that are sure to introduce bias?

UPDATE: the original study materials made available by the authors at Open Science Framework tell a different story than the summary in the Harvard Gazette quoted above. The survey did not ask respondents "how seeing a graphic warning label would influence their drink purchases." Instead, the survey asked for reactions to the images and then asked about intention to buy a soft drink -- each topic much more amenable to unbiased reporting by a participant than the causal assessment would be. The responses would then be linked "in the back end" by the researchers to investigate any causal link. A good design.

"of poohsticks and p-values: hypothesis testing in the hundred acre wood"


March 13, 2018

Just discovered Eric D. Nordmoe's fun and informative creation from 2004. "A walk through Milne's Enchanted forest leads to an unexpected encounter with hypothesis testing." This enjoyable little article is instructive for those new to statistics and full of pleasing connections for the initiated.

gun control: the right research evidence makes policy decisions easy

March 12, 2018

Suppose a nationally-scaled, 30-year, multiple-author, peer-reviewed, non-partisan, public-health-oriented study concluded the following: "Where guns are more widely available, no more of the burglars and intruders are getting shot, but more of the gun-owners' family and friends are."

This is the central finding of "The Relationship Between Gun Ownership and Stranger and Nonstranger Firearm Homicide Rates in the United States, 1981–2010." The authors explain, "Our models consistently failed to uncover a robust, statistically significant relationship between gun ownership and stranger firearm homicide rates (Tables 3 and 4). All models, however, showed a positive and significant association between gun ownership and nonstranger firearm homicide rates." They add: "for each 1 percentage point increase in the gun ownership proxy, [stranger firearm homicide rates stayed the same, whereas] nonstranger firearm homicide rates increased by 1.4%. [Similarly,] a 1 standard deviation increase in gun ownership [13.8%] was associated with a 21.1% increase in the nonstranger firearm homicide rate."
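The two quoted figures can be checked against each other with a line of arithmetic. The sketch assumes the 1.4% effect compounds multiplicatively across percentage points, as it would in a log-linear count model; that reading is an assumption here, not something the quotation states.

```python
per_point_increase = 0.014   # quoted: 1.4% per 1 percentage point of the proxy
sd_in_points = 13.8          # quoted: 1 standard deviation = 13.8 points

# Multiplicative (compounding) reading: 1.014 raised to the 13.8th power.
compounded_pct = ((1 + per_point_increase) ** sd_in_points - 1) * 100

# Simple additive reading, for comparison.
additive_pct = per_point_increase * sd_in_points * 100

print(f"compounded: {compounded_pct:.1f}%")
print(f"additive:   {additive_pct:.1f}%")
```

The compounded figure lands close to the quoted 21.1%, while the additive one falls short, so the two quoted numbers are consistent under the multiplicative reading.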

The research is very sound.

  • Siegel, Negussie, Vanture, Pleskunas, Ross, and King paid close attention to the validity of the indicators they used, and they made intelligent use of a proxy when a direct measurement was not available. For their main predictor, "the annual prevalence of household firearm ownership in a given state," they substituted the percentage of suicides committed using a firearm, and they clearly explained why this would be effective.

  • The authors took great care to isolate the relationship of greatest interest by controlling for nuisance variables.

  • They conducted sensitivity analysis: where a judgment call might result in the choice of one analytic approach or another, they analyzed their data in multiple ways to see how much the results changed. One example of this was their treatment of missing data.

Can you refute their findings?

a brilliant look at public protest using a natural experiment


Sep. 9, 2017

Read Dan Kopf's excellent Quartz summary or the full article by Andreas Madestam, Daniel Shoag, Stan Veuger, and David Yanagizawa-Drott from Harvard and Stockholm Universities. Want to know to what degree political demonstrations produced results in elections? Track the rain. The rain? It actually makes a beautiful example of what's termed an instrumental variable. Whether it rains at protest locations can scarcely have anything directly to do with ultimate election results, but it unquestionably relates to turnout for each demonstration. If the size of turnout relates to election results, then the rain should, statistically (if not causally), relate to them as well. "If the absence of rain means bigger protests, and bigger protests actually make a difference, then local political outcomes ought to depend on whether or not it rained [on protest days]...As it turns out, protest size really does matter."
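The instrumental-variable logic can be sketched with simulated data. Everything below is hypothetical: the variable names, the effect sizes, and the unobserved "enthusiasm" confounder are assumptions for illustration, and the simple ratio estimator stands in for whatever the authors actually fitted. The point is that a naive regression of votes on protest size is biased by the confounder, while the ratio of covariances with rain recovers the effect that was built in.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


rng = random.Random(1)
n = 20000
true_effect = 0.5  # built-in effect of protest size on votes (assumed)

# Instrument: rain on protest day, unrelated to local politics.
rain = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]
# Unobserved confounder: local enthusiasm raises both turnout and votes.
enthusiasm = [rng.gauss(0, 1) for _ in range(n)]
size = [2.0 - 1.0 * r + e + rng.gauss(0, 1) for r, e in zip(rain, enthusiasm)]
votes = [true_effect * s + 1.0 * e + rng.gauss(0, 1)
         for s, e in zip(size, enthusiasm)]

# Naive regression slope of votes on size (confounded).
ols_estimate = cov(size, votes) / cov(size, size)
# Instrumental-variable estimate: effect of rain on votes
# divided by effect of rain on size.
iv_estimate = cov(rain, votes) / cov(rain, size)

print(f"true effect:   {true_effect:.2f}")
print(f"naive OLS:     {ols_estimate:.2f}")
print(f"IV using rain: {iv_estimate:.2f}")
```

The estimate is only as good as the claim that rain affects election results through protest size and nothing else, which is the subject-matter judgment the article has to defend.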

how not to attribute causality from statistical results

Sep. 9, 2017

[From a major outlet for health care research findings, Fierce Health Care. I've reproduced a key passage and added my comments inline in square brackets.]

Employment status is the top socioeconomic factor affecting 30-day [US hospital] readmissions for heart failure, heart attacks or pneumonia, according to a new study from Truven Health Analytics.

[Such a conclusion is on very shaky ground, as you'll see.]

As readmission penalties reach record highs, analyzing causes is more important than ever.


Researchers, led by David Foster, Ph.D., collected 2011 and 2012 data from the Centers for Medicare & Medicaid Services and used a statistical test called the Variance Inflation Factor (VIF) for correlations among the nine factors in the Community Need Index (CNI): elderly poverty, single parent poverty, child poverty, uninsurance, minority, no high school, renting, unemployment and limited English.

[In truth, the VIF tells not what is the most important factor, but only to what extent the different factors, or independent variables, overlap with one another, potentially confounding the results. In this case, trying to isolate one indicator of socioeconomic status (SES) while controlling for eight others will surely distort the connection between any of these indicators and the outcome. These SES indicators are too much "part and parcel of" one another, too inseparable, to allow for valid use of control in this way. It's a mistake to ask "How much does SES (indicator 1) relate to readmission if we statistically remove SES (indicators 2-9) from the relationship?" Much like saying, "How addicted am I to desserts if you discount my intake of cookies, pie, and ice cream?" Or there's Monty Python's "Apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh-water system, and public health, what have the Romans ever done for us?"]
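[A minimal sketch of what the VIF measures, using simulated data. The two indicators below are hypothetical stand-ins driven by a shared SES factor, and with only two predictors the VIF reduces to 1 / (1 - r^2). Note that no readmission outcome appears anywhere in the calculation, so the VIF cannot rank factors by importance.]

```python
import random


def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors."""
    r = corr(x1, x2)
    return 1.0 / (1.0 - r * r)


rng = random.Random(2)
n = 1000
# Hypothetical shared socioeconomic factor behind both indicators.
ses = [rng.gauss(0, 1) for _ in range(n)]
unemployment = [s + rng.gauss(0, 0.5) for s in ses]
no_high_school = [s + rng.gauss(0, 0.5) for s in ses]

vif = vif_two_predictors(unemployment, no_high_school)
print(f"VIF = {vif:.2f}")
```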

Their analysis found unemployment and lack of high school education were the only statistically significant factors in connection with readmissions, carrying a risk of 18.1 percent and 5.3 percent, respectively, according to the study.

[As explained above, these are not valid conclusions to be drawn. But even if the numbers were somehow accurate, what could such statements mean? That readmission risk becomes on average 5.3% for non-high-school graduates? Can't be -- way too low. That it's 5.3 points higher than it would be otherwise? Can't be -- too high. 5.3% higher in relative terms? Maybe, but that would hardly merit calling high school education an important factor. So what's left?]

readmission rates: 58% of variance explained!?


Nov. 18, 2015

"Fifty-eight percent of national variation in hospital readmission rates was explained by the county in which the hospital was located," announce Jeph Herrin et al. in Community Factors and Hospital Readmission Rates, published in 2014 in Health Services Research. Sound odd to you? After all, for most readmission studies the percent explained is in single digits. Being able to account for 4 or 5% of the variation translates to an ability to assess individual risk that can meaningfully aid in clinical decisions. Even Harlan Krumholz and his team of 17 researchers and statisticians, the ones whose predictive models form the basis for the national readmission penalty system imposed by Medicare, have usually only explained 3-8%. And those models have taken into account about 50 input variables.

It turns out that Herrin et al. took their data on 4,073 hospitals and broke it down by 2,254 counties. There were almost as many counties as hospitals themselves. And many counties contained only a single hospital.

Now, suppose the authors had divided the 4,073 hospitals into, say, 4 groups defined by region, and found that the 4 groups had sizeable differences in readmission rate. That would have been a meaningful way to summarize the data. Even if they had formed somewhat more groups -- say, one for each of the 50 states -- that might have been meaningful, though the data would have been spread pretty thin for some states. But to "explain" differences using 2,254 groups? It's not a far cry from simply listing the readmission rates of all 4,073 hospitals and claiming victoriously to have "explained" 100% of the variance in the hospital-to-hospital rate. Sounds like a feat for Captain Obvious.
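A quick simulation shows how much variance sheer group count can "explain." The sketch assigns pure-noise readmission rates to 4,073 hospitals and computes the between-group share of variance for 2,254 counties and for 4 regions. The county assignment is hypothetical, and treating "explained" as a plain between-group share is an assumption; the model Herrin et al. actually fitted may handle this differently.

```python
import random
from collections import defaultdict


def share_between_groups(values, groups):
    """Between-group sum of squares as a share of the total sum of squares."""
    grand_mean = sum(values) / len(values)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(counts[g] * (totals[g] / counts[g] - grand_mean) ** 2
                     for g in counts)
    return ss_between / ss_total


rng = random.Random(3)
n_hospitals = 4073
n_counties = 2254

# Hypothetical assignment: every county gets one hospital,
# and the remaining hospitals are scattered at random.
counties = list(range(n_counties)) + [
    rng.randrange(n_counties) for _ in range(n_hospitals - n_counties)
]
regions = [rng.randrange(4) for _ in range(n_hospitals)]

# Readmission "rates" that are pure noise, unrelated to any grouping.
rates = [rng.gauss(0, 1) for _ in range(n_hospitals)]

r2_counties = share_between_groups(rates, counties)
r2_regions = share_between_groups(rates, regions)

print(f"share 'explained' by 2,254 counties: {r2_counties:.0%}")
print(f"share 'explained' by 4 regions:      {r2_regions:.1%}")
```

With noise alone, the expected share is roughly (groups - 1) / (hospitals - 1), so the county figure comes out above one half before any real geographic effect enters the picture.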

One reason why this matters a great deal is that, to the extent that some geographic factor is considered responsible for this outcome, hospital performance will no longer be considered responsible. So if county in fact explained 58% of the variance, then hospital performance, it might be argued, couldn't account for more than 42%. This is the incorrect conclusion that was reported in unqualified fashion by news outlets such as Becker's Hospital Review.

The article by Herrin and colleagues makes contributions in other ways, of course, but the chief findings are very misleading. Watch for dialogue, in Health Services Research or elsewhere, on how to interpret the results. The upshot should be quite a bit more nuanced and moderated than what we've seen above. And if you're interested in the role of socioeconomic factors in hospital readmission, you'll find information at ReInforced Care, Inc.
