The US Census provides a variety of indicators of socioeconomic status. These can be assembled into a Wealth index and a Poverty index to describe each of 33,000 ZIP codes – in this case using principal components analysis. This creates scales where the US average is 0 and the standard deviation is 1.
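For readers who want to see how an index of this kind can be put together, here is a minimal sketch. It rests on assumptions of mine: the indicator matrix is synthetic stand-in data (not the Census figures), the function name `pca_index` is hypothetical, and the index is taken to be the first principal component of standardized indicators, rescaled to an unweighted mean of 0 and standard deviation of 1. The original analysis may well have differed in its choice of indicators, weighting, or software.

```python
import numpy as np

def pca_index(indicators: np.ndarray) -> np.ndarray:
    """Return first-principal-component scores, rescaled to mean 0 and SD 1."""
    # Standardize each indicator column so no single unit of measure dominates.
    z = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0)
    # SVD of the standardized matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[0]
    # Rescale so the index has mean 0 and standard deviation 1.
    return (scores - scores.mean()) / scores.std()

# Synthetic stand-in data (an assumption, not Census data):
# 1,000 hypothetical ZIP codes and 3 made-up indicators driven by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=1000)
indicators = np.column_stack([
    latent + rng.normal(scale=0.5, size=1000),
    2.0 * latent + rng.normal(scale=1.0, size=1000),
    -latent + rng.normal(scale=0.8, size=1000),
])
wealth_index = pca_index(indicators)
print(round(wealth_index.mean(), 6), round(wealth_index.std(), 6))
```

Note that the sign of a principal component is arbitrary, so in practice one would check that higher scores really do mean more Wealth (or more Poverty) and flip the sign if not.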
Now, simple intuition might tell you that there’s a negative linear relationship between Wealth and Poverty. Correlation would support this to some degree (R = -.247; R² = .061). But even the quickest glance at the scatterplot above should change your thinking. Nor, for that matter, is this essentially a quadratic relationship, in which Poverty would decrease more and more slowly as Wealth rises (R² increases only to .076), or a cubic one (.077; almost no increase at all).
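The comparison of linear, quadratic, and cubic fits can be sketched as follows. The data here are synthetic and purely illustrative, and `poly_r2` is a hypothetical helper of mine, so the numbers it prints will not match the figures quoted above. The one thing it does reproduce is the arithmetic: for a straight-line fit, R² is simply the square of R (as with -.247 squared giving roughly .061).

```python
import numpy as np

def poly_r2(x, y, degree):
    """R-squared of a least-squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption, not the ZIP-code data): a weak negative trend.
rng = np.random.default_rng(1)
wealth = rng.normal(size=2000)
poverty = -0.25 * wealth + rng.normal(size=2000)

r = np.corrcoef(wealth, poverty)[0, 1]
r2_by_degree = {d: poly_r2(wealth, poverty, d) for d in (1, 2, 3)}

print("R =", round(r, 3))
for degree, r2 in r2_by_degree.items():
    print("degree", degree, "R2 =", round(r2, 3))
```

Because each polynomial contains the lower-degree ones as special cases, R² can only stay the same or rise as terms are added. What matters is how much it rises, and a gain as small as the one reported above suggests the extra curvature is not the story.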
As average Wealth increases, what happens to the average level of Poverty? If you compare the zone where Wealth is below -1.0 to the zone between -1.0 and 0, mean Poverty increases sharply; the highest Poverty values fall in this band. The top of the shape – Batman’s right arm in flight – leans to our right. Then, from Wealth of 0 to Wealth of 2.0, Poverty’s mean decreases slightly, and its range drops dramatically. Finally, further towards our right (the “left arm”), mean Poverty increases slightly once more, now with a remarkably narrow range.
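A zone-by-zone reading like this one can be tabulated rather than eyeballed. The sketch below is my own illustration: `zone_summary` is a hypothetical helper, the cut points (-1.0, 0, 2.0) are taken from the description above, and the data are contrived so that the spread is widest in the band just below average. It is not the ZIP-code data and should not be read as reproducing the Batman shape.

```python
import numpy as np

def zone_summary(wealth, poverty, edges=(-1.0, 0.0, 2.0)):
    """Count, mean, and range of poverty within wealth zones split at `edges`."""
    # Zone 0 is below the first edge; zone len(edges) is at or above the last.
    zone = np.digitize(wealth, edges)
    rows = []
    for k in range(len(edges) + 1):
        p = poverty[zone == k]
        if p.size == 0:
            continue  # skip empty zones rather than report undefined values
        rows.append((k, p.size, p.mean(), p.max() - p.min()))
    return rows

# Contrived synthetic data (an assumption): wide spread only for -1 < wealth < 0.
rng = np.random.default_rng(2)
wealth = rng.normal(size=5000)
spread = np.where((wealth > -1.0) & (wealth < 0.0), 2.0, 0.3)
poverty = rng.normal(size=5000) * spread

for k, n, mean, value_range in zone_summary(wealth, poverty):
    print(f"zone {k}: n={n}, mean={mean:.2f}, range={value_range:.2f}")
```

Reporting the range (or a standard deviation) next to each mean is the point of the exercise: two zones can have similar means and still look nothing alike.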
Variability, and we might say heteroscedasticity, is clearly important to this relationship. For the small percentage of points that lie in the region to the left of -1.5 or to the right of +1.5, the mean for Poverty does a good job of describing the data, but everywhere in between there is so much variation that the mean hardly captures the story.
The next graph zooms in a bit, displays the ZIPs in a more granular way, and fits linear and lowess lines to the data, where “lowess” stands for “locally weighted scatterplot smoother.” Lowess is an exploratory, opportunistic alternative to fit lines that are directly determined by linear, quadratic, or cubic equations.
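To make the idea of a locally weighted smoother concrete, here is a stripped-down sketch. It is a simplification of my own, not the routine used for the graph: `simple_lowess` is a hypothetical name, it fits a weighted straight line around each point using tricube weights over the nearest `frac` share of the data, and it omits the robustness iterations that full lowess implementations usually include. The data are synthetic.

```python
import numpy as np

def simple_lowess(x, y, frac=0.3):
    """Locally weighted linear fit at each x (tricube weights, no robustness passes)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    k = max(2, int(np.ceil(frac * n)))  # number of neighbors in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argpartition(dist, k - 1)[:k]  # indices of the k nearest points
        h = dist[idx].max()
        if h > 0:
            w = (1.0 - (dist[idx] / h) ** 3) ** 3  # tricube weights
        else:
            w = np.ones(k)
        # Weighted least squares for a local line: scale rows by sqrt(weight).
        sw = np.sqrt(w)
        design = np.column_stack([np.ones(k), x[idx]]) * sw[:, None]
        beta, *_ = np.linalg.lstsq(design, y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

# Synthetic curved data (an assumption, not the ZIP-code data).
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-2.0, 3.0, size=300))
y = np.sin(x) + rng.normal(scale=0.2, size=300)

smooth = simple_lowess(x, y, frac=0.3)
linear = np.polyval(np.polyfit(x, y, 1), x)
```

A smaller `frac` lets the curve bend more freely and a larger one stiffens it toward a straight line. For real work, an established implementation such as the `lowess` function in statsmodels is the safer choice.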
With this visualization we again see that the linear fit line does a very poor job of describing the pattern. The red lowess fit is better, though again it follows the mean rather than accounting for the variability. No amount of such modeling can account for the central puzzle of these charts: why ZIP-code Poverty takes on such a wide range of values only when ZIP-code Wealth falls in a narrow range just below the US average.
This piece summarizes a presentation at the Analytics Without Borders Conference held in Boston in February 2020, a joint project of Bentley, Bryant, and Tufts Universities and UMass Lowell.
I welcome any comments you might like to add, publicly or privately.
Copyright 2008–2020 by Roland B. Stark.