yellow brick road to stats heaven

~ a loose collection of statistical and quantitative research material for fun and enrichment ~

by roland b. stark

An inventory: 11 issues with value-added studies
(evaluations based on student test scores)

Originally published at WIDE Wonders: Musings by the WIDE World Research Team,
Harvard Graduate School of Education, 2006

Even though most standardized tests for K-12 students were designed with the individual student's learning in mind, we often gravitate towards the use of such scores when we seek to trace the success of a teacher, program, or school. They seem so objective, so unambiguous, so well suited to the task. And often our impulse is to reward teachers whose students score highest, to demand more of the rest, and perhaps to direct more resources their way.

Now at some point it may become apparent that schools and districts with the highest average scores are also those with the most affluent student populations. (Systematic studies consistently find that income accounts for 60-80% of the variation in test scores among different groups.) Realizing this, many will call instead for tests at the start and end of the school year and advocate rewarding the schools and teachers whose students show the greatest improvement. This is an eminently natural position to take. But it turns out to be fraught with an extraordinary number of challenges.

Statistical models for assessing contributions to student learning (value-added models, if you can stomach that awkward yet adhesive term) have received intense scrutiny among educational researchers in the past five years. I became attuned to this literature after I ran into some serious roadblocks in an effort to isolate contributors to reading and math improvement for children in a school touched by Harvard's WIDE World program. I began to see that gathering more complete data and using more sophisticated methods (hierarchical linear models, propensity scores) could solve only some of my problems. And I began to collect my own and more distinguished researchers' impressions of the hurdles one might need to overcome to develop a sound explanatory model of test score change. What follows is a list of these issues. While no study is likely to involve all of them, most will bump up against quite a few.

1. Studies can easily confuse effects from individual students; from being among certain students; from the teacher; from the intervention; and from the school. To what should we compare a certain result: to the results that would have occurred if the student(s) had not been in school at all? If they had been in school, but had stayed in the previous grade? If they had been in a different school? With a different set of classmates? With different teachers? Questions such as these are too often neglected, detracting from the soundness of research claims. (1), (5)

2. Few students study under just one teacher, making it perilous to try to attribute gains or losses to an individual teacher.

3. What students in some classes learn may spill over and reach students in other classes ("contamination").

4. Groups of students do not often stay together long-term, so while students may exert effects on one another - which are difficult enough to measure - these are extremely difficult to track longitudinally. (1), (5)

5. It's difficult to separate past effects (of any of these types - from teacher, school, or set of classmates) from more recent ones. (1)

6. Variations in the policies by which schools assign students to special education or English as a Second Language programs can distort results, as can any pattern of biased exclusion of students from testing. Students not promoted will be left out of any calculation of year-to-year change, when including such students would lower the group score. (According to Walt Haney, this is one source of the spurious "Texas Education Miracle" of the late 1990s.) (1), (2)
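
A small arithmetic sketch of the exclusion problem, in Python. The scores and the exclusion flags below are hypothetical, invented purely for illustration; they are not drawn from Haney's analysis or from any real data.

```python
from statistics import fmean

# Hypothetical scores for one cohort. The flags mark students who were
# retained in grade or otherwise excluded, and so drop out of the
# group's reported results.
scores = [48, 52, 55, 61, 67, 70, 74, 80]
excluded = [True, True, False, False, False, False, False, False]

mean_all = fmean(scores)
mean_tested = fmean([s for s, out in zip(scores, excluded) if not out])

print(f"all students:      {mean_all:.1f}")     # 63.4
print(f"tested only:       {mean_tested:.1f}")  # 67.8
print(f"apparent 'gain':   {mean_tested - mean_all:.1f}")
```

The group average rises by several points even though no student's score changed; the only thing that changed is who was counted.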

7. Including schools that differ too much from one another forces one to extrapolate beyond what the data can support. E.g., suppose that, within schools with 0%-30% limited English-proficient (LEP) students, each difference of 10 percentage points in LEP is linked with an average test score difference of 3 points. That would mean a 3-point score difference between a 0% LEP school and a 10% LEP school, and a 6-point difference between a 0% school and a 20% school. However, for a school with a % LEP far outside that range, such as 60%, that relationship may not hold at all. The slope might get much flatter or much steeper. In such cases trying to adjust or control for % LEP would yield misleading results. (1)
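
The arithmetic of this example can be sketched in Python. The slope (3 points per 10 percentage points) and the 0%-30% range come from the example itself; the function name and the choice to return None outside the observed range are my own illustrative assumptions, not part of any cited method.

```python
# From the example: 3 test-score points per 10 percentage points of LEP,
# a relationship observed only among schools with 0%-30% LEP students.
POINTS_PER_STEP = 3
STEP = 10                 # percentage points of LEP per step
OBSERVED_RANGE = (0, 30)  # % LEP range in which the slope was estimated


def predicted_score_gap(lep_a, lep_b):
    """Predicted score difference between two schools, given their % LEP.

    Returns None when either school lies outside the observed range,
    because the linear relationship is unverified there.
    """
    low, high = OBSERVED_RANGE
    if not (low <= lep_a <= high and low <= lep_b <= high):
        return None
    return POINTS_PER_STEP * abs(lep_a - lep_b) / STEP


print(predicted_score_gap(0, 10))  # 3.0
print(predicted_score_gap(0, 20))  # 6.0
print(predicted_score_gap(0, 60))  # None: 60% is outside the 0%-30% range
```

A regression adjustment has no such guard built in: it will happily apply the same slope to the 60% school, which is exactly the problem.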

8. Thomas Kane and Douglas Staiger have shown that 50-85% of year-to-year variation in group test scores can be attributed simply to yearly fluctuations in the academic levels of incoming student cohorts. In other words, to noise: to something that has nothing to do with the teacher's or program's effectiveness. Differences between student groups within a year are likely to be subject to noise as well. The authors also convincingly show that, because group averages fluctuate much more for small groups, it is the smaller schools that are more apt to suddenly rise to the top or sink to the bottom, netting them undeserved rewards or penalties. Such attention-getting schools almost always end up closer to the middle of the pack the following year, demonstrating the principle of regression to the mean. Their temporary exceptional standing is due not to anything noteworthy such as an instructional change, but only to chance. (3)
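
The small-school point can be illustrated with a toy simulation in Python. This is not Kane and Staiger's data or method: the score distribution (mean 500, SD 100), the school sizes (25 and 400 students), and the function names are all assumptions chosen for illustration. Every cohort is drawn from the same distribution, so any change in a school's average from one year to the next is noise by construction.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def cohort_mean(n_students, mean=500.0, sd=100.0):
    """Average score of one randomly drawn cohort of n_students."""
    return statistics.fmean([random.gauss(mean, sd) for _ in range(n_students)])


def change_volatility(n_students, n_schools=200):
    """Standard deviation, across schools, of the year-to-year change
    in the school average when nothing real has changed."""
    changes = []
    for _ in range(n_schools):
        year1 = cohort_mean(n_students)
        year2 = cohort_mean(n_students)
        changes.append(year2 - year1)
    return statistics.pstdev(changes)


small = change_volatility(n_students=25)
large = change_volatility(n_students=400)
print(f"typical change, 25-student schools:  {small:.1f}")
print(f"typical change, 400-student schools: {large:.1f}")
```

Under these assumptions the small schools' averages swing several times more widely than the large schools', so a ranking by yearly change will tend to put small schools at both extremes.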

9. Student performance in different subjects must be assessed via different instruments. It would be pointless to try to use a single instrument such as the SAT whether testing reading, world history, or advanced placement physics. And different tests vary in their propensity to show change, either because of differences in the relative difficulties of pre- and post-test versions or because of differences in either version's validity and reliability. This complicates any study involving multiple subject areas or multiple grades.

10. Since virtually all standardized tests in education rely to some degree on students' reading ability, value-added research results in all subjects other than reading will be compromised unless all students have achieved a certain minimum reading level. One's ability to think effectively with social science, math, or science material will not be picked up by a test unless that test is properly matched to the student's reading ability. Moreover, group comparisons are potentially invalidated if some groups are more affected by this problem than others.

11. It is often desirable to try to relate student outcomes to some kind of indicator of baseline teaching effectiveness. Some examples are years of teaching experience; type of certification or teacher preparation program; educational degree; professional development points; and experts'/administrators' ratings. Unfortunately, the first two of these have been fairly conclusively shown to be largely unrelated to test score outcomes, based on a recent, very large-scale study in New York City. (4) The other three variables seem unpromising based on WIDE's recent evaluation work, including an unpublished urban school study involving about 25 teachers and 300 students. This is not to say that teacher quality itself does not matter. Indeed, evident from Kane et al.'s recent paper is the very great need for some usable measure that can serve as a proxy for teacher quality.

Rubin, Stuart, and Zanutto (1) and Damian Betebenner (5) make several suggestions that I find to be key for thoughtful research using value-added models. Three seem to be the most important:

  • Randomize to the extent possible.

  • Collect data on as many relevant variables as possible; statistical control of these, while far inferior to equalizing through randomization, is still useful.

  • Be very careful to think through, and make explicit, your assumptions. The best analytical method for a particular study and research question will depend on these assumptions. Example: Is it reasonable to expect that no improvement would occur absent a certain intervention? If so, it makes sense to analyze gain scores, as with analysis of variance. Is it instead reasonable to think that all students would improve to some degree even without the intervention, and that their posttest score could be predicted as a linear function of their pretest score? If so, analysis of covariance would make sense.
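
To make the last point concrete, here is a minimal Python sketch of how the two assumptions can lead to different answers from the same data. The scores are hypothetical and the function names are my own. The analysis of covariance is computed as a least-squares fit with a single pretest slope shared by both groups, which is one common formulation, not necessarily the one any cited author would use.

```python
from statistics import fmean


def gain_score_effect(pre_t, post_t, pre_c, post_c):
    """Difference in mean gain (post - pre), treated minus comparison."""
    return (fmean(post_t) - fmean(pre_t)) - (fmean(post_c) - fmean(pre_c))


def ancova_effect(pre_t, post_t, pre_c, post_c):
    """Treatment effect d from post = a + b*pre + d*treated (least squares).

    b is the pooled within-group slope of posttest on pretest; d is the
    difference in posttest means adjusted for the pretest difference.
    """
    sxy = sxx = 0.0
    for pre, post in ((pre_t, post_t), (pre_c, post_c)):
        m_pre, m_post = fmean(pre), fmean(post)
        sxy += sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post))
        sxx += sum((x - m_pre) ** 2 for x in pre)
    slope = sxy / sxx
    return (fmean(post_t) - fmean(post_c)) - slope * (fmean(pre_t) - fmean(pre_c))


# Hypothetical scores: the treated group starts 20 points higher on the
# pretest, and in both groups post = 0.5 * pre + 36 exactly.
pre_t, post_t = [60, 70, 80, 90], [66, 71, 76, 81]
pre_c, post_c = [40, 50, 60, 70], [56, 61, 66, 71]

print(gain_score_effect(pre_t, post_t, pre_c, post_c))  # -10.0
print(ancova_effect(pre_t, post_t, pre_c, post_c))      # 0.0
```

With these made-up numbers the gain-score analysis says the intervention cost students 10 points, while the analysis of covariance says it had no effect. Neither is wrong arithmetically; they answer different questions, which is why the assumptions have to be made explicit first.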

I suppose it is clear by now that I am pessimistic about the prospects of modeling standardized test scores, or changes therein, as a way of isolating the contributing factors in student achievement/improvement. Rubin et al. take a stronger stand (p. 18):

[... We] do not think that [most value-added] analyses are estimating causal quantities, except under extreme and unrealistic assumptions. We argue that models such as these should not be seen as estimating causal effects of teachers or schools, but rather as providing descriptive measures. It is the reward structures based on such value-added models that should be the objects of assessment, since they can actually be (and are being) implemented.

(1) Donald B. Rubin, Elizabeth A. Stuart, and Elaine L. Zanutto (2004). A Potential Outcomes View of Value-Added Assessment in Education. Journal of Educational and Behavioral Statistics 29 (1): 103-116.
(2) Walt Haney (2000). The myth of the Texas miracle in education.
(3) Thomas Kane and Douglas Staiger (2002). Volatility in school test scores: Implications for test-based accountability systems.
(4) Thomas Kane, Jonah Rockoff, and Douglas Staiger (2006). What does certification tell us about teacher effectiveness? Evidence from New York City.
(5) Damian Betebenner (2006). Lord's Paradox with Three Statisticians. Presentation at the AERA Annual Meeting, San Francisco, CA.