Nickleby Babies and the End of AYP

Now that we have seen our seniors flip their tassels, consider this: The class of 2014 is the first and last cohort whose school career was entirely governed by the AYP system of school accountability. They were kindergartners in January of 2002, when the No Child Left Behind (NCLB) Act of 2001 was signed. NCLB, though long past its expiration date, is still the law of the land. But AYP is over.

NCLB mandated that public school students be routinely assessed by states, that aggregate results of those tests be public record, that each test have a cut score above which a student would be deemed “proficient,” and that all students—no child left behind—must meet that proficiency standard. States were to define consequences for schools failing to ensure that every student was proficient.

But allowing that the accountability system was new, the law gave schools until 2014 (the year then-kindergartners would be expected to graduate) to reach the 100% proficiency standard. In the meantime, states would define a year-by-year set of proficiency thresholds, such that a school that met the standards in a given year would be deemed to be on pace to reach 100% proficiency by 2014. For example, in Pennsylvania, where I worked in the middle of the AYP era, the targets for proficiency were as follows:

Pennsylvania “AYP” Proficiency Targets

Subject   2002-2004   2005-2007   2008-2010   2011   2012   2013
Math      35%         45%         56%         67%    78%    89%
Reading   45%         54%         63%         72%    81%    91%


These numbers refer to the percentage of students rated “proficient” on the state test in order for a school to be making “Adequate Yearly Progress” (AYP) towards the ultimate target of 100%. Although the actual numbers varied from state to state, every state had something similar. There is no such thing as an AYP target in 2014 and beyond, because this year, we are supposed to have all children proficient—the standard from now on is 100%.
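As a rough sketch of how such a table was applied, the lookup below encodes the Pennsylvania targets and checks a school's percentages against them. The names `PA_TARGETS`, `target_for`, and `met_targets` are hypothetical, the year labels are taken at face value from the table, and the check deliberately ignores everything else that went into a real AYP determination (subgroups, participation rates, and so on).

```python
# Pennsylvania targets from the table above, as percent proficient.
# Each entry is (last year of the band, math target, reading target).
PA_TARGETS = [
    (2004, 35, 45),
    (2007, 45, 54),
    (2010, 56, 63),
    (2011, 67, 72),
    (2012, 78, 81),
    (2013, 89, 91),
]


def target_for(year):
    """Return the (math, reading) targets that apply in a given year."""
    if year < 2002:
        raise ValueError("the table has no target before 2002")
    for last_year, math_target, reading_target in PA_TARGETS:
        if year <= last_year:
            return math_target, reading_target
    # 2014 and beyond: the standard is every student proficient.
    return 100, 100


def met_targets(year, pct_math, pct_reading):
    """Simplified check: did the school reach both targets for the year?"""
    math_target, reading_target = target_for(year)
    return pct_math >= math_target and pct_reading >= reading_target


# Illustrative school with 60% proficient in math and 65% in reading.
for year in (2004, 2010, 2011, 2014):
    print(year, target_for(year), met_targets(year, 60, 65))
```

The same school clears the bar through 2010 and then falls short from 2011 on without its results changing at all, which is the ramp the table describes.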

But now that we are all talking about Common Core adoption, and our pundits and lobbyists are prognosticating about whether the mid-term elections will foster ESEA renewal, we have a brief moment to learn from the AYP system before we try to agree on a replacement.

Did the system of accountability which dominated discourse about public education for a decade actually work? If so, then the use of “AYP” was arguably a useful fiction. I say “useful fiction,” because even if it was useful, we now know that AYP was invalid at the most basic level: it did not measure what it purported to measure.

We can demonstrate this empirically. “Made AYP” was supposed to indicate that the school was on-track to have 100% of students proficient by 2014. (You can still read this claim on some state web sites. One state’s explanation of its 2011-12 results, for example, says, “Schools that met AYP measures last year, or that were at ‘Warning’ status last year (e.g., the school did not meet AYP measures for the first time) will be on-track for meeting the NCLB goal of all students reaching proficiency by the year 2014 if they meet all AYP measures this year.”)

Starting in the 2002-3 school year, every school district in the United States was supposed to determine whether each school either did or did not “make AYP.” These were predictions. This year, every school will or will not see 100% of its students proficient. This is reality. AYP was a valid indicator insofar as the prediction matched reality.

True Positives and False Negatives

As it turns out, there is a well-developed standard for assessing the quality of such predictions. For example, imagine a doctor develops a blood test for mortality. How good is the test? Well, the test result indicates that you are mortal or that you are not; and in reality you may be mortal and you may not. Test results and reality combine in four possible ways: A “true positive” happens when the test indicates you suffer from mortality, and in fact you are mortal. A “true negative” occurs when the test indicates you do not suffer from mortality, and you are truly immortal. A “false positive” happens when the test indicates mortality, but in fact you are immortal. A “false negative” occurs when the test shows you are not mortal, failing to detect the mortality you really have.

The validity of the mortality blood test is examined with three statistics: sensitivity, specificity, and accuracy. Sensitivity is the proportion of actual positives that test positive. If the test has sensitivity near 100%, it correctly identifies most of the mortals as mortal. Specificity is the proportion of actual negatives that test negative. If the test has specificity near 100%, it correctly identifies most of the immortals as immortal. Accuracy is the percentage of correct predictions overall. A perfect test would be 100% specific, 100% sensitive, and 100% accurate.
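Written as formulas, the three statistics look like this. The abbreviations TP, FN, FP, and TN are shorthand introduced here for the counts of true positives, false negatives, false positives, and true negatives; they are not notation from the original discussion.

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}
\]
```

Note that specificity has no value when there are no actual negatives at all, since its denominator is then zero.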

Let’s say 100 people are given the test (Edward Cullen and 99 mortals) and the results are as follows: Edward’s vampire blood accurately tests negative for mortality (good specificity). So far so good, but 59 mere mortals also test negative (bad for sensitivity: they have mortality but the test fails to detect it). Forty mortals accurately test positive for mortality.

Mortality Test    Actual Mortality: immortal    Actual Mortality: mortal
Negative          1 (true negative)             59 (false negative)
Positive          0 (false positive)            40 (true positive)

Of the 99 actual mortality sufferers, the test accurately detected 40, for 40% sensitivity. The single immortal tests negative, for a specificity of 100%. Across the whole matrix, there were 41 (40 + 1) accurate indications and 59 errors, for an accuracy of 41%.
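The arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `rates` is made up for illustration, and the counts assume 59 false negatives, the figure that makes the 99 mortals and the 59 errors add up.

```python
def rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) from the four counts.

    A rate is None when its denominator is zero, i.e. when it is undefined.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    sensitivity = ratio(tp, tp + fn)        # share of actual positives caught
    specificity = ratio(tn, tn + fp)        # share of actual negatives cleared
    accuracy = ratio(tp + tn, tp + fn + fp + tn)
    return sensitivity, specificity, accuracy


# Blood test example: 40 true positives, 59 false negatives,
# 0 false positives, 1 true negative (Edward).
sens, spec, acc = rates(tp=40, fn=59, fp=0, tn=1)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}, accuracy {acc:.1%}")
# prints: sensitivity 40.4%, specificity 100.0%, accuracy 41.0%
```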

Now, let’s consider “Did not make AYP” as an indicator of being off-track. If a school did not make AYP in a given year, and then does not end up with 100% of students proficient in 2014, we consider that a “true positive”—the school does in fact have the dreaded leftbehindis predicted by the AYP designation.

To check on the validity of the AYP metric, we can take any given year in which AYP was used, look at the schools which did or did not make AYP, and check against 2014. As an example, let’s look at a major urban school district at a time when AYP was ascendant. This is what the school district presented to its governing board after the 2007-2008 school year:

[Figure: the district’s chart of AYP results]

In 2004, 59% of schools (182 of 308) made AYP. None of the schools in this particular district are going to see 100% of their students score at or above proficient in 2014. We can now find the sensitivity, specificity, and accuracy of AYP 2004 as an indicator of 2014 leftbehindis.

                      100% Proficient in 2014       <100% Proficient in 2014
“Made AYP” in 2004    0 schools (true negative)     182 schools (false negative)
“Did not Make AYP”    0 schools (false positive)    126 schools (true positive)

The sensitivity of 2004 AYP is 41%, because only 41% of the schools that we now know to have leftbehindis were indicated as being off-track ten years ago. The specificity of 2004 AYP is actually impossible to calculate for this city, because there is no school without leftbehindis—they are all mortal. Nationally, though, the specificity will be very nearly 100%—because the very few schools that do make the 100% threshold are those that have made AYP most years along the way. (If there are any Edward Cullens, their superpowers will have been on display all along.) The accuracy of 2004 AYP is also 41%, because the 126 true positives are also the only accurate indicators.
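Here is a minimal sketch of the same calculation for the district, with the undefined specificity reported as `None` rather than as a number. The function name `summarize` and the "flag all" comparison are illustrative assumptions, not part of the district's reporting: the comparison simply asks what would happen if every one of the 308 schools had been labeled off-track.

```python
def summarize(tp, fn, fp, tn):
    """Compute the three statistics; undefined ratios come back as None."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "accuracy": (tp + tn) / total if total else None,
    }


# 2004 AYP in this district; "positive" means "did not make AYP".
ayp_2004 = summarize(tp=126, fn=182, fp=0, tn=0)

# Hypothetical rule for comparison: flag all 308 schools as off-track.
flag_all = summarize(tp=308, fn=0, fp=0, tn=0)

for name, result in [("AYP 2004", ayp_2004), ("flag all", flag_all)]:
    rounded = {k: None if v is None else round(v, 3) for k, v in result.items()}
    print(name, rounded)
```

Because every school in the district turned out to be off-track, sensitivity and accuracy collapse into the same number (126 of 308), and a rule that flagged every school would have scored 100% on both while distinguishing nothing.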

Think again about the blood test that indicates 41% of people are going to die, and imagine someone publishes their names in the newspaper. Nobody will want to marry or hire those people, right? And it is certainly true that they will die, so they can hardly claim slander. But if the test, like AYP in this case, is only 41% accurate and has 40% sensitivity, nearly all of the other 59% of people are going to die too. The test tells us nothing; but if we trust the test, we stigmatize the true positives.

Nationwide, the AYP metric has generated almost exclusively false negatives and true positives. This means that while it was in heavy use, AYP conveyed almost no information because the condition it was attempting to measure (off-trackness) turns out to have been a condition almost every school has. So, in its literal and basic sense, AYP was demonstrably an invalid indicator.

And we made it worse: most states implemented Safe Harbor calculations, Confidence Interval allowances, growth projections, and appeals processes to give schools a better chance, and every single one of those extra ways to “make AYP” generated additional false negatives, and thus rendered the indicator less sensitive, accurate, and valid.

But remember, the validity of a measure depends on the use to which it is put.

Was AYP a valid tool for sorting schools by performance?

Until about three years ago, hitting the annual AYP target was tremendously serious in the annual sorting of schools: for public relations, for enrollment decisions, for targeting of interventions, for charter school renewals, for school turnaround initiatives, and so on. So, was AYP valid in the way it was primarily used—that is, as a shibboleth?

Schools that perennially fared the worst under the AYP categorizations usually were poorly performing schools by any measure. Consistent not-making-AYP categorizations gave parents more power to opt out of those schools, and they gave administrators more options in re-structuring them (whether those options led to sustained improvements is a separate question).

As the thresholds ramped up so that more and more schools failed to make AYP, the utility of AYP to distinguish between schools actually declined—if all schools were designated as failing, AYP might still be useful as a political argument for privatization or generic grandstanding, but it would not be at all useful for determining which schools need particular attention and support. So, curiously enough, the more valid AYP got in its literal sense (predicting whether a school would be 100% proficient), the less useful it became for categorizing schools. The 2014 goal was so unrealistic that an accurate measure of off-trackness would simply have put all schools into the same category. That (and the prospect of reporting to parents and press that 100% of local schools were failing) is why NCLB waivers had to happen.

Did focusing on making AYP improve public education?

This question is hardest of all to answer, because so many other things happened concurrently. We may start by asking ourselves this: Is the class of 2014 better equipped for life than the class of 2001? If so, then the use of “AYP” was arguably a useful fiction.

But in my view, the binary categorization (made/failed to make AYP) misleadingly created a sharp line between schools just-below and just-above the thresholds. Those thresholds were arbitrarily defined and are now demonstrably invalid. Schools’ struggles to get across those thresholds led to a tremendous waste of energy and resources, a shockingly widespread culture of cheating, and—as I hinted in a previous post on proxy measures—instructional decision-making not always in the best interest of the students.

No matter how useful it might be, a metric that is not what it seems to be is confusing and dangerous. “Useful fiction” is simply a polite way to say “expedient lie.” When the dust settles about who is doing Common Core and how, and we return to the question of setting goals and consequences in the post-AYP era, we would do well to remember the lessons of AYP—a metric that seemed most useful precisely when it was least true.
