This discussion has concluded.
Our newest report, Growth Models and Accountability: A Recipe for Remaking ESEA, explores state efforts to combine student growth and achievement into a single accountability measure, moving beyond NCLB’s focus on absolute levels of performance and proficiency and offering policymakers a recipe for a more meaningful accountability system.
Join us June 21–24, 2011, as we take a closer look at the ideas presented in Growth Models and Accountability with experts Daria Hall (Education Trust), Douglas Harris (University of Wisconsin at Madison), Andrew Ho (Harvard Graduate School of Education), Craig Jerald (Break the Curve Consulting), and Education Sector's Kevin Carey. The discussion will be particularly useful as the reauthorization of ESEA draws nearer, and more and more questions around incorporating growth models surface in national conversations:
What can we learn from state growth model pilots? How much growth is enough? And what are the challenges and opportunities for remaking ESEA to accommodate and promote growth models? Our panelists will address these questions and more.
This discussion will be updated daily. If you have a question or comment to share with the panelists, please email Renée Rybak Lang.
Kevin Carey: We'll start with a broad question today: When Congress finally gets around to reauthorizing ESEA, should it require all states to incorporate some kind of growth model into their K-12 accountability systems? If not, why not? If so, how much flexibility should states receive—what should be mandatory and what should be optional?
Andrew Ho: I am a psychometrician, not a policy expert, but I am happy to offer an opinion. I am just returning from the National Conference on Student Assessment, a state-practitioner-oriented conference run by the Council of Chief State School Officers. I saw no shortage of state interest in growth models, but there is immense variability and ambiguity in implementation. The federal role for growth should be to require greater transparency, particularly with respect to the incentives that growth models set. Even simple models like those from the Growth Model Pilot Program have stark contrasts in student and school classification and, more importantly, in the incentive structures they create (Ho, 2011; Hoffer, et al., 2011). Common terminology and greater clarity are necessary to prevent unintended consequences.
As for specific mandates, let us remember that one of the central design flaws of NCLB was the insensitivity of the accountability metrics, both proficiency and AYP, to progress. Students and schools could make significant progress without receiving credit. Thus, I do think that a progress model should be mandatory. However, individual growth models are a class of progress models, and I think that Congress should not be so prescriptive as to mandate individual growth models without clearer agreement on what these entail and what their scope could possibly be (match rates). There are many interesting and effective ways to measure progress towards and past a standard, both at the school and student level, that are not well described as either growth or aggregate growth, and the feds should be open to these progress models even as they justifiably shift their focus from the insensitive status models of the past.
Craig Jerald: I agree with Andrew that the federal government should make some kind of student progress measures mandatory when evaluating school performance for accountability purposes. More than that, I think it's politically inevitable.
Kevin's follow-up question about how much flexibility states should receive in incorporating, interpreting, and using growth measures is the tougher question. But before discussing that, I think it's important to point out that the first requirement for this policy conversation has to be a much bigger dose of frank and honest realism about accountability, accountability measures, and what growth measures can and cannot accomplish. At one point, growth scores were being characterized as a kind of panacea for the problems in the current AYP formula. But the federal growth pilot and Andrew's recent research have proven such hopes to be naive.
Without the kind of statistical manipulation Ohio used in its growth pilot, growth models do not significantly increase the proportion of schools making AYP (or "AYG"), at least as long as some expectation of eventual proficiency remains. And Andrew's recent solo research convincingly demonstrated that different kinds of growth models carry their own "perverse incentives" and potential unintended consequences, perhaps at least as serious as those everyone has complained about for years under current AYP.
So an important observation in Kevin's recent paper should be front and center in every policy conversation about this topic and in every communication with stakeholders: "No single mathematical formula [or set of formulas] can adequately capture all of the distinctions among schools." And I would add a corollary: "And no accountability system can ever be completely free of perverse incentives and potential unintended consequences."
Policymakers can certainly anticipate and mitigate such problems, but no system can ever escape them entirely, and you can't even mitigate them if you pretend they don't or won't exist. So in the upcoming federal policy debate, policymakers and advocates should acknowledge tradeoffs among different options, be very frank and honest about such tradeoffs with educators and the public alike, and then incorporate deliberate strategies to mitigate negative consequences as much as possible.
As for the flexibility question, the federal government is in a bit of a Catch-22 here. On the one hand, the present political climate is highly inimical to any proposal for a standardized set of requirements that would provide little flexibility for states. And I'm not sure that the federal government should try to choose one best system for all states to use, since the growth pilot and Andrew's research make it clear that there simply isn't one best system. Perhaps, at the very least, states should manage the tradeoffs inherent in choosing among different approaches to incorporating student growth into school accountability systems, since they’ll have to live with those tradeoffs.
On the other hand, if the next ESEA allows too much flexibility in designing school accountability systems, even the flexibility to choose among different growth models such as states enjoyed under the federal growth pilot project, states inevitably will end up with significantly different ways of evaluating schools. And therefore it will be possible for critics or watchdogs to obtain data showing that a school judged as making satisfactory progress in one state would not be judged so in another state, or vice versa. If past experience is any indication, the law itself will be criticized as "unfair" for that reason.
In fact, the current ESEA required states to use a highly standardized AYP formula offering very little flexibility, yet because it allowed states to continue using different tests and proficiency cut scores, the law was roundly excoriated as being highly inconsistent and unfair in its expectations. I would argue that such differences are the inevitable price we pay when allowing some amount of flexibility in holding schools accountable. But if the last 10 years are any indication, differences stemming from such flexibility will be perceived as some kind of flaw in the law itself rather than an acceptable tradeoff for greater state flexibility.
So where do I stand on the best place to draw the flexibility line? Personally, I'm not sure. We're still discovering many new things about these various growth models—strengths, limitations, weaknesses. And Andrew's earlier response today gives me pause as well. I want to know more about that wider universe of possibilities to which he referred.
Kevin Carey: I think Craig is right to point out the conflict at the center of this conversation. On the one hand, it's impossible to design a uniform, formula-based process for 1) Accurately gauging school quality in all of its many dimensions, and 2) Determining the appropriate intervention or response to that assessment. Certain aspects of NCLB embodied a hope that the right formula, properly enforced, could politician-proof the governance of K–12 education. That didn't work very well.
At the same time, the more decisions are left at the state and local level, the more variance we'll have, along with commensurate accusations of inconsistency, arbitrariness, and so forth. It's not a solvable problem so much as the inherent tensions of federalism. And I think the best way for federal policymakers to mitigate those tensions is to combine a regulatory policy of maximum information—mandating the collection of data about as many important things as possible, including progress and growth—with a political strategy of building consensus among state policymakers around key policy initiatives, as we have with the Common Core.
Daria Hall: Growth can play an important role in a new accountability framework, but it must be just that: one component of a larger accountability system. So to answer this question, we first need to step back and look at the broader ESEA accountability framework.
As a steward of taxpayer dollars and as the historic protector of the interests of disadvantaged communities, the federal government has an important role to play in setting achievement goals. In most cases, determining how improvement should happen is best left to state and local authorities. Therefore, the accountability provisions of a reauthorized ESEA should include the following:
- Aggressive but achievable goals—goals that expect increased achievement for all students and gap closing between different groups of students.
- A requirement that all states:
  - Set annual improvement targets toward those goals for all schools and districts;
  - Make annual accountability determinations based on schools' and districts' performance relative to the goals and progress targets; and
  - Develop systems of differentiated supports and consequences for all schools and districts based on their accountability determination.
- Careful attention to the transition to new, more rigorous college- and career-ready standards and assessments. Expectations for improvement and gap narrowing must be maintained once these new tools come online, but goals must reflect the new achievement picture to avoid a situation in which all schools and districts are falling far short of what's expected.
As I mentioned, growth can play an important role within this framework, allowing for a more complete picture of school and district performance, and providing important information on which to base decisions about what kinds of supports, interventions or rewards are appropriate. But the specifics of growth—including how it’s calculated, how it’s incorporated into accountability determinations, and how it's used to identify appropriate interventions—should be left to the states. We don’t yet have enough good information on how growth should be used, or what it will look like on the new, consortia-developed college- and career-ready assessments to set specific requirements in federal policy.
The bottom line: Congress should stick to setting clear, high goals for improvement and gap closing, and states should be encouraged to include growth in their accountability systems in ways that contribute to these goals.
Doug Harris: Should Congress require growth? Absolutely. Why? Because one widely accepted productive role for the federal government is to make sure useful school performance information is available for parents and local policymakers and, if all else fails, to determine which schools require some sort of government intervention. The current ESEA measures do an extremely poor job of measuring performance because they violate what I describe in my book on value-added measures as the Cardinal Rule: Hold people accountable for what they can control (Harris, 2011). The federal government should require not just any information, but good information.
As Craig points out, growth measures are not perfect and come with their own flaws. The main weakness (compared with status measures) is that they are somewhat imprecise. This is partly because of the low quality of the tests, but it's more just basic statistics—measuring change is harder than measuring status.
Why is this trade-off of precision for sensibility worth it? First, because the imprecision is fairly manageable for schools, where large numbers of students increase precision. Second, because for educators to really buy into the measures, they have to make sense, and the current measures don't. People talk about "no excuses" but the current measures give educators an easy and legitimate one—the current measures barely even pretend to measure school performance in a meaningful way. Sandy Kress (President George W. Bush’s first education advisor and key NCLB architect) wanted to use growth, but it wasn't politically feasible. Now, it's feasible, so let's change it.
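To put rough numbers on the precision point, here is a minimal simulation (the school size, score scale, and error terms are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 500          # hypothetical school size
true_growth = 10.0        # true average scale-score gain
sigma_err = 15.0          # measurement-error SD on each test

true_fall = rng.normal(200, 30, n_students)
fall = true_fall + rng.normal(0, sigma_err, n_students)
spring = true_fall + true_growth + rng.normal(0, sigma_err, n_students)

gain = spring - fall
# The error variance of a gain is the SUM of the two tests' error
# variances, so an individual student's gain is very noisy ...
print(f"SD of individual gains: {gain.std():.1f}")   # about sqrt(2) * 15
# ... but a school's mean gain has standard error sigma / sqrt(n),
# which is modest once n runs into the hundreds.
print(f"School mean gain: {gain.mean():.1f} "
      f"(SE about {gain.std() / n_students ** 0.5:.1f})")
```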
I'm hoping in the next round, Craig, that you can clarify your reluctance to move in this direction. You describe growth as "inevitable" and not a "panacea." You also write that "no accountability system can ever be completely free of perverse incentives." I don’t dispute any of this, but why no positive words about growth? Yes, there will be perverse incentives in any system, but some are worse than others and the perverse incentives we have now are really absurd. For one, we say we want to "leave no child behind," but then we set up a system in which there is every incentive for teachers of students left behind to leave their schools. Under the current system, those teachers will be given the scarlet letter—or, if they're lucky, be left alone—almost no matter what they do.
The bigger problems arise when we try to attach penalties and rewards. Any accountability system based on any performance measure creates incentives to manipulate the measures. Most penalties and rewards (except financial bonuses) also require putting schools into categories—"low-performing," "high-performing," etc.—and this too creates perverse incentives around the performance cutoffs. This is not an argument against using incentives; I think they do have a role to play. My point is that the inevitability of some perverse incentives doesn't excuse using performance measures that seem designed to create the most perverse incentives possible.
Should the states have flexibility? To a point. For reasons I explain in my book, they shouldn't have flexibility over the basic definition of growth—it means changes in individual student outcomes over time. Not cohort-to-cohort growth (as in NCLB) and not growth-to-proficiency (as in the Growth Model Pilot). Again, the federal government's role should be to provide good information.
But defining growth is just the tip of the accountability iceberg. I think some state flexibility is warranted over setting performance categories, assigning rewards and penalties, and intervening in low-performing schools. It seems like we're headed in that direction and that's a good thing. Perhaps more significantly, the federal government is also intervening like never before on a matter that is probably even more controversial—and educationally consequential—that is, teacher evaluation and teacher accountability. I'm sure Kevin doesn't want us to swerve off course into Race to the Top and the teacher elements of the ESEA re-authorization, but I think we have to at least point out the connection. Schools are groups of teachers (and school leaders and other important people), so teacher and school accountability are closely connected.
Kudos to Kevin and Rob for their report on student growth and ESEA. As they know, I've been making a lot of these same arguments for several years (and say the same, and more, in my book). I think there is a rising tide of agreement and that's nice to see—actually enacting the changes into law would be even better.
Craig Jerald: Doug, I thought I made it pretty clear in my first sentence that I have no reluctance about the federal government requiring use of growth measures. When I said “should require,” that’s exactly what I meant. In fact, Sandy Kress was not the only one disappointed that the feds could not require a combination of status and growth during conversations leading up to NCLB back in 2000-2001. I was at the Education Trust at the time, and we initially were just as enthusiastic about the possibility. You'll recall that the Education Trust did more than any other organization in the late 1990s to spread the word about the value-added approaches being pioneered on the ground in places like Tennessee and Dallas (Good Teaching Matters, 1998).
So it didn't take much imagination for us to envision the tremendous potential of looking at growth as well as status. Just grab a piece of paper and sketch a matrix that identifies schools according to "high status, low growth" or some other combination. And now that states like Colorado have actually produced such information—and done such a good job of it—it's hard for me to imagine anyone arguing that growth shouldn't be part of any effort to understand school performance.
Also, I clearly recall doing a lot of legwork during those ESEA reauthorization conversations to try to figure out exactly how much capacity existed for producing growth measures across the 50 states and DC. "How many states can do it now? How many are working on it? How many could do it within a few years? Can we require states to do it by a certain date and work out some kind of transition scheme? Can we at least allow the handful of states that DO have the capacity to start doing it now?" Alas, it simply wasn't feasible.
The reason I didn't spend a lot of time making an elaborate case for growth measures earlier today is that I don't think there's much debate about the question at this point. Is there any major education group or membership association publicly arguing against incorporating growth measures in the next ESEA? I can't think of any. I guess there might still be organizations concerned about capacity problems in some states. But that’s a different issue. (Anyone have a good read on that? Am I wrong in assuming all states should be able to do this by the time the next version of ESEA actually goes into effect?)
What I'm expressing is not reluctance about growth measures per se, but rather a deep and sincere hope that elected officials and policy advocates can have a frank and honest conversation about both the strengths and the limitations of growth measures. That's important so we don't create unrealistic expectations by overselling the next version of AYP and also so we can proactively confront potential unintended consequences.
Doug, I don't think you disagree with that generally, but we might disagree on the nuances. Having looked closely at Andrew's recent work, I believe the conversation about growth measures has to acknowledge more than the "instability" issue you mention. For example, Andrew points out that "trajectory" growth models could be gamed by depressing early test scores to increase the slope of the trajectory line. How is that a lesser "perverse incentive" than the incentive to focus on the "bubble kids" in the current AYP? And look no further than Texas for an example of the kind of political trainwreck that can occur if states aren’t proactive in explaining the quirks of "projection" growth models.
Now, are any of those flaws reasons to avoid growth models entirely? Absolutely not! The information that growth models provide simply is far too valuable not to exploit now that more states finally have a lot more capacity to produce such data. Again, I literally cannot imagine anyone spending more than five minutes on that Colorado website and arguing that growth should not be considered when evaluating school performance.
So beyond Kevin's follow-up question about flexibility, I think the real debate now is about the following:
First, in requiring states to incorporate growth into school accountability systems, should the federal government restrict the choice of growth measures only to the kind of "growth-to-proficiency" models allowed under the federal Growth Model Pilot Project? If not, should it at least require states to continue considering status as well as growth (a la Colorado's "high growth, low status" kind of matrix)? Or should it allow states to move beyond status measures and proficiency concerns entirely and only consider growth?
Second, more broadly, should the federal government move away from requiring states to make dichotomous pass/fail pronouncements about school performance? (I know the current ESEA uses the term "needs improvement" rather than "failing," but that distinction has been entirely lost in the public discourse about NCLB.) I like Kevin's and Rob's argument in the Ed Sector paper that accountability systems should consider a range of important information on school performance and rely on a good dollop of human judgment to make decisions about how to help schools improve and exactly what to do when schools persistently fail children. However, if we move in that direction, do we even need some kind of system that lumps schools into these definitive categories like "making AYP" and "not making AYP"?
Daria, forgive me for saying this, but maybe it’s time to really open up this conversation. Begin with a clear set of specific goals for accountability. The first is to protect children, particularly poor and minority children, from having to attend schools where they have little or no hope of graduating ready for the next level of education. The second is to provide all schools with information about strengths and weaknesses plus assistance to build on strengths and address weaknesses so they can continuously improve. The third is to provide information about school performance to the public and parents. Then work backward from those three goals to create a coherent "theory of action" for how to achieve each of them, being honest about how the world really works including potential unintended consequences. If we did that, would we end up deciding that the best way to achieve all three goals is a system that lumps all schools into two categories?
Doug Harris: Craig, that's helpful. I still read the gist of your first note as one of reluctant acceptance of the idea. You said it should be mandatory without saying why and then followed with "more than that, I think it's politically inevitable." But it now seems what you meant was to turn the conversation to the other important issues in the design of the accountability system that uses those measures. Certainly, these other issues—defining performance categories, deciding what to do with low-performers, etc.—are extremely important.
I don't agree on the feasibility question and I’ve had this same conversation with Sandy Kress. To comply with NCLB, states had to expand testing to cover more grades. Once you complied with NCLB, then school-level growth was automatically possible. (As proof, I did this by myself for a bunch of states using publicly available data all the way back in 2003.) It clearly wouldn't have been possible to do growth at the student level, because that requires the type of fancy data system that is only now becoming common, but growth in school-level averages would have been better than what we got.
I agree, as I argue in my book, that one potential perverse incentive of growth measures is to dissuade schools from focusing on early learning. But that also happens to be one of the easiest problems to solve—combine the first test (generally third grade) with growth in the subsequent grades to get a combined performance measure. This gives schools incentives to generate growth in all grades.
Education Trust does a tremendous amount of work to reduce achievement gaps and I'm certainly not familiar with all of the organization's reports, but I do recall well the two No Excuses reports you published around 2001 (no longer available on the web site). Those reports, intentionally or otherwise, strongly reinforced the NCLB status measure approach and received a lot of press (certainly much more press than my response! Though we did manage a response in the New York Times; see my web site here for the full report). I bring this up here mainly as another reason why I interpreted your original comments as reluctant support for growth.
In any event, the more important thing is that we seem to generally agree now on how to proceed. In addition to shifting toward growth measures, I agree completely with your and Kevin and Rob's arguments about having more fine-grained performance measures and thinking about all the goals of accountability, including especially providing opportunities and reducing the achievement gap for low-income and minority children. This, in addition to providing more and better information, is a fundamental role for the federal government to play.
Andrew Ho: Growth is our new weasel word, a word that builds false consensus. Like "proficiency" before it, the prospect of 50 different definitions of growth is enough to make policymakers blanch and newspaper headline writers salivate. The two general assessment consortia are not a near-term solution for a common growth model; they are working hard enough towards a common assessment model first. At the federal level, one solution is to be extremely prescriptive in the definition of growth and adequate growth. I don't think that is politically or practically feasible given local control and the capacity of many assessment systems.
Instead, as I mentioned before, the feds can encourage consistency by encouraging transparency. The common standards movement arose in part from the public shaming of low proficiency cut scores and State-NAEP discrepancies. These kinds of revelations should be part of the application process. Require clear maps of incentives, in terms of required growth disaggregated by groups and levels of performance. Require crystal clarity about how the incentives are expected to lead to improvement of teaching and learning for the target population. And anticipate how gaps in the measurement system may lead to gaming—it's worth remembering that the number of students and teachers in a growth model will always be less than the number of students and teachers in a status model.
I have been highly critical of status models in the past (Ho, 2008), but I have noticed that they can temper perverse incentives in growth models. As I noted and Craig mentioned, trajectory models encourage fail-first strategies, but overlaying a status model discourages depressing scores anywhere beyond the proficient cut score. Also, since regression/projection models work like inertial, super-status models, stubbornly (and accurately) predicting low scores for consistently low-scoring students (see the previous Hoffer, et al., and Ho references), a status model can reduce incentives to triage, albeit no better than status models of the past. Growth and status working in tandem create a better accountability system than either on their own.
Doug, on edit, as an aside, I'm not sure how averaging solves the fail-first incentive. Perhaps you mean that it dilutes it. All else equal in a trajectory model, we still want 3rd grade scores to be low, don't we? Perhaps the ambiguity arises from the fact that we haven't defined what we mean by growth. I don't want to get too technical, but maybe we can lay out some distinctions, such as the difference between growth and conditional status (like Colorado's SGPs), in later exchanges.
Doug Harris: Sorry if my earlier suggestion was unclear. I was trying to say that combining the first score level (e.g., 3rd grade) with the growth in subsequent years would avoid the perverse incentive because paying less attention to student learning prior to the first test would reduce the combined performance measure. Combining the initial level and subsequent growth of course has the effect of reducing the proportion of the performance measure that is really focused on growth. But in my view (and it sounds like Craig’s as well) the idea of undermining PK–3 learning so that we can focus on growth is deeply problematic and needs to be avoided at all costs. I hope that’s clearer.
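A toy calculation may make the arithmetic of this exchange concrete (the scores and the 50/50 weighting below are invented purely for illustration):

```python
def trajectory_growth(grade3, grade5):
    """Growth as credited by a simple trajectory model: the raw gain."""
    return grade5 - grade3

def combined_measure(grade3, grade5, w_status=0.5):
    """The fix discussed above: weight the initial level with later growth."""
    return w_status * grade3 + (1 - w_status) * (grade5 - grade3)

# Same student, same grade-5 score; only the grade-3 score differs.
honest, depressed, grade5 = 220, 200, 235

print(trajectory_growth(honest, grade5))      # 15
print(trajectory_growth(depressed, grade5))   # 35  <- failing first pays
print(combined_measure(honest, grade5))       # 117.5
print(combined_measure(depressed, grade5))    # 117.5 <- the payoff is gone

# The algebra behind Andrew's "dilute" question: the combined measure is
# (2w - 1) * grade3 + (1 - w) * grade5, so at w = 0.5 the grade-3 score
# cancels exactly; at smaller w the fail-first payoff is merely diluted.
```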
Craig Jerald: I'm sold on Andrew's assertion that "growth and status working in tandem create a better accountability system than either on their own." Both seem to provide important information about student performance, and he makes a good case that using them in tandem can help mitigate some problems with using either alone. From her earlier response, it sounds as if Daria would agree with that as well, and from his paper, I'm pretty sure Kevin does. What about you, Doug? Do you see value in requiring states to continue to incorporate status measures in tandem with growth, especially since you believe the federal government should not permit states to define growth as growth-to-proficiency as in the Growth Model Pilot Project? I'd be interested in hearing your thoughts and reasons either way.
Doug Harris: I think there is a role for status. First, it should be part of every school’s report card and count for some small percentage of any composite measure. Second, it’s important in elementary schools to encourage schools to focus on K-3 learning, as I mentioned earlier.
I like what England does. They identify “low-performing” schools based on a variety of factors, but they have a policy of not intervening in schools that have high growth. More generally, I think that penalties and rewards should be based almost entirely on growth and value-added (except for the elementary school partial exception above).
I also like Kevin and Rob’s idea of giving status more of a role at higher units (e.g., school districts) because at those higher levels, the factors affecting achievement that are arguably external to schools are internal to the community. But, as I’m suggesting above, the other dimension in deciding how to use proficiency is the degree of direct stakes attached.
Kevin Carey: It’s helpful to break down accountability into its component parts, which I would describe roughly as follows:
- Measure what you care about
- Interpret results of measurement
- Act on results of interpretation
The education policy debate has been preoccupied with the first step, measurement. Many common critiques of NCLB focus on inadequacies and limitations of the current measurement process—the tests aren’t good enough; we only focus on math and reading; we don’t take into account 21st century skills; we don’t measure growth, and so on. These are all fair observations and steps can be taken to address them. But to my thinking, not enough attention has been paid to (2) and (3)—and these are the really difficult things to do. It’s not actually terribly difficult to identify schools that are abject failures. As we learned from the growth model pilot project and the longer two-decade history of the standards and accountability movement, some schools look bad no matter what measurement process you use, because schools that can’t teach basic skills in math and reading tend to be bad at everything else. The Obama Administration deserves credit for moving to help students trapped in such schools with a new sense of urgency. But from a policy perspective (not politically, mind you), identifying a really bad school and replacing it with something completely different is about the simplest accountability maneuver to pull off.
The rest of the education world is a tougher nut to crack. And that’s where I think it’s incumbent on policy analysts to be specific about how they expect steps (2) interpretation and (3) action to be accomplished, and the likely consequences of those policies. So when Daria says “systems of differentiated supports and consequences”—what does that mean? Is the differentiation mechanistic, via formula, or will certain people be tasked with applying human judgment to the question? Same thing with determining the supports and consequences. How? Who? These questions seem more important to me than deciding what percentage of some composite measure ought to be influenced by status, since that assumes that step (2) interpretation will be accomplished via some kind of formula-based conglomeration of measures that triggers a predefined step (3) action, which is probably a bad idea to begin with.
Daria Hall: To be clear, Ed Trust is not calling for an accountability system that “lumps all schools into two categories.” In our extensive work with Congress, the Administration, and other policy and advocacy organizations we’ve stressed that one of the biggest problems with the current system is that it paints very different schools with the same brush. A school that’s persistently low achieving for all students is different than one where overall achievement is OK, but a particular group of students is struggling. Both schools need different kinds of support and intervention. Likewise, as widely acknowledged in this and just about every other discussion of accountability policy in recent years, a school that’s low on status and growth is very different than one that’s low on status but high on growth.
At the heart of our recommendations is an expectation that states develop systems that categorize schools based on a range of indicators — including, but not limited to, status and progress, both overall and for groups — and devise systems of supports and interventions that reflect the needs and capacities of schools in these different categories. As Kevin and Craig both point out, this is impossible to do perfectly, but it can certainly be done better than it is now. And it’s something that we’re excited to see factor prominently into the Council of Chief State School Officers’ new roadmap for accountability.
But as I said in my previous comment, all of that has to take place within a system of clear goals, and that’s where the federal government can play its most important role. Any new federal action must be rooted in clear, actionable expectations for raising achievement and closing gaps.
Getting to the more recent questions: Do I think that the combination of status and growth can be a powerful tool for prompting this kind of improvement and gap closing? Absolutely, and it’s something that can and should be promoted through legislation, waiver, even a reopening of the growth model pilot depending on how other policy developments progress. But there are so many questions, many of which have surfaced in this conversation, about how to calculate growth and include it in accountability that I’m less than sure that federal policy can responsibly set specific parameters for it at this point. I say that in full knowledge that the states do not have a strong track record of making flexibility serve equity and achievement. That’s why there will always be a need for thoughtful policy monitoring and strong advocacy.
I’d be curious to get the group’s thoughts on the implications of the transition to new, more rigorous college- and career-ready standards and assessments. How much effort should go into retooling the measures in state systems before the transition? For those states that aren’t currently doing growth, how much value is there in starting it now, if the assessments are expected to change in 2014–15 and some states are eager to drop new standards into their current assessment framework even sooner? Likewise, for Andrew and others, how much do we know about what PARCC and SBAC will allow for growth that can inform our thinking?
Craig Jerald: Kevin, I like your tripartite description of accountability components, and I strongly agree with your assertion that we spend too much time arguing about the first and too little time discussing (and thoughtfully designing!) the second and third. But I think that's because we're not disciplined and clear and specific about the essential preliminary question: What are the specific goals for school accountability systems? Policymakers tend to talk about lots and lots of different purposes, including distribution of carrots and sticks. Here's the thing: If distributing praise/rewards and blame/consequences is the primary goal or even a primary goal for such systems, the focus inevitably gets stuck on the fairness of the measures and the flaws in the measures, and the conversation about that goal tends to overwhelm the entire discourse. So why not work from some clear goals everybody can agree to that specifically do not include distribution of carrots and sticks? I.e., providing all schools with the information and assistance to build on strengths and address weaknesses, while taking care to protect poor and minority students from having to attend schools where they have little or no hope of graduating ready to be successful in college and careers. Designing a thoughtful accountability system around just those two goals would be a challenging task!
Another reason to get clear on goals is that seriously attending to multiple goals increases the complexity and cost of any accountability system. Every goal requires an investment in a set of specific mechanisms to ensure the goal is met (processes, training, infrastructure, decision rules, etc.), and those mechanisms together are what make up your system. I wrote a paper with Kristan Van Hook for the National Institute for Excellence in Teaching earlier this year that made the same point about teacher evaluation. That conversation, too, has been way too focused on the measures with way too little thoughtful discussion about interpreting and acting on the information. We said: "Any attempt to design a complex system (whether for teacher evaluation or school accountability or education finance) inevitably entails trade-offs among many possible design features, all of which might seem desirable in the abstract. Gaining clarity on specific goals for the system will help policymakers make thoughtful choices about" how to design the system and avoid merely giving "lip service" to one or more goals. A teacher evaluation system whose primary goal is to dole out rewards and sanctions or to identify and fire "ineffective teachers" will look quite different than one whose primary goal is to provide meaningful feedback and assistance so every teacher can improve every year—even when they use virtually the same measures!
Therefore, it worries me that the Obama administration continues to talk about school accountability systems distributing praise and blame and rewards and consequences. I guess someone might argue that rewards and sanctions are necessary to achieve the two goals I proposed, but I'm not at all sure that's true. Yes, when a school has very low status and growth measures year after year, aggressive interventions are necessary to protect children from harm, and those interventions can certainly feel like punishments to educators in such schools. But that's a very different thing than designing a system whose primary purpose is to distribute praise/rewards and blame/consequences. I think it's time to reframe the policy conversation about accountability, and the way to do that is first to get clearer (and smarter) about goals.
Daria, you raise some very good questions about the transition to the new Common Core State Standards and the new common assessment systems being developed by state consortia, but I have a board meeting to get to, so I'll weigh in on that later ... But something I contributed to the PARCC consortium's Race to the Top proposal is related to the observations I made above: If policymakers are serious about the "continuous improvement" goal (rather than just giving it lip service as a talking point), then these new assessment systems and accountability systems have to be designed to provide as much FOR teachers as they demand FROM teachers.
Andrew Ho: We've got a lot of great threads open, but I'll make comments on just two points. First, I'll talk about how I see the growth-to-proficiency models of the Growth Model Pilot, which many of you criticized. I don't disagree entirely, but I have a different framing. Second, I'll make a brief comment spinning off of Daria's question about our new buzzphrase, college and career readiness.
I think of growth-to-proficiency models as having three components. First, they have an operationalized definition of growth. Second, they come with a fairly restrictive definition of "adequate growth," in terms of predicted proficiency within X years. Finally, this definition of adequate growth is folded into an accountability model—in the pilot, one that classified schools in very different ways across states (confidence intervals? multiyear averaging? safe-harbor?). Conveniently and not coincidentally, these components map loosely onto Kevin's tripartite scheme.
What component led to the pilot being so variable across states and having, overall, a fairly modest effect? The easy answer is, all of them, but I think the contribution of the third, the vagaries of state AYP/AYG calculations, is underappreciated. My conclusion from the pilot is not that growth-to-proficiency models are too restrictive, but that we need to require transparency about all three components. I'll recast Kevin's scheme somewhat for my own purposes: 1) the definition of growth, 2) the evaluation of growth, and 3) aggregation of growth (and status) towards an actionable classification. At the first level, the range of growth definitions in the pilot is straightforward and familiar, although there are certainly stark contrasts we've described before. At the second, growth-to-proficiency is a commonsense approach to standard setting that we are seeing reincarnated rhetorically in the form of career and college readiness. At the third, I can only hope that aggregation and subsequent action will proceed with more nuance and care than it has in the past. In sum, I place much of the blame for the vagaries of the pilot results on the accountability models, not the growth models.
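To make those three components concrete, here is a deliberately simplified sketch (the cut score, the X-year horizon, and the 60 percent bar are all invented):

```python
PROFICIENCY_CUT = 240
YEARS_TO_PROFICIENT = 3   # the "X years" of the pilot models
AYG_THRESHOLD = 0.60      # hypothetical percent-adequate bar

def on_track(prior, current):
    """Components 1 and 2: define growth as the observed gain, then ask
    whether that gain, if sustained, reaches the cut within X years."""
    gain = current - prior
    return current + YEARS_TO_PROFICIENT * gain >= PROFICIENCY_CUT

def school_classification(scores):
    """Component 3: fold student-level adequacy into one pass/fail label,
    which is where the growth information disappears."""
    adequate = [on_track(p, c) for p, c in scores]
    pct = sum(adequate) / len(adequate)
    return "makes AYG" if pct >= AYG_THRESHOLD else "does not make AYG"

# Two schools with very different growth profiles, same binary label:
school_a = [(200, 215), (210, 220), (225, 235), (230, 232)]
school_b = [(235, 238), (236, 239), (238, 240), (210, 214)]
print(school_classification(school_a))   # makes AYG
print(school_classification(school_b))   # makes AYG
```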
On the buzzphrase of career and college readiness, I'll stick to my psychometric guns and leave aside the mammoth undertaking of defining the amorphous term (Steve Dunbar just organized a good session on this at NCSA, and Wayne Camara's slides have many good references), not to mention what states contemplating assessment/accountability redesign should do while that's figured out (wait on the consortia, the reauthorization, or both?). I'll just state the obvious: "career and college readiness" does not answer the standard setting question. Linking tests to future outcomes may help to improve alignment between "readiness" cut scores across grades (at least until the relationships start to drift over time!), but it does not automatically define adequacy for either status or growth. I've already resigned myself to the upcoming rhetorical battle over who has higher standards, with far too little discussion of how these cut scores might actually function to incentivize growth for target populations.
Doug Harris: My take on Daria's question: We tend to talk about growth and value-added in terms of today’s achievement scores because that's what we have, but the basic concept can really apply to any student outcome, so long as we can explain a substantial share of the variation in that outcome with some prior information about students. Last year’s score for a given student is of course a very strong predictor of next year’s score, which is why growth is useful. (Technically speaking, the issue is a little more complicated, but I think that’s beyond the scope of this discussion.) I don't see any reason to think that this will be any less true of the set of assessments coming out based on the new standards.
One concern often raised is that some tests are not "scaled" properly for growth and value-added. What we want ideally is an "interval scale," so that increasing the score by one point means the same thing no matter where you are on the scale. But that's hard to accomplish in practice because we can't measure the quantity of knowledge and skill in the same way we can measure, say, the length of a pencil in inches. I tend to think this issue is over-stated; I commissioned a paper from Derek Briggs at CU-Boulder for the National Conference on Value-Added in 2008 and he found that school value-added measures were insensitive to scaling method.
Another potential concern is test measurement error. The more measurement error we have in the tests, the more random error we have; that is, the random error problem caused by growth itself (see my earlier post) gets worse as the test measurement error rises. Again, this is probably more a concern at the teacher level than the school level. I’m not sure where the new assessments will land on this point, but we can do growth and value-added either way. (As an aside, I strongly advise in my book against the idea of letting growth and value-added calculations drive test design. If we think we need open-ended test items to measure the skills we want—as I think we do—then that's how we should design the tests. The tests are there to serve the educational system, not to serve the performance measures. I know of at least one state that is planning to eliminate open-ended questions partly on this faulty logic.)
Whether the college- and career-ready standards are "valid" in any sense (such as being good predictors of actual college and career success) is really a separate matter. We certainly want the scores to be valid in this way, but that’s true whether or not we use growth. Also, I'm not sure we should spend too much time worrying about how "high" the standards are per se, since there isn't really an objective way to determine this. It will be determined by political forces and I think that's OK in a system where states have discretion. On the other hand, if the federal government were going to continue to determine which schools demand intervention, then the lack of standards is more problematic (as it is now in defining proficiency).
Andrew, I'm not sure I caught the practical implications of your last post. Maybe you can clarify?
Craig Jerald: Doug, thanks for those observations about some of the more technical considerations related to scaling, reliability, and validity, which are very helpful to non-psychometricians like me.
I'm going to try to shed some light on Daria's last question: "How much do we know about what PARCC and SBAC will allow for growth that can inform our thinking?" Unfortunately, the answer to even that seemingly simple question is anything but straightforward. I'm sure my fellow discussants are familiar with most of what I'll say, but it might be helpful background for folks out there following the discussion.
The Department's guidelines required that common assessment systems be able to provide "an accurate measure of student growth over a full academic year or course." Later, the application seemed to suggest that growth should be defined as growth-to-proficiency, or in this case "growth-to-college-and-career-readiness," which would suggest the kind of models allowed under the federal Growth Model Pilot Program (GMPP): "produce student achievement data and student growth data that can be used to determine whether individual students are college- and career-ready or on track to being college- and career-ready."
However, when discussing growth measures for school accountability, the Department was not as specific, requiring only that the systems "produce data … including student growth data, that can be used to inform determinations of school effectiveness for purposes of accountability under Title I of the ESEA." And, in the definitions section, the application defined "student growth data" quite broadly as "data regarding the change in student achievement data between two or more points in time."
To make matters more confusing, the Department's definition of the term "on track" seemed to equate the term with a proficiency cutoff, in other words a status score, rather than a growth metric, despite the fact that "on track" is a determination based on growth measures in the federal GMPP. The application specified that "on track" must be "demonstrated by an assessment score that meets or exceeds the achievement standard," and in a footnote further clarified that "the term 'on track to being college- and career-ready' is used in place of the term 'proficiency' used in section 1111(b)(3) of the ESEA." On the other hand, that could be interpreted to mean that student growth measures need not be restricted to growth-to-proficiency models, since "on track" is based on a status score, not a growth score—and that is more or less how the consortia interpreted it.
Perhaps understandably, in their applications for funding, the two consortia were quite vague about how they might approach growth measures, although both specifically mentioned an interest in the “student growth percentiles” method championed by Damian Betebenner and currently being used in several states including Colorado. PARCC said, "The Partnership has not at this time determined a particular scale or analytic approach" but that "the Partnership states will review existing state growth measures, such as the student growth percentile method." SBAC said that in addition to information about college- and career-readiness at the high school level and the extent to which students are "on track" to college- and career-readiness in earlier grades, the consortium would produce, "indices of annual growth in student learning (e.g., Colorado Growth Model) that allow for normative comparisons of student gains."
How much have those plans been fleshed out over the past year? It's hard to say. To guard against even the hint of impropriety in procurement (a huge challenge given the novelty and scope of these assessment systems and the intense interest of vendors), both consortia have to be exceptionally careful about when and how they provide information regarding their current thinking about any major design decision, either formally or informally. Furthermore, when it comes to deciding how to measure growth, the consortia are in all likelihood hedging their bets and waiting to get a better read on what the reauthorized ESEA will require—just like everyone else.
I know that's not exactly an enlightening answer, or even a very useful one, but that’s the reality.
Andrew Ho: Doug asked for some clarifications, specifically of my earlier comment, "What component led to the pilot being so variable across states and having, overall, a fairly modest effect?... In sum, I place much of the blame for the vagaries of the pilot results on the accountability models, not the growth models." By variability and vagaries, I was referring to the general findings from the Growth Model Pilot Program's Final Report (Hoffer, et al., 2011, cited earlier), to which Kevin and Craig also referred. The report showed that growth-to-proficiency models made little difference in most states and surprising differences in some. Kevin commented that this led him to conclude that the growth-to-proficiency models of the pilot are not so different from status models and are thus an insufficient growth framework for future policy.
I agree with Kevin, but I'd like to be more specific about what I don't like about the pilot models. Growth-to-proficiency or, more generally, growth-to-standard, is, as Craig just noted, a compelling rhetorical framework, and I don't think we need to throw it out, nor do I expect throwing it out to be possible. It is one of many ways to consider the adequacy of an individual growth trajectory. Now, if you make this a binary adequate/inadequate growth distinction, sum the students making Adequate Yearly Growth, and feed them into the bizarre flowchart that determines Adequate Yearly Progress (AYP) under NCLB, which, of course, varies considerably across states due to confidence intervals, multiyear averaging, safe-harbor calculations, etc., then you lose all the growth information that you once had. What was ineffective and variable about the pilot models wasn't growth-to-proficiency alone but the accountability calculations that followed. As we design new growth models, we obviously shouldn't force them into the AYP decision tree of the past.
I hope that is somewhat clearer.
As to your recent post, Doug, I don't want to derail us into a longstanding technical debate, but, as you know, there is disagreement about the accuracy of calling what you described "growth." I prefer to describe what you mention as a prediction model, not a growth model, because, really, that's what the model does: it predicts current scores using past scores and, sometimes, other variables as well (student demographics, their teacher, their school). If a student's actual score differs from his or her prediction, this is not growth but status beyond prediction, status beyond expectation, or, more technically, a residual. Damian Betebenner, one of the architects of the Colorado model, just clarified this in his presentation at CCSSO on Sunday, noting that Student Growth Percentiles are not precisely growth but a way of describing current status given past scores.
Is this merely a technical or semantic distinction? Sometimes. Let's say that I sell growth models on the fact that they will be useful to classroom instruction and student learning, and I use a prediction model. These prediction models can be used in two ways: 1) to describe whether a student is above or below where he is predicted to be, and 2) to describe where a student is likely to be in the future. This is certainly helpful information. But I would also like to have a description, tied to a content domain, of where my student was, where my student is, and what was likely to have been learned in between. This is what some are now calling a learning progression. I think that this latter conception is what most people think of when they hear "growth," and if my growth model doesn't in fact deliver that, I would want to make that clear.
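For readers who want to see the distinction in miniature, here is a sketch of what a prediction model computes (the scores are fabricated, and ordinary least squares stands in for the fancier models states actually use):

```python
import numpy as np

rng = np.random.default_rng(1)
prior = rng.normal(220, 25, 200)                      # last year's scores
current = 50 + 0.8 * prior + rng.normal(0, 10, 200)   # this year's scores

# Fit current ~ prior by ordinary least squares.
slope, intercept = np.polyfit(prior, current, 1)
predicted = intercept + slope * prior
residual = current - predicted   # status beyond prediction, NOT a gain

student = 0
print(f"prior={prior[student]:.0f}, current={current[student]:.0f}, "
      f"predicted={predicted[student]:.0f}, residual={residual[student]:+.1f}")
# Student Growth Percentiles are the percentile analog of this residual:
# where a student's current score falls among students with similar prior
# scores (estimated with quantile rather than mean regression).
```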
I know Doug knows this distinction, and I am certainly content to call prediction models "growth models" in most circles. But, as we design policy, I think that it will be important to be clear about what prediction models do and what growth models do. The vast contrasts between trajectory models and prediction/projection models are evidence of this (Ho, 2011, cited previously). If I tell a person on the street to draw what growth looks like, my money says they would draw an increasing trajectory over time, not a deviation from a prediction. The fact that prediction models do not explicitly model that trajectory is worth noting. For examples of what I think of as growth models, see John Willett's work. Of course, if we want to make future predictions or describe deviations from predictions, prediction models work great, and that may be exactly what works best for our policy goals! But that doesn't mean we should describe them in a way that will make a layperson think they are something that they are not. Wait, what am I saying. How else does anything get through Congress? I kid, I kid...
Doug Harris: OK, so I was thinking we might not want to delve into the growth vs. prediction distinction here, which is why I said parenthetically that “Technically speaking, the issue is a little more complicated” and left it at that. One reason I said that, as Andrew pointed out humorously at the end, is that such distinctions will be totally lost in practice. I think the bottom line is that all of these models are going to be used to measure performance even though none of them, technically speaking, can really claim to reflect the causal effect of a teacher. This is not as much a criticism of growth and value-added as it sounds because no other performance measure—especially what we do now in most classrooms—can claim to be clearly better.
For this reason, I worry about how the Colorado Growth Model is described. We cannot say that a measure is only meant to “describe” student achievement (or achievement growth) and not draw conclusions about performance, and at the same time say that such measures can be useful for school improvement decisions. The information is only useful for school improvement if it tells us something about whether certain teachers and/or programs caused more learning—and that, by definition, is no longer a description. I think the underlying pressure here is that people think we’ve gone too far with test-based accountability and are therefore seeking measures that don’t seem so harsh. “Value-added” doesn’t exactly give a warm and fuzzy feeling (some have told me I should stop using the term) and models that simply “describe” achievement are less threatening, especially when they are shown in nice visual graphs. But whether they really are less threatening depends not on the measure, but the accountability policy that uses it—the penalties, rewards, etc. that get attached to the performance measures. Saying a measure “describes” will be cold comfort when it’s used to fire a bunch of teachers. Conversely, there is no reason to think of value-added as threatening if there are no stakes attached.
I think the consortia are right to be ignoring or side-stepping issues of growth at this point. What gets tested gets done—that is, what’s in the standards and the tests will matter far more for the quality of instruction than how well they meet interval-scale and other assumptions related to growth. Focus on the content first and then on the assessments to measure that content.
The ambiguities that Craig points out in the Growth Model Pilot are interesting, though I still think the basic concept of growth-to-proficiency, while apparently intended as a compromise between proficiency and growth, is essentially proficiency in disguise. Status measures are inaccurate because they hold schools accountable substantially for the low achievement some students bring with them to the classroom on day one—outside the control of the educators in that school. Growth-to-proficiency, in its basic form, does the same thing by requiring schools with students who start far behind to get those students learning at a fast rate, even though the same factors that created their low achievement to start with are still affecting them outside the classroom. Real growth measures don't make that mistake.
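A toy calculation makes the point. Every number below is invented, and this is not how any state actually sets its targets; it just shows how the required growth rate scales with the starting deficit.

```python
# Hypothetical scale: proficiency cut, deadline, and a typical annual gain.
PROFICIENT_CUT = 500.0
YEARS_TO_DEADLINE = 3
TYPICAL_ANNUAL_GAIN = 25.0

def required_annual_gain(current_score):
    """Gain per year needed to reach the proficiency cut by the deadline."""
    return max(PROFICIENT_CUT - current_score, 0.0) / YEARS_TO_DEADLINE

for score in (475.0, 425.0, 350.0):
    need = required_annual_gain(score)
    print(f"start={score:.0f}: needs {need:.1f} pts/yr "
          f"({need / TYPICAL_ANNUAL_GAIN:.1f}x typical growth)")
# start=475: needs 8.3 pts/yr (0.3x typical growth)
# start=425: needs 25.0 pts/yr (1.0x typical growth)
# start=350: needs 50.0 pts/yr (2.0x typical growth)
```

The student who starts 150 points behind is asked to grow at twice the typical rate, and the school is judged against that target, for reasons that have nothing to do with what the school itself contributed.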
Again, I'm not saying we should ignore proficiency, but let’s be transparent about it. Let’s create the proficiency and growth measures separately and use them together in the ways we've talked about this week. Growth-to-proficiency is not the compromise it appears to be.
It looks like this may be the end of the road for our discussion. It was very nice talking with everyone and thanks to Kevin and Education Sector for hosting.
Craig Jerald: I’d like to express my gratitude to Kevin and his colleagues at Education Sector for hosting this discussion and thank my fellow discussants for their vigorous and thoughtful participation. It's been tremendously stimulating, and I’ve learned quite a lot. I’ll just conclude with some final thoughts.
Regarding Doug’s and Andrew’s recent exchange about how so-called “projection” or “prediction” models are different from other growth models: I’m hoping that distinction won’t be lost on policymakers as you both predict. Earlier this year, after I heard Andrew mention the controversy that erupted over the recently discarded Texas Projection Model, I spent a few hours pulling up news clippings about it. It’s a fascinating and informative case study of how the public can turn against policymakers if they are not clear about what a so-called “growth model” actually does and what it does not do. Even when critics conceded that the model was a pretty accurate predictor of students’ future performance, they still called for its discontinuance because it really wasn’t measuring “growth” at all. They had been sold on growth as a way of evaluating schools, they understood the concept of growth, and they felt betrayed that the model didn’t actually measure it—regardless of how accurate its predictions were. (If anyone out there is interested, I'd be happy to share the Word document where I dumped all of those clippings. Just e-mail me.)
Yet the predictive capability of such models might be valuable for other purposes. For example, when selecting measures for targeting intensive interventions to individual students, predictive accuracy becomes quite important because of the high cost of false negatives and false positives. Similarly, while the “student growth percentile” measure might not entirely comport with our intuitive understanding of “growth,” the individual student reports that Colorado has developed with Damien Betebenner seem incredibly useful for parents and teachers. Finally, the concept of “on track” is not going to go away. It’s deeply embedded in the DNA of the consortia’s plans for new common assessment systems because of the great emphasis the Department placed on it in framing the Race to the Top Assessment Program. So it will be useful to continue to think about how this loose collection of so-called “growth-to-standard” models might be deployed responsibly for certain purposes, even if I agree with all of you that such models are way too problematic to use as some kind of catch-all, single “compromise measure” to drive school-level accountability.
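For readers who haven't encountered the measure: the intuition behind a student growth percentile is a conditional rank. Betebenner's actual model uses quantile regression; the rough empirical sketch below, with invented data and an arbitrary peer band, is only meant to convey that intuition.

```python
import numpy as np

# Invented prior-year and current-year scores for 1,000 students.
rng = np.random.default_rng(0)
prior = rng.normal(500.0, 50.0, 1000)
current = 0.8 * prior + rng.normal(110.0, 30.0, 1000)

def rough_sgp(student_prior, student_current, band=10.0):
    """Percentile of the student's current score among academic peers,
    i.e., students whose prior scores were within `band` points."""
    peers = current[np.abs(prior - student_prior) < band]
    return 100.0 * np.mean(peers < student_current)

# e.g., a result near 60 would mean the student outgrew roughly 60% of
# the students who started from about the same place.
print(rough_sgp(450.0, 480.0))
```

That conditional framing is what makes the Colorado reports legible to parents and teachers: "your child scored higher this year than about X percent of students who started from the same place."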
Yesterday Representative Kline told Education Week that he plans for the House Committee on Education and the Workforce to turn to issues of teacher effectiveness in September and to issues of school accountability immediately after that. So as my final thought, I’d like to reiterate my hope that the quickly approaching legislative debate about school accountability can be reframed to focus much less on distributing “rewards and punishments” and much more on increasing every school’s capacity to build on strengths and address weaknesses, while protecting our most vulnerable students from attending schools where they cannot prepare for success in college and careers. I believe that would help everyone move beyond the endless argument about measures, none of which are perfect, and focus more attention on the critical but—as Kevin rightfully pointed out—too often neglected issues of interpreting and acting on information. I have seen sophisticated teacher evaluation systems that rely on virtually the same set of “multiple measures” (including student growth) yet provide teachers with vastly different levels of encouragement and support to continuously improve their practice and their students’ learning. I think that if we can reframe the goals for accountability, the conversation will really open up and a whole new set of possibilities will emerge.
Doug Harris: Since I'm the one mainly talking about this in terms of rewards and penalties, I just want to point out that I completely agree with Craig and Kevin that the measures can and should be useful for school improvement beyond that. I write a lot in my book about combining value-added with more formative assessments, for example. But that formative piece is harder for the feds to affect directly. The hard questions for ESEA are still going to be: What do we do with schools that are persistently low-performing (on whatever measure we choose)? And how can we reward those that perform well (something sorely missing from the current system)? If they can come up with good answers to those, then the incentives can provide the initial nudge for educators to use the measures in more formative ways—finding specific ways to get better. The incentive side and the formative/school improvement side are closely connected.
Craig Jerald: Doug, that's a helpful clarification. I should say that I'm not so naive to think that incentives won't be part of the conversation about accountability moving forward. I just hope that the other goals we've mentioned can be recognized as equally important because framing matters and ends up playing a huge role in decisions about what kinds of mechanisms you build into any system like this and the tradeoffs you have to make along the way. Where I do strongly agree with you is that there should be very tight alignment between the incentives side and the formative/improvement side of any accountability system, and the next version of ESEA should strongly encourage such coherence.
I guess that leads me to one last observation, which is that although some of us disagreed here and there on certain issues during this week's discussion, I'm surprised by how much we ended up agreeing on some pretty fundamental ones. And that gives me a great deal of hope.
Doug Harris: Yes, I think there is a lot of agreement on how to move forward—among us and beyond. That's why it's so sad that reauthorization is moving so slowly.
Kevin Carey: This has been a terrific discussion. On behalf of Education Sector, thanks to all of you for your careful, detailed thoughts and observations. I'm sorry I wasn't able to contribute more to the discussion over the last couple of days but I did want to make one (unfortunately, somewhat pessimistic) observation about "college and career readiness."
Speaking as someone who spends the majority of his time swimming in the postsecondary end of the education pool, I'm gratified to see a consensus build around college and career readiness as the conceptual "anchor" of our standards and accountability goals. Practically speaking, the discussion has mostly been about "college ready," which is not surprising given that the large majority of high school graduates go to college, and among those who don't, some enter jobs for which academic preparation is less important. Education Sector released a report last year about using evidence of student progress (or lack thereof) in college to improve high school accountability and we're currently working with some folks in California to similar ends. So we're definitely on board with the concept.
BUT, I don't think the K–12 community properly appreciates how slippery a concept "college ready" really is. The most common definition seems to be "not forced to take remedial classes." I think this is one level too low: college readiness doesn't just mean ready to take college-level courses for credit, it means ready to succeed in college-level courses for credit. But that aside, the huge challenge here is that colleges have vastly different academic standards—almost as different as they could possibly be. You see this first in the remediation process, where "cut scores" on standardized assessments like Accuplacer and Compass range all over the map, for seemingly arbitrary reasons. But it really shines through when you examine broader measures of students themselves.
Take the Collegiate Learning Assessment, the instrument used in Richard Arum and Josipa Roksa's much-discussed "Academically Adrift" report published earlier this year, which found alarmingly little student learning in college. The CLA is administered to freshmen and seniors at hundreds of colleges and universities. The results show very clearly that the average freshman at some more selective colleges—not Ivy League schools, but public universities closer to the R1 level—has greater facility in critical thinking, analytic reasoning, and communication than the average graduating senior at other colleges. Some schools are granting bachelor's degrees to students who couldn't measure up to freshmen at other schools. The idea that higher education writ large maintains some kind of collective academic standards via accreditation or any other process is demonstrably false. It doesn't. Indeed, standards often vary tremendously inside of colleges, among schools, departments, courses, and even sections within courses. It is a vast, unruly, and highly decentralized system that has never grappled with the task of defining academic standards in any kind of rational or collective way.
So if we're going to tether our K–12 accountability regime to "college ready" standards, as it appears we are, it's going to be very important to decide which college ready standards we're talking about. And that in turn will require a level of engagement with colleges themselves that so far hasn't occurred.
Andrew Ho: I've really enjoyed this conversation as well. Thank you for asking me to participate, and thank you all for your ideas, which have pushed my thinking in productive ways. I'll just make one quick closing rejoinder. Craig and Doug, I hope my little wisecrack at the end of my last post didn't undermine what I think is a very serious distinction between prediction and growth. And Craig, I don't predict that it will be lost on policymakers... if I (and we) have anything to say about it! My recent papers and presentations have been trying to make these distinctions as vivid and consequential as possible. And Doug, I agree about incentives: that's been precisely the framework that's proven most helpful, and it's one I think all of us have appreciated through this discussion. I don't think I'm as good at speaking to policymakers and practitioners as I need to be, but conversations like these really help, and I've appreciated the opportunity. Thanks again to Kevin, Renée, and their Education Sector colleagues, and thanks to you all for reading. I look forward to future conversations and collaborations.