Math Nerd

Closing the Teach For America Blogging Gap
Sep 14 2012

More about Value-Added Measurements

A response to Gary Rubinstein’s post on value-added.

* * *

Gary,

I can see why you would be worried that policymakers and the media have overstated the usefulness of one-year VAM scores in teacher evaluations, but it’s still not clear to me why you don’t find the Chetty results, which suggest that accurate VAM is possible, compelling in and of themselves.

The study (available here) shows a pretty tight correlation between teacher value-added and actual student test scores (see Figure 1a).  There are also positive correlations between teacher value-added and longer-term outcomes (college attendance at age 20, college quality at age 20, earnings at age 28, teenage births, and neighborhood quality), although these correlations are not as strong (see Figures 5a, 5b, 6, 8a, and 8b).  I don’t see any correlation coefficients published, and I don’t have the statistical chops to analyze the data myself, so I’m just looking at the graphs; let me know if there’s something wrong with my conclusions.

It makes sense that the long-term outcomes are less tightly correlated; it’s unrealistic to expect any one teacher to completely change all of his or her students’ lives forever.  The Counterpunch article you link makes it seem like the study’s conclusions are bogus because there is no statistically significant link between a teacher’s value-added and a student’s income at age 30.  But the signal between these two things is going to be very, very weak – you’re going to need a lot of observations to be confident it’s there.  There were 368,427 observations at age 28 and only 61,639 at age 30.  That’s still a large number, but it’s a big decrease.  According to the authors, the effect of teacher value-added on earnings is in fact higher at age 30 than at age 28 – “The correlation between test scores and earnings is roughly 20% higher at age 30 than at age 28” – but the standard error is also much greater, partly because the sample shrank so much.

In my opinion, the Chetty study convincingly demonstrates the validity of value-added measurements.  Even Diane Ravitch initially wrote, “[the] problems of the study are not technical, but educational” (though she has since changed her mind). Matthew DiCarlo of the AFT wrote, “[there] is some strong, useful signal there”. Like commenter Paul Bruno, I haven’t seen a convincing scholarly repudiation of the study itself.

Now, you can definitely argue that value-added measurements are being incorrectly used in teacher evaluations, although I wouldn’t necessarily agree with you.  The Chetty study had a mean of 8.08 years of data per teacher (although the standard deviation is almost as high, 7.72 years), and the researchers generated each teacher’s value-added score by combining all those years together.  As you saw in your analysis of the NYC data, there can be HUGE fluctuations in a teacher’s students’ test scores from year to year and even from classroom to classroom. However, Chetty demonstrates pretty convincingly that value-added is not ‘completely inaccurate’. It’s just wildly imprecise. So how many years of data do you need to be able to make a confident assessment of a teacher’s value-added? I don’t know, and it’s going to vary from teacher to teacher, but we shouldn’t discard the measure just because it’s not perfect.
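The value of pooling years can be sketched with a toy simulation (all effect sizes and noise levels here are invented, not estimates from the study): averaging k noisy single-year scores shrinks the spread of the estimate by roughly √k.

```python
import random
import statistics

random.seed(0)

# Toy model: each teacher has a fixed "true" value-added, but any single
# year's estimate adds classroom-level noise comparable to the effect itself.
TRUE_EFFECT = 0.10
YEARLY_NOISE_SD = 0.20

def multi_year_estimate(n_years):
    """Average n_years of noisy single-year VAM scores for one teacher."""
    yearly = [random.gauss(TRUE_EFFECT, YEARLY_NOISE_SD) for _ in range(n_years)]
    return statistics.mean(yearly)

# Spread of the estimate across many simulated teachers, by years of data.
spread = {}
for n_years in (1, 4, 8):
    estimates = [multi_year_estimate(n_years) for _ in range(10_000)]
    spread[n_years] = statistics.stdev(estimates)
    print(f"{n_years} year(s): sd of estimate = {spread[n_years]:.3f}")
```

With one year the estimate is dominated by noise; with eight years of data the spread is cut roughly in third, which is consistent with single-year scores fluctuating wildly while the multi-year average still carries signal.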

I do think that we should use value-added, perhaps weighting it less at the beginning of a teacher’s career and more further down the line as there is more information to draw from.  I’m not aware of anyone who wants to hire and fire teachers based on test scores alone.  In Chicago, for example, the current proposal is that standardized tests be worth 15% of classroom teachers’ scores (see p. 46 of the proposal) and be thrown out if their confidence interval is too large (p. 38).  A teacher with the minimum VAM score could still be rated ‘Excellent’ if he or she did well on all of the other measures.
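A back-of-the-envelope sketch of how a 15% weight plays out (the point scale and the “perfect on everything else” scenario are invented for illustration; this is not CPS’s actual rubric):

```python
# Hypothetical composite score: VAM counts for 15%, everything else for 85%.
def composite(vam_pct, other_pct, vam_weight=0.15):
    """Weighted average of a VAM score and all other evaluation measures,
    each expressed on a 0-100 scale."""
    return vam_weight * vam_pct + (1 - vam_weight) * other_pct

# Worst possible VAM, perfect marks on all other measures.
score = composite(vam_pct=0.0, other_pct=100.0)
print(f"composite: {score:.1f}")  # 85.0 out of 100
```

So even a teacher who bottoms out on the test-score component still lands at 85 out of 100 overall, which is why the minimum-VAM teacher could still be rated ‘Excellent’.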

I know that CPS wants the weighting of VAM to increase over time, and I can see how you would be worried about that, too.  I can see how teachers, worried about their evaluations, could narrowly focus on test-prep.  I think it’s an incredibly misguided strategy, but that’s beside the point – it will happen anyway.  I think the best way to combat that is with (1) high-quality coaching to help teachers increase their VAM by employing better teaching techniques, and (2) frequent evaluations by trained observers.  If every observer who comes in says, “Teacher X gave her kids a test-prep packet and told them to spend the period working on it”, then red flags should appear elsewhere in that teacher’s evaluation, providing a disincentive to go test-prep crazy.  I don’t think it makes sense to say, “tying VAM to evaluations will completely distort VAM measurements, so we shouldn’t do it at all.”

Now, I’ve said all this, but I realize we may still have a fundamental disagreement.  You say that value-added is not ‘student learning’, ‘student achievement’, or ‘student growth’.  Does that mean you think standardized tests are completely worthless as proxies for these things?  If so, how do you think we should measure them? Or do you not think we should try to measure them at all? If that’s the case, how should teachers figure out where they’re going wrong and get better?

Thanks for your post.  I enjoyed reading it and taking the opportunity to learn more about VAM.

7 Responses

  1. meghank

    Here Gary responds to corporate reformers who claim that teachers whose evaluations do not include test scores are not being evaluated on “student learning”:

    http://garyrubinstein.teachforus.org/2012/08/12/thatll-learn-em/

    Also, you are incorrect in stating that Diane Ravitch agrees that value-added is valid. You took her quote out of context. Just recently, she posted a statement on her blog about the invalidity of value-added.

    • Katrina

      meghank,

      I don’t think Gary’s article disproves the usefulness of standardized tests. His article centers around the example of a student in a juggling class. The goal of the class is for the student to be able to juggle three balls. The student starts with just one ball and practices foundational skills (like tossing that ball from one hand to the other). At the end of the course is a state test requiring the student to juggle all three balls. The student can juggle them for 5 seconds. However, the value-added model predicted he should have been able to juggle them for 30 seconds. His teacher is therefore ranked poorly on her evaluation.

      Gary alleges that this is unfair because it fails to take into account all the foundational skills the student did learn and because the student could be very close to that moment where it all ‘clicks’ and he is able to juggle successfully.

I agree that a standardized test with just one type of question is not a very good one. It’s not very helpful to give kids a binary pass/fail assessment – there’s not a lot of information there. Ideally, the test would have easier questions (like “throw one ball back and forth” and “juggle two balls”) to see how well the student mastered the foundational skills. It would also have harder questions (like “juggle four balls”) to see if the student had reached beyond the class’s basic objective.

      But just because the test isn’t perfect doesn’t mean that it doesn’t provide any information about student learning. The goal of the class was to learn to juggle three balls, and the test provides information about whether the student mastered that skill. That means it provides information about student learning in that class. We wish it provided more, but this is still better than nothing.

      To extend this analogy and talk more about VAM, pretend the student – let’s call him Joe – is at a circus school. For simplicity, pretend all of the other students are exactly like Joe in terms of demographics, prior performance, etc. Pretend there are ten three-ball juggling teachers at the circus school. The school has kept extensive records over time, and has found that the average student can juggle three balls for 30 seconds at the end of the three-ball course (this is the situation Gary implies in his scenario).

It might be that Joe is the only student in his class who can only juggle for 5 seconds; all the others are around the 30-second average. If this were the case, his teacher’s value-added score wouldn’t be affected much. But what if his entire class could only juggle for 5 seconds, when the average student in the other classes could juggle for 30? This should be a red flag for Joe’s teacher that something is wrong. She might say, “Huh. I’m working my butt off here to teach these kids to juggle, but somehow the other nine teachers in my school are getting more of their kids to ‘get it’. Their kids are exactly the same as mine. Why are they getting different results? What are they doing differently? I should go watch their classes to see what they do.” This is why standardized tests are useful.
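That comparison is essentially all a simple value-added calculation does. Here’s a toy sketch of the circus-school scenario (the class rosters and scores are invented to match the story, not real data):

```python
# Toy value-added: the gap between a class's average outcome and the
# outcome predicted for demographically similar students school-wide.
expected_seconds = 30.0  # average juggling time for students like Joe

classes = {
    "Joe's teacher": [5, 6, 4, 5, 5],       # whole class far below average
    "Teacher B":     [29, 31, 30, 28, 32],  # right at the school average
}

value_added = {}
for teacher, scores in classes.items():
    avg = sum(scores) / len(scores)
    value_added[teacher] = avg - expected_seconds
    print(f"{teacher}: class avg {avg:.1f}s, value-added {value_added[teacher]:+.1f}s")
```

One student at 5 seconds barely moves the class average; an entire class at 5 seconds produces a large negative gap, which is exactly the red flag described above.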

      * * *

      As to your allegation that I took Diane Ravitch out of context, I must respectfully disagree, although I thank you for pointing out that her views have changed since writing the article I linked to – I’ve amended my post. Here is a more complete version of the quote I included:

      “Here are some obvious conclusions from the study: Teachers are really important. They make a lasting difference in the lives of their students. Some teachers are better than other teachers. Some are better at raising students’ test scores.

      The problems of the study are not technical, but educational.

The Chetty-Friedman-Rockoff analysis points us to an education system in which tests become even more consequential than they are now. Teachers would work in school systems with no job protection, and their jobs would depend on the rise or fall of their students’ test scores.”

      In the article I linked (dated 1/17/12, shortly after the study came out) Ravitch did not contest the validity of the Chetty analysis; she only said that implementing its method to evaluate teachers would lead to a system narrowly focused on raising test scores. She may have changed her mind since then, but it really is what she said at the time.

      Thanks for commenting!

      • meghank

        I appreciate that you amended your post so as not to misrepresent Diane Ravitch’s views.

        I simply linked Gary’s post in answer to your question to him about student learning, as it seems unnecessary that he should have to say again in a comment what he already said clearly in a very good blog post.

  2. yoteach

    This was a great summary of the value-added debate, thanks! I share your optimism: while VAM is imperfect, the strategies you outlined seem like a reasonable direction for improving it. And clearly, all sides of the debate are using standardized test scores to advance their particular argument, so it is a bit silly for one side to argue that these tests don’t measure anything. Yes, Ravitch uses one particular test that doesn’t have incentives tied to it (as well as PISA), but that implies that tests, utilized appropriately, can measure something substantive. I work at a school now where “independent reading” is something fourth graders only get to do when they finish an assignment early (if you buy a class library, of course). It’s sickening, but I blame short-sighted administration more than I do the existence of standardized tests. Thanks for this!

    • Katrina

      Thanks! I agree that this debate gets kind of silly sometimes. It’s really sad that your administration doesn’t value independent reading. Most teachers I know who work in a struggling school talk about how incompetent their leaders are. I really hope we can find ways to improve administrators’ effectiveness. It’s a critical piece of the puzzle, and one teachers are rightly frustrated about.

    • meghank

      The test appears to be my principal’s reason for living. In today’s environment, many administrators, like mine, live and die by those tests.

      What I’m saying is that while administrators often make bad decisions, the reason we no longer have many individuals with empathy for students in administrator positions is that such principals were driven away by the testing-obsessed culture of our schools. The administrators who remain often make bad decisions, although every decision they make, to their minds, promotes higher test scores.

