While I was reading about VAM last week, I came across Jesse Rothstein’s harsh criticism of the MET Project’s *Learning About Teaching*. It was the first academic criticism I had seen of a VAM study, so I decided to dig into both the original MET report and his review.

# The MET Paper: Background

The MET Project is a Gates Foundation-sponsored initiative to find effective ways to evaluate teachers. The project’s participants are looking at value-added scores, student surveys, tests of teachers’ pedagogical content knowledge, and classroom observations as potential ways to measure teacher effectiveness. They are planning to issue a final report sometime in 2013. In addition to publishing the *Learning About Teaching* paper, MET published a paper on observation systems in January 2012.

*Learning About Teaching* came out in December of 2010 and claimed to validate the use of value-added modeling and student perception surveys to evaluate teachers. I remember reading a favorable review of the study in the New York Times when it first came out; the LA Times also published a positive article.

The MET paper highlighted four findings:

- In every grade and subject, a teacher’s past track record of value-added is among the strongest predictors of their students’ achievement gains in other classes and academic years. A teacher’s value-added fluctuates from year-to-year and from class-to-class, as succeeding cohorts of students move through their classrooms. However, that volatility is not so large as to undercut the usefulness of value-added as an indicator (imperfect, but still informative) of future performance.
- Teachers with high value-added on state tests tend to promote deeper conceptual understanding as well.
- Teachers have larger effects on math achievement than on achievement in reading or English Language Arts, at least as measured on state assessments.
- Student perceptions of a given teacher’s strengths and weaknesses are consistent across the different groups of students they teach. Moreover, students seem to know effective teaching when they experience it: student perceptions in one class are related to the achievement gains in other classes taught by the same teacher. Most important are students’ perception of a teacher’s ability to control a classroom and to challenge students with rigorous work.

# The MET Paper: Criticism

Unfortunately, the MET authors were so excited about value-added that they completely bungled their analysis. I agree with most of Rothstein’s criticisms, which include:

**“[T]here are troubling indications that the Project’s conclusions were predetermined”**

On the whole, the report reads more like an advertisement for value-added than a serious scholarly paper (it wasn’t published in a scholarly journal, so maybe the authors didn’t see this as a problem). It seems like the researchers used their data when it was convenient to support their points instead of drawing their conclusions from their data. For example, the recommendations in their Conclusions section are good ideas, but have nothing to do with the study. This is sad, because the researchers collected some interesting data.

**“The report’s main conclusion [#1 above] is not supported by plausible analysis”**

The only two variables the MET researchers evaluated were value-added and student perceptions. They have no basis for saying “a teacher’s past record of value-added is among the strongest predictors of their students’ achievement gains” because they did not examine other potential predictors.

**The MET researchers incorrectly equated correlation with causation**

Ah, Statistics 101. Throughout their paper, they use the phrase “teacher effectiveness” to describe the relationship of predicted to actual student scores, implying that the teacher’s ability (or lack thereof) is what caused the difference in their students’ performance. But this was an observational study, and observational studies cannot verify causation. The fact that the next phase of the MET study involved random assignment of students to classrooms implies that the researchers were aware of this limitation, but they decided to use “teacher effectiveness” anyway. In this post, I’ll use “enhanced value-added” or “EVA”. The “enhanced” refers to the fact that the MET project’s evaluation model includes student survey data in addition to traditional value-added on state tests.

**“The report focuses on the new MET Project data and does not review the literature.”**

This may seem like scholarly nit-picking, but it’s important. If the MET researchers had engaged with the prior literature on VAM, they could have discussed and addressed others’ concerns about using VAM to evaluate teachers. Chief among these is Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor”. *Learning About Teaching* asserts that state test scores are the sole best measure of a teacher’s ability, and Rothstein fears policymakers will take this to narrow schools’ focus onto test scores even more. But the larger goal of the MET Project is to assess *multiple* measures of teacher effectiveness. If the MET researchers had talked about Campbell’s Law (and other concerns about high-stakes standardized tests) they would have placed the paper’s results in context more effectively.

**The data on predicting a teacher’s value-added “are presented in an unusual way that makes it difficult to assess the strength of the predictions”.**

This is a big one. Basically, the MET researchers had a ton of interesting data, but they didn’t bother to include a meaningful analysis of it in their report. The prediction information is key to assessing the usefulness of the value-added and student survey data MET gathered. MET estimated a teacher’s EVA based on her value-added and student survey scores, then compared that evaluation with her EVA in another class section (Table 9) or school year (Table 10). A low correlation would imply that the MET evaluation couldn’t accurately measure EVA, since it should be relatively stable between groups of students.

Unfortunately, they didn’t bother to publish correlation information, instead publishing the difference in mean EVA of teachers who, with a different group of students, scored in the top and bottom EVA quartiles. All this shows is that, in general, a class with a high-EVA teacher will score better on the state test and in student surveys than a similar class with a low-EVA teacher.

This is important because it suggests that teachers vary (in their abilities or in other unobserved ways), which means that we shouldn’t treat them all exactly the same way as employees. That’s pretty much what we do now, and the report offers a fairly convincing argument for changing that by showing that the students of top-quartile-EVA teachers can expect to learn about 3 months’ more material in ELA and 7.5 months’ more material in math than similar students with bottom-quartile-EVA teachers. Those are big differences, and they lend urgency to efforts to find, learn from, and reward the most successful teachers and support (and eventually, if they don’t improve, dismiss) the least successful ones.
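A gap between quartile means and a correlation answer different questions, and a sizeable gap can sit alongside a weak correlation. The sketch below is only an illustration, not the MET analysis: it assumes a teacher's scores in two classes are standard bivariate normal with correlation `r`, and the function name and the `r` values tried are made up for the example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def top_bottom_quartile_gap(r: float) -> float:
    """Expected gap, in standard deviations, between the second-class scores
    of teachers ranked in the top and bottom quartiles by their first class.

    Assumes both scores are standard bivariate normal with correlation r.
    """
    z75 = STD_NORMAL.inv_cdf(0.75)
    # Mean first-class score inside the top quartile (truncated normal mean).
    top_quartile_mean = STD_NORMAL.pdf(z75) / 0.25
    # Regression to the mean shrinks that by a factor of r in the second
    # class; the bottom quartile mirrors it, hence the factor of 2.
    return 2 * r * top_quartile_mean


# Illustrative correlations only; these are not values from the MET report.
for r in (0.2, 0.3, 0.5):
    print(f"r = {r:.1f}: gap = {top_bottom_quartile_gap(r):.2f} SD")
```

Under these assumptions, even r = 0.3 produces a gap of roughly three-quarters of a standard deviation between the two groups, so a large difference in group averages does not by itself show that any one teacher is measured precisely.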

But the project’s analyses don’t tell us if we can accurately assess individual teachers, which is crucial if we expect to include EVA in teacher evaluations. We’ll talk more about that in a minute. First, we need to talk about state tests.

**“Value-added for state assessments is correlated 0.5 or less with that for the alternative assessments, meaning that many teachers whose value-added for one test is low are in fact quite effective when judged by the other.”**

The claims above only relate to the students’ performance on state tests – and the MET report’s own data suggests that state tests are insufficient as a sole metric of student success. If you’re a teacher, you probably just said, “duh… why did they need a study to prove that?” Well, now you have data to back you up!

The researchers administered alternative, more conceptual assessments – for math, the Balanced Assessment in Math, and for ELA, the Stanford 9 OE – to determine whether strong student performance on the state test was related to strong conceptual understanding. In math, the correlation between a teacher’s value-added on each of the two tests was 0.54; in ELA, it was 0.37, or 0.59 if some potentially problematic data were excluded. This means that, in math, a teacher’s value-added on one test accounts for only about 29% (0.54² ≈ 0.29) of the variation in value-added on the other. If the state test were a perfect measure of everything that students needed to know, the correlation would be close to 1 – students who scored well on the state test would also score well on the conceptual test, and vice-versa.

The MET authors comment that these correlations are “moderately large” and imply that state tests are a decent all-around measure of student learning. Rothstein and I both disagree; we think the correlations would be higher if the state tests thoroughly measured student understanding.
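A correlation is easier to judge once it is squared: r² is the share of the variation in one measure that is accounted for by the other. Here is a quick check using the three correlations quoted above; the helper name is just for illustration.

```python
# Correlations between a teacher's value-added on the state test and on the
# alternative assessment, as quoted in the text above.
correlations = {
    "math": 0.54,
    "ELA (all data)": 0.37,
    "ELA (problematic data excluded)": 0.59,
}


def shared_variance(r: float) -> float:
    """Fraction of variance in one measure accounted for by the other."""
    return r ** 2


for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}, r^2 = {shared_variance(r):.2f}")
```

In every case well under half of the variation in value-added on one test is shared with the other, which is why “moderately large” feels generous.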

Rothstein uses these not-great correlations to lambaste the idea of using value-added at all (“the correlations between value-added scores on state and alternative assessments are so small that they cast serious doubt on the entire value-added enterprise.”). Personally, I think this is a far too narrow reading of the issue. Yes, they cast serious doubt on the efficacy of using state assessments as the dominant tool for evaluating teacher performance. But they don’t mean we should completely give up on value-added modeling. Easily-testable knowledge is one part of student learning, so it should have a place as one component of our model. The more measurements a model incorporates, the clearer a picture it gives.

The MET researchers knew this, and they enhanced the regular state-test-only value-added model by adding student perception data. What I really would like to see is a model that combines a teacher’s value-added on the state test with his value-added on the alternative assessment. Which teachers had students that did very well (or very poorly) on *both* tests? Knowing that would give us a better idea of who the really good – and not so good – teachers are.

# Using the MET Data to Evaluate Individual Teachers

One of the reasons value-added draws so much ire is that its results vary so wildly that they feel meaningless when they’re good and unfair when they’re bad. Rothstein cites the example of a teacher in the MET study whose predicted value-added for state ELA scores is in the 25th percentile: this teacher “is actually more likely to fall in the top half of the distribution than in the bottom quarter!”

The MET Project used data from teachers whose EVA was measured with more than one group of students to get an idea of how good each EVA measurement was. If the measurements were perfect – and therefore represented a teacher’s true EVA – each teacher’s EVA would be exactly the same in each of his or her different classes.

We can visualize this data by grouping teacher EVA in the teacher’s first class by quartile, then comparing EVA in a second class to the number of teachers with that EVA in each group. The MET’s Figure 1 is a version of this, although it leaves out the middle two quartiles.

This is an example of what such a chart would look like if the EVA measurements were perfect:

In the world depicted in the chart, if you know a teacher’s EVA in one class, you can say with pretty much 100% certainty whether the teacher’s EVA will be in the top, middle, or bottom quartile in any other class of students he might be assigned to. This is because there is no overlap between any of the groups. If a teacher’s EVA is +0.2, you can look on the chart and see that every single teacher with that EVA scored in the uppermost quartile in his or her other class. So you can confidently predict that this teacher, too, would fall in the top quartile if he taught another class.

On the other hand, if EVA measurements were completely invalid, the chart would look like this:

In the world depicted in this chart, knowing a teacher’s EVA in one class doesn’t help you predict what that teacher’s EVA would be in a different class. It doesn’t mean that there aren’t some high-EVA teachers and some low-EVA teachers, but it means our measurements are so poor that we can’t tell who is who. This is because all of the curves for the different quartile groups lie on top of each other – they overlap.

To use statistical terms, the correlation between the two datasets in the first graph is 1. In the second graph, it’s 0. Rothstein estimated that the MET project’s data showed a correlation between teacher EVA in one class and value-added on the state math test in another class of 0.327 – not particularly high. We can use a quick and dirty estimation (assuming normal distributions) to visualize this. Here’s what it would look like:

As you can see, it’s pretty hard to predict which quartile a teacher’s VA on the state math test would fall into based on her VA in her first class.

The chart below shows the probability that a teacher whose EVA is in a given percentile would fall in a particular quartile of VA on the state math test with a different section of students:

The height of each band at a given point is the probability that a teacher whose EVA score fell in that percentile would have a VA in the top, middle 2, or bottom quartiles if she taught another class. For example, a teacher with a score in the 20th percentile would have about a 33% chance of falling in the bottom 25%, a 50% chance of falling in the middle 50%, and a 17% chance of falling in the upper 25% of value-added scores on the state math test.
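For anyone who wants to try other percentiles or correlations, here is a minimal Python sketch of this kind of estimate. It is not the R code behind the charts; it assumes the two scores are standard bivariate normal with correlation `r`, and the function name is made up for the example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def quartile_probabilities(percentile: float, r: float) -> dict:
    """Chance that a teacher at `percentile` in one class lands in the
    bottom quartile, middle two quartiles, or top quartile in another class.

    Assumes both scores are standard bivariate normal with correlation r.
    """
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")

    z = STD_NORMAL.inv_cdf(percentile / 100)
    # Given the first score z, the second score is normal with mean r * z
    # and standard deviation sqrt(1 - r^2).
    second_score = NormalDist(mu=r * z, sigma=(1 - r ** 2) ** 0.5)

    bottom = second_score.cdf(STD_NORMAL.inv_cdf(0.25))
    top = 1 - second_score.cdf(STD_NORMAL.inv_cdf(0.75))
    return {"bottom": bottom, "middle": 1 - bottom - top, "top": top}


probs = quartile_probabilities(percentile=20, r=0.327)
for band, p in probs.items():
    print(f"{band}: {p:.0%}")
```

With `percentile=20` and `r=0.327` this lands within a point or two of the figures quoted above; small differences are expected because it is a reconstruction under the normality assumption rather than the original calculation. With `r=0` it returns 25%, 50%, and 25% for every percentile, which is the “completely invalid” world described earlier.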

This is what Rothstein was referring to in his analysis when he implied that value-added modeling was worthless. If a teacher who is evaluated to be in only the 20th percentile of “teaching ability” actually has a 17% chance of scoring in the 75th-100th percentiles with a different class, then it seems like giving her a score of 20/100 is unfair.

I agree with him that the data presented in the MET report does not provide a meaningful enough picture to accurately measure teacher EVA. But, as I mentioned above, we could improve the model by adding more information. The MET’s EVA model included two pieces of information: value-added on the state test and student perception feedback.

A better model would have more inputs. For example:

The MET researchers are studying the inputs shown in yellow above. *Learning About Teaching* may have focused narrowly on state test scores and student surveys, but the overall goal of the MET Project is to identify which inputs are most useful in evaluating teacher EVA. Its director, Thomas Kane, published a comprehensive explanation of its purpose in the Fall 2012 edition of *Education Next*.

Incorporating more inputs has two advantages: one, it reduces the effects of any measurement errors (since a variety of measurements are used to calculate the score), and two, it minimizes the incentive to narrowly obsess over any one measurement (like state test scores). With fewer measurement errors, the correlation between measurements of a teacher’s EVA in two different classes would go up, making it easier to predict teacher EVA in a different section. For example, here are sample probability curves for r=0.7:

Now, a teacher whose EVA in one class is in the 20th percentile has a 43% chance of falling in the bottom 25%, a 56% chance of falling in the middle 50%, and barely a 1% chance of falling in the upper 25% of EVA in another class. This is pretty good information – now we know that it’s unlikely she’s among the highest-EVA teachers, and there’s a relatively large chance that she’s among the lowest.
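How many inputs might it take to get from 0.327 to something like 0.7? One rough way to think about it is the Spearman-Brown formula from classical test theory, which says that averaging k comparable measures raises reliability from r to kr / (1 + (k − 1)r). This is only a back-of-the-envelope sketch: it assumes the measures are equally reliable with independent errors, which real inputs like surveys and observations won't be, and it treats 0.327 as the reliability of a single measure. The function names are made up for the example.

```python
def composite_reliability(r_single: float, k: int) -> float:
    """Spearman-Brown prediction for the reliability of an average of k
    equally reliable measures whose errors are independent."""
    return k * r_single / (1 + (k - 1) * r_single)


def measures_needed(r_single: float, r_target: float, max_k: int = 100):
    """Smallest k whose predicted reliability reaches r_target, or None
    if it is not reached within max_k measures."""
    for k in range(1, max_k + 1):
        if composite_reliability(r_single, k) >= r_target:
            return k
    return None


for k in range(1, 7):
    print(f"{k} measure(s): predicted r = {composite_reliability(0.327, k):.2f}")

print("measures needed for r >= 0.7:", measures_needed(0.327, 0.7))
```

Under those generous assumptions it takes about five comparable measures to reach 0.7, which gives a feel for why a two-input model struggles.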

A teacher looking to maximize his EVA score would have to ensure his students did well on both state and conceptual assessments, earn the praise of school leaders, raters, students, and parents, and find ways to increase his students’ health, happiness, and character. In other words, he would have to be a great teacher!

Incorporating multiple measures of success in an EVA model would create the “balanced set of measures that are relatively unsusceptible to manipulation and gaming” that Rothstein himself advocates. We don’t have tools to measure all these areas right now, but rather than argue over the validity of measurement in general, we should set to work creating them. As has been noted in other fields, we value what we measure at the expense of what we do not. Businesses that measure only profits will be inclined to exploit their employees and destroy the environment. Professors who are chosen for tenure based on their research results will rarely be excellent lecturers. Countries that measure their success in terms of GDP growth will become more materialistic (did you know that Bhutan measures Gross National Happiness instead?).

In education, leaders cling to standardized test scores not because they’re the best measurement, but because they’re the only measurement. If you’re a teacher and you rail against additional tests and unannounced observations, you’re denying yourself the opportunity to be evaluated more fairly.

# In Summary

- The MET researchers did a crummy job of analyzing their data in *Learning About Teaching*.
- Rothstein offers a thorough, valid critique of the mistakes that they made. However, he overreaches in suggesting that all value-added modeling is a bad idea.
- **Quantitative modeling of teacher performance is possible, but it should incorporate measurements beyond state test scores and student feedback surveys. This has two advantages: one, it makes teachers’ evaluation scores more accurate, and two, it encourages teachers to improve all areas of their practice rather than narrowly focusing on state tests.**

*Note: I used R to generate the graphs in this post; code is on github.*