Study details how to evaluate teachers

Turns out Colorado’s law mandating that half a teacher’s evaluation be based on student growth falls in the sweet spot of what works when crafting a balanced approach to assessing a teacher’s ability, according to research released Tuesday.

The Measures of Effective Teaching (MET) project, a three-year study designed to determine how to best identify and promote great teaching and funded by the Bill & Melinda Gates Foundation, found that student achievement gains (measured by standardized tests and other measures) should account for either 33 or 50 percent of a teacher’s evaluation.

“Our data suggest that assigning 50 percent or 33 percent of the weight to state test results maintains considerable predictive power, increases reliability and potentially avoids the unintended negative consequences from assigning too-heavy weights to a single measure,” the much-awaited MET study found.

But the Gates Foundation doesn’t weigh in on a single way to conduct evaluations.

“We are not trying to make local decisions or solve local political arguments,” said Vicki Phillips, director of education for the College Ready U.S. Program at the Bill & Melinda Gates Foundation, in a conference call with reporters Tuesday. “You can clearly see different ways you could add these things up. You can see the trade-offs you would make.”

Colorado Senate Bill 10-191 requires half a teacher’s evaluation to be based on multiple measures of student academic growth. The law goes fully into effect in 2014 and will have ramifications for a teacher’s tenure status. Many districts are already piloting innovative teacher evaluation systems that stretch beyond the old employee reviews that emphasized seniority and educational attainment above all else.

Multiple measures key

The MET study, of which Denver Public Schools was a part, also emphasizes the importance of a balanced evaluation that takes classroom observations and student surveys into serious consideration.

“Looking at multiple measures of achievement and a balanced approach seems to be most effective,” Phillips said, noting that any evaluation system must also include resources to help teachers improve.

“Teachers get wary when it doesn’t lead them to improvement,” she said. “Teachers want feedback that leads to great development.”

While no high-stakes teacher evaluation system is perfect, MET researchers pointed out that the thoughtful measures and approaches detailed in the MET study are better than anything on the books now.

Researchers noted that “the combined measure is better on virtually every dimension than the measures in use now. There is no way to avoid the stakes attached to every hiring, retention, and pay decision. And deciding not to make a change is, after all, a decision. No measure is perfect, but better information should support better decisions.”

Denver Superintendent Tom Boasberg said the MET project’s findings offer insights that can be put to immediate use in classrooms and form “a roadmap that districts can follow.”

“Great teaching is the most important in-school factor in determining student achievement,” Boasberg said in a statement. “It is critical that we provide our teachers with the feedback and coaching they need to master this very challenging profession and become great teachers.”

The MET project is a collaboration between dozens of independent research teams and nearly 3,000 teacher volunteers from seven U.S. public school districts. DPS received $880,000 two years ago to participate in MET.

DPS teachers who volunteered to participate in the study were among 3,700 nationwide who were videotaped and whose teaching styles were held up and examined through every possible lens. Their students also filled out surveys about their classroom experiences. The study was limited to English and math teachers in fourth through eighth grades and English, algebra and biology teachers in ninth grade.

Denver is also the recipient of a $10 million Gates grant focused on revamping its evaluation system. The resulting system, called LEAP, put Denver on the cutting edge of innovative teacher evaluation systems in Colorado.

Random student placements

In the first year of the MET study, teaching was measured using a combination of student surveys, classroom observations and student achievement gains. The interesting twist happened in year two when teachers were randomly assigned to different classrooms of students. That eliminated any bias from assignment of high-performing students to certain teachers and lower-performing students to others, which could skew evaluations.

The students’ skills and knowledge were later measured using state tests and supplemental assessments designed to measure conceptual understanding in math and an ability to write short responses to reading prompts.

The teachers whose students fared better during the first year of the project also had students who performed better following random assignment. And the size of the achievement gains aligned with the predictions.

“This is the first large-scale study to demonstrate, using random assignment, that it is possible to identify great teaching,” according to the news release about the MET study.

“We found more effective teachers not only perform better on state tests, but also on more cognitively challenging assessments,” said Steve Cantrell, chief research officer for the Education, College Ready office of the Bill & Melinda Gates Foundation.

However, the methodology was not foolproof.

“Within every group there were some teachers whose students performed better than predicted and some whose students performed worse,” the researchers found.

Master’s degree, experience don’t add value

The research failed to find a correlation between quality teaching and whether the teacher had a master’s degree or higher credential or lots of classroom experience.

Those factors “predicted a third as well” as the other measures, Cantrell said.

“On every student outcome – the state tests, supplemental tests, student’s self-reported level of effort and enjoyment in class – the teachers who excelled on the composite measure had better outcomes than those with high levels of teaching experience or a master’s degree,” researchers found.

The study also concludes that a district must carefully balance different types of evaluation tools and that the data used must be reliable and complete.

“Measures have to be validated on an ongoing basis,” Phillips said. “High quality data requires reliable measures, and builds trust … Assuring accuracy is really vital. It requires not a one-time shot, but ongoing monitoring and training.”

“The state and district data systems are not always able to track students as they move. You need to make sure you’re attributing the right students to teachers.”

If a state places 50 percent of a teacher’s evaluation on student achievement, then 25 percent should be based on peer evaluations and 25 percent on student surveys, the study found. Or, districts should consider dividing observations, student performance and student surveys evenly in thirds.
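
To make the two weighting schemes concrete, here is a minimal sketch of how a composite score might be computed under each. The function name, the dictionary keys, the 0-to-100 score scale and the sample scores are all assumptions made for illustration; the MET report does not prescribe this code or these numbers.

```python
# Illustrative sketch only. The names, the 0-100 scale and the sample
# scores below are assumptions, not taken from the MET report.

# Scheme 1: half the weight on student achievement gains.
WEIGHTS_HALF = {"achievement": 0.50, "observation": 0.25, "survey": 0.25}

# Scheme 2: the three measures weighted equally.
WEIGHTS_EQUAL = {"achievement": 1 / 3, "observation": 1 / 3, "survey": 1 / 3}


def composite_score(scores, weights):
    """Return the weighted sum of a teacher's component scores."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical teacher: strong test gains, weaker observation rating.
example = {"achievement": 80.0, "observation": 60.0, "survey": 70.0}

half = composite_score(example, WEIGHTS_HALF)    # 40 + 15 + 17.5
equal = composite_score(example, WEIGHTS_EQUAL)  # (80 + 60 + 70) / 3
print(f"50/25/25 composite: {half:.1f}")
print(f"Equal-thirds composite: {equal:.1f}")
```

In this made-up case the teacher scores slightly higher when achievement carries half the weight, which is the kind of trade-off Phillips describes districts having to weigh for themselves.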

Peer evaluations by multiple observers

As for peer evaluations, MET researchers determined the process is more reliable if several different people evaluate the same teacher – even for shorter periods over a school year – rather than one evaluator making several trips to the same classroom. Researchers said video can be a cost-saving way to assess lessons, because several people can review the same recording and their results can be averaged.

However, it is key to observe someone during several different lessons.

“Teachers have some good days; and they have some bad days,” Cantrell said. “We want to see multiple lessons so you are more likely to come to judgment about a typical day in that teacher’s life.”

Also, it is possible for someone with a prior relationship with the teacher to conduct a fair observation with the proper training.

“It is possible for school personnel to produce valid observations – even for teachers they know,” Cantrell said. “You average the judgments of two or more observers, not just the school principal.”
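
The averaging Cantrell describes can be sketched in a few lines. This is an illustration under stated assumptions, not the MET project’s procedure: the function name, the 1-to-4 rating scale and the sample ratings are hypothetical.

```python
# Illustrative sketch only. The function name, the 1-4 rating scale and
# the sample ratings are assumptions, not taken from the MET report.
from statistics import mean


def average_observation(ratings_by_lesson):
    """Average observer ratings within each lesson, then across lessons.

    ratings_by_lesson maps a lesson label to the list of ratings that
    different observers gave that lesson.
    """
    lesson_means = [mean(r) for r in ratings_by_lesson.values() if r]
    if not lesson_means:
        raise ValueError("at least one rated lesson is required")
    return mean(lesson_means)


# Hypothetical teacher seen in three lessons by two observers each,
# for example the principal and a trained peer observer.
ratings = {
    "lesson_1": [3.0, 2.0],  # a weaker day
    "lesson_2": [4.0, 3.0],  # a stronger day
    "lesson_3": [3.0, 3.0],
}

overall = average_observation(ratings)
print(f"Averaged observation rating: {overall:.2f}")
```

Averaging over observers dampens any one person’s bias, and averaging over lessons dampens the effect of a single good or bad day.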

As part of Denver’s LEAP (Leading Effective Educator Practice) program, 45 trained peer observers were hired to conduct regular visits to DPS classrooms because teachers raised concerns that observations by principals alone could be biased and reflect personality conflicts. In Denver, the same peer observer continues to observe the same teacher for a set period.

The MET study also found that “although school administrators rate their own teachers somewhat higher than do outside observers, how they rank their teachers’ practice is very similar and teachers’ own administrators actually discern bigger differences in teaching practice, which increases reliability.”

As for observation rubrics, Denver ended up paring down its rubric when teachers complained. As it turns out, the Gates study backs up a more concise feedback form.

“Longer rubrics that include more competencies actually require more training,” Cantrell said. “It is more difficult to achieve reliability … (if there is) such a heavy cognitive load on the classroom observer.”

However, researchers also confirmed that observation alone is not a valid measure of teacher effectiveness.

Guiding principles

The Gates Foundation has developed a set of guiding principles that states and districts can use when building teacher evaluation systems. These principles are based on both the MET project findings and the experiences of the foundation’s partner districts over the past four years.

As for how to evaluate teachers whose students don’t take standardized tests (think art and music), Phillips said the Gates Foundation will have more to say about that in coming months.

This article was changed after posting to clarify that measures of student academic growth used in Colorado include more than the results of statewide achievement tests.

Final MET report “Ensuring Fair and Reliable Measures”