First Person

Voices: Can evaluations do more than sort teachers?

Colorado education researcher Robert Reichardt says newly available evidence from two economists suggests the act of evaluating teachers can result in improved performance.

One of the key assumptions behind Senate Bill 10-191, the educator effectiveness policy, is that teacher evaluations can lead to improved teacher effectiveness. The research base on this assumption is not particularly deep. One of the strongest studies on this issue comes from a pair of economists, Eric Taylor and John Tyler. They looked at the relationship between being evaluated and subsequent teacher value-added using data from a long-running evaluation system in Cincinnati. Until recently, their research was hidden behind a paywall at the National Bureau of Economic Research, or NBER. However, a new summary of their work is available at EducationNext.

Their key finding is that being evaluated was associated with improved teacher effectiveness. An average student’s math scores were 4.5 percentile points higher in the year after a teacher was evaluated than in the year before the evaluation. The results suggest that the process of providing teachers with feedback can lead to improved teacher effectiveness. In other words, teacher evaluation systems can have benefits beyond identifying highly-rated teachers for rewards and exiting those with low ratings.

Teacher effectiveness continued to improve in the years following the evaluation – contrary to the notion that effectiveness improvements plateau after the first few years of teaching. Teachers with the lowest evaluation scores tended to improve the most – suggesting at least some evaluations had a motivational effect.

These improvements occurred despite the fact that 90 percent of teachers were ultimately rated either “proficient” or “distinguished” (the other, lower performance levels were “basic” and “unsatisfactory”). This is important because many researchers and policymakers have argued an effective evaluation system needs to have a wider distribution (i.e., a larger proportion of teachers should be rated lower). Taylor and Tyler do say that there was more variation on sub-scales within the observation rubric and between observations (the final rating was a product of four observations). The study also does not mention using the evaluation ratings to target teacher professional development.

This study does have limits

There are several important limitations to the study. First, it focused only on mid-career teachers in grades four through eight who, on average, were evaluated once every five years. Second, no relationship was found between evaluations and subsequent teacher effectiveness in reading.

Equally important, it still leaves many unanswered questions for those designing Colorado’s new evaluation systems. In particular:

  • What aspects of the Cincinnati system were central to its success? Was it the four observations in one year, or the four performance levels?
  • Was it the use of the trained peer observers for three of those observations?
  • Was it that three of the four observations were unannounced?
  • Was it the risk of being placed on an improvement plan for low-performing teachers, or the promotion opportunities that opened for some teachers who received higher ratings?

Finally, Cincinnati’s system was based entirely on observations. What lessons does this hold for Colorado’s system, where 50 percent of the final rating will be based upon student growth?

One thing the study does make clear is that the evaluation system is not cheap. It cost approximately $7,500 per teacher evaluated, with most of the money spent on peer evaluators. That is similar to the estimated yearly spending per teacher on professional development.

As a state, we are investing a lot of energy to implement a new teacher evaluation system. One study does not make a strong research base, nor does it answer the myriad design questions faced by system developers. However, this study does suggest this investment will have a payoff.

First Person

I’ve spent years studying the link between SHSAT scores and student success. The test doesn’t tell you as much as you might think.

PHOTO: Robert Nickelsberg/Getty Images

Proponents of New York City’s specialized high school exam, the test the mayor wants to scrap in favor of a new admissions system, defend it as meritocratic. Opponents contend that when used without consideration of school grades or other factors, it’s an inappropriate metric.

One thing that’s been clear for decades about the exam, now used to admit students to eight top high schools, is that it matters a great deal.

Students who are admitted may receive not only a superior education, but also access to elite colleges and, eventually, better employment. That system has also led to an under-representation of Hispanic students, black students, and girls.

Starting as a doctoral student at The Graduate Center of the City University of New York in 2015, and continuing in the years since I received my Ph.D., I have tried to understand how meritocratic the process really is.

First, that requires defining merit. Only New York City defines it as the score on a single test — other cities’ selective high schools use multiple measures, as do top colleges. There are certainly other potential criteria, such as artistic achievement or citizenship.

However, when merit is defined as achievement in school, the question of whether the test is meritocratic is an empirical question that can be answered with data.

To do that, I used SHSAT scores for nearly 28,000 students and school grades for all public school students in the city. (To be clear, the city changed the SHSAT itself somewhat last year; my analysis used scores on the earlier version.)

My analysis makes clear that the SHSAT does measure an ability that contributes to some extent to success in high school. Specifically, an SHSAT score predicts 20 percent of the variability in freshman grade-point average among all public school students who took the exam. Students with extremely high SHSAT scores (greater than 650) generally also had high grades when they reached a specialized school.

However, for the vast majority of students who were admitted with lower SHSAT scores, from 486 to 600, freshman grade point averages ranged widely — from around 50 to 100. That indicates that the SHSAT was a very imprecise predictor of future success for students who scored near the cutoffs.

Course grades earned in the seventh grade, in contrast, predicted 44 percent of the variability in freshman year grades, making them a far better admissions criterion than the SHSAT score, at least for students near the score cutoffs.
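For readers curious what “predicts 20 percent of the variability” means, it refers to the squared correlation (R²) between predictor and outcome. The sketch below illustrates the calculation with synthetic, made-up numbers — not the study’s actual scores — where the noise level is chosen so the variance explained lands in the same ballpark as the article’s figure.

```python
# Illustrative only: computes "percent of variability explained" (R^2)
# for a synthetic predictor/outcome pair. The data are invented; the
# signal-to-noise ratio is set so R^2 comes out near 20 percent.
import random

random.seed(0)

n = 1000
# Hypothetical test scores, roughly centered near a mid-range score.
scores = [random.gauss(550, 50) for _ in range(n)]
# Hypothetical freshman GPAs: weakly related to the score, plus noise.
gpas = [0.05 * s + random.gauss(60, 5) for s in scores]

mean_x = sum(scores) / n
mean_y = sum(gpas) / n

# Pearson correlation r; r**2 is the share of variance in the outcome
# that a linear function of the predictor accounts for.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, gpas))
var_x = sum((x - mean_x) ** 2 for x in scores)
var_y = sum((y - mean_y) ** 2 for y in gpas)
r = cov / (var_x * var_y) ** 0.5

print(f"correlation r = {r:.2f}, variance explained = {r**2:.0%}")
```

A predictor explaining 44 percent of the variance, as seventh-grade GPA did in the study, corresponds to a correlation of about 0.66 — versus roughly 0.45 for a predictor explaining 20 percent.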

It’s not surprising that a standardized test does not predict as well as past school performance. The SHSAT represents a two-and-a-half-hour sample of a limited range of skills and knowledge. In contrast, middle-school grades reflect a full year of student performance across the full range of academic subjects.

Furthermore, an exam that relies almost exclusively on one method of assessment, multiple-choice questions, may fail to measure abilities that are revealed by the variety of assessment methods that go into course grades. Additionally, middle school grades may capture something important that the SHSAT fails to capture: long-term motivation.

Based on his current plan, Mayor de Blasio seems to be pointed in the right direction. His focus on middle school grades and the Discovery Program, which admits students with scores below the cutoff, is well supported by the data.

In the cohort I looked at, five of the eight schools admitted some students with scores below the cutoff. The sample sizes were too small at four of them to make meaningful comparisons with regularly admitted students. But at Brooklyn Technical High School, the performance of the 35 Discovery Program students was equal to that of other students. Freshman year grade point averages for the two groups were essentially identical: 86.6 versus 86.7.

My research leads me to believe that it might be reasonable to admit a certain percentage of the students with extremely high SHSAT scores — over 600, where the exam is a good predictor — and admit the remainder using a combined index of seventh grade GPA and SHSAT scores.

When I used that formula to simulate admissions, diversity increased somewhat. An additional 40 black students, 209 Hispanic students, and 205 white students would have been admitted, as well as an additional 716 girls. It’s worth pointing out that in my simulation, Asian students would still constitute the largest segment of students (49 percent) and would be admitted in numbers far exceeding their proportion of applicants.

Because middle school grades are better than test scores at predicting high school achievement, their use in the admissions process should not in any way dilute the quality of the admitted class, nor should it be seen as discriminating against Asian students.

The success of the Discovery students should allay some of the concerns about whether students with SHSAT scores below the cutoffs can succeed. There is no guarantee that similar results would be achieved in an expanded Discovery Program. But this finding certainly warrants larger-scale trials.

With consideration of additional criteria, it may be possible to select a group of students who will be more representative of the community the school system serves — and the pool of students who apply — without sacrificing the quality for which New York City’s specialized high schools are so justifiably famous.

Jon Taylor is a research analyst at Hunter College analyzing student success and retention. 

First Person

With roots in Cuba and Spain, Newark student came to America to ‘shine bright’

PHOTO: Patrick Wall
Layla Gonzalez

This is my story of how we came to America and why.

I am from Mallorca, Spain. I am also from Cuba, because of my dad. My dad is from Cuba and my grandmother, grandfather, uncle, aunt, and so on. That is what makes our family special — we are different.

We came to America when my sister and I were little girls. My sister was three and I was one.

The first reason why we came here to America was for a better life. My parents wanted to raise us in a better place. We also came for better jobs and better pay so we can keep this family together.

We also came here to have more opportunities — they do call this country the “Land Of Opportunities.” We came to make our dreams come true.

In addition, my family and I came to America for adventure. We came to discover new things, to be ourselves, and to be free.

Moreover, we also came here to learn new things like English. When we came here we didn’t know any English at all. It was really hard to learn a language that we didn’t know, but we learned.

Thank God that my sister and I learned quickly so we can go to school. I had a lot of fun learning and throughout the years we do learn something new each day. My sister and I got smarter and smarter and we made our family proud.

When my sister Amira and I first walked into Hawkins Street School I had the feeling that we were going to be well taught.

We have always been taught by the best even when we don’t realize. Like in the times when we think we are in trouble because our parents are mad. Well we are not in trouble, they are just trying to teach us something so that we don’t make the same mistake.

And that is why we are here to learn something new each day.

Sometimes I feel like I belong here and that I will be alright. Because this is the land where you can feel free to trust your first instinct and to be who you want to be and smile bright and look up and say, “Thank you.”

As you can see, this is why we came to America and why we can shine bright.

Layla Gonzalez is a fourth-grader at Hawkins Street School. This essay is adapted from “The Hispanic American Dreams of Hawkins Street School,” a self-published book by the school’s students and staff that was compiled by teacher Ana Couto.