There are a number of similarities and differences between the education systems of England and Australia. As someone who made the move from the UK, I am able to describe these differences in ways that might add a little perspective to the debates happening in England.
Early this year, I was fortunate enough to hear a keynote talk by Thomas Good, a veteran of process-product education research. As part of his talk, he outlined what this research shows about effective teaching and listed a set of criteria such as appropriate expectations, supportive classrooms and active teaching (I don’t have a link to the keynote but it is based upon this 2008 book and you can read an earlier summary of the research here).
Good was at pains to stress that these criteria should not be seen as some sort of checklist for classroom observation. They interact in complex ways and the whole is more than the sum of its parts. This is good advice.
In the UK, lessons are observed by OFSTED inspectors when they visit schools. Until recently, individual lessons were given a grade: outstanding, good, requires improvement or inadequate. The momentum for change seems to have built as bloggers such as Andrew Old exposed the workings of the system and asked questions about how these grades were being generated.
There are two basic issues with any form of assessment: reliability and validity. Classroom observation is reliable if different observers using the same criteria would give a particular lesson the same rating. It is valid if we can infer something useful from that rating. In this case, does a high rating reflect good teaching and a low rating reflect poor teaching?
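To make the idea of reliability concrete, here is a minimal sketch of how agreement between two observers could be quantified. It uses Cohen's kappa, a standard statistic that corrects raw agreement for the agreement expected by chance. The grades below are invented for illustration and are not taken from OFSTED or any real inspection; the function names are my own.

```python
from collections import Counter

GRADES = ["outstanding", "good", "requires improvement", "inadequate"]


def percent_agreement(rater_a, rater_b):
    """Share of lessons on which the two observers gave the same grade."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two observers, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both pick the same grade if each
    # observer graded independently according to their own grade frequencies.
    expected = sum(counts_a[g] * counts_b[g] for g in GRADES) / (n * n)
    if expected == 1:
        return 1.0  # both observers used a single identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two observers watching the same eight lessons.
observer_1 = ["good", "good", "outstanding", "requires improvement",
              "good", "inadequate", "good", "outstanding"]
observer_2 = ["good", "outstanding", "outstanding", "good",
              "good", "requires improvement", "good", "good"]

print(percent_agreement(observer_1, observer_2))
print(cohens_kappa(observer_1, observer_2))
```

With these made-up grades the observers agree on half the lessons, but kappa comes out at roughly 0.18, because two observers who both hand out "good" most of the time will often agree by chance alone. Raw agreement can flatter a rating system.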
OFSTED observations likely suffered from problems of both validity and reliability. Work at the MET project set up by the Bill and Melinda Gates Foundation indicates that ‘gold-standard’ forms of lesson observation (more later) require multiple observers, multiple observations and rigorous behavioural criteria in order to produce a moderate level of reliability (see Professor Robert Coe’s blog on this). OFSTED had none of these. Yet, we cannot know how reliable the observations actually were because the necessary data was never collected.
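Why would multiple observations help? One way to see it is the Spearman-Brown formula from classical test theory, which estimates the reliability of an average of k comparable ratings from the reliability of a single rating. The sketch below is only an illustration: the single-observation reliability of 0.3 is an assumed figure, not a number from the MET project, and the formula treats all observations as interchangeable, which is a simplification when both raters and lessons vary.

```python
import math


def spearman_brown(single_reliability, k):
    """Estimated reliability of the mean of k comparable observations."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)


def observations_needed(single_reliability, target):
    """Smallest number of observations whose average reaches the target."""
    r = single_reliability
    # Rearranging the Spearman-Brown formula for k.
    k = target * (1 - r) / (r * (1 - target))
    return math.ceil(k)


# Assumed (hypothetical) reliability of one observation by one observer.
single = 0.3

for k in (1, 2, 4, 6):
    print(k, round(spearman_brown(single, k), 2))

print(observations_needed(single, 0.7))
```

Under that assumption, a single observation is a weak basis for any judgement, and it takes six averaged observations to pass a reliability of 0.7. A one-off visit by a single observer is a long way from this.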
In terms of validity, bloggers have documented OFSTED inspectors looking for group work or marking lessons down for being too teacher-led (e.g. here and here). Neither of these would seem to relate to strong research evidence around effective teaching. Collaborative learning can be highly effective but only when certain conditions are met. Simply having lots of group work (or ‘independent work’ in OFSTED’s strange jargon) is not sufficient.
Returning to the MET project, we can see what it takes to make lesson observation better than this. A number of lesson observation systems were evaluated, and two of these were designed for observing a range of subjects and year groups: the Classroom Assessment Scoring System (CLASS™) and Charlotte Danielson’s Framework for Teaching (FFT).
The researchers found good evidence for reliability and validity: teachers who scored highly on these scales tended to be more effective on measures such as value-added scores (“achievement gains”). However, the conditions required to produce reliability were atypical of standard classroom observation, and even then the observation scores were still not the best predictor of future test gains:
As I mentioned earlier, to achieve the reliability and validity found in the MET project, lesson observation scores needed to be averaged over multiple lessons and multiple raters. The lessons were videoed and raters judged the videos. Teachers presumably did not know how they were to be judged. This also means there was little possibility of backwash from the observation instrument into the teaching. This is important.
It relates to Goodhart’s Law: when a measure becomes a target, it ceases to function well as a measure. Let me give an example.
Early in my career in London, the notion of writing lesson objectives on the whiteboard at the start of the lesson became a big deal. Many schools had a policy requiring this. I assumed it was linked to good practice. In fact, I now realise that the idea probably arose out of process-product research on teacher effectiveness and the importance of teachers sharing their aims with the students, an aspect of explicit instruction. And I suspect it became popular because it is easy to observe. It is a behavioural measure.
However, through this kind of mandate, we had created a system where teachers were engaging in a particular behaviour whether they understood the point of it or not. It may well be the case that effective teachers would choose to communicate aims to their students, but we cannot be sure that the reverse is true. Simply making teachers share their aims will not necessarily make them effective teachers (recall Thomas Good’s note of caution). A measure that we were previously able to use as a possible proxy for effectiveness now tells us little.
All rubrics suffer from this problem, including those that we use to rate our students’ performance in complex tasks:
The implications of this effect for the assessment of things like essays are for another post.
Does this mean that we should not observe lessons? Not in my view. Personally, I believe that lesson observations are excellent for telling you if there is something badly wrong in a classroom. They can also help refine practice when used as part of a qualitative dialogue between professionals.
I am deeply skeptical about using lesson observations to generate a score on an implied linear scale, particularly if these scores are then used as part of high-stakes teacher evaluation. I suspect that observers will quickly find that they get whatever they are looking for. And then we’ll all scratch our heads as to why the quality of learning didn’t improve.