Can teaching be given a score?

Early this year, I was fortunate enough to hear a keynote talk by Thomas Good, a veteran of process-product education research. As part of his talk, he outlined what this research shows about effective teaching and listed a set of criteria such as appropriate expectations, supportive classrooms and active teaching (I don’t have a link to the keynote but it is based upon this 2008 book and you can read an earlier summary of the research here).

Good was at pains to stress that these criteria should not be seen as some sort of checklist for classroom observation. They interact in complex ways and the whole is more than the sum of its parts. This is good advice.

In the UK, lessons are observed by OFSTED inspectors when they visit schools. Until recently, individual lessons were given a grade: outstanding, good, requires improvement or inadequate. The momentum for change seems to have built as bloggers such as Andrew Old exposed the workings of the system and asked questions about how these grades were being generated.

There are two basic issues with any form of assessment: reliability and validity. Classroom observation is reliable if different observers using the same criteria would give a particular lesson the same rating. It is valid if we can infer something useful from the rating: in this case, does a high rating reflect good teaching and a low rating reflect poor teaching?
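
Reliability in the inter-rater sense can actually be quantified. As a minimal sketch (the grades below are invented for illustration, not real inspection data), Cohen's kappa measures how often two observers assign the same grade to the same lessons once chance agreement is stripped out:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on the same items, corrected for
    the agreement we would expect by chance alone."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[g] / n) * (counts_b[g] / n)
                   for g in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Invented grades that two observers might give the same ten lessons
observer_1 = ["good", "good", "outstanding", "inadequate", "good",
              "requires improvement", "good", "outstanding", "good",
              "inadequate"]
observer_2 = ["good", "requires improvement", "good", "inadequate", "good",
              "requires improvement", "good", "outstanding", "inadequate",
              "good"]

print(round(cohens_kappa(observer_1, observer_2), 2))  # prints 0.4
```

A kappa of 1 is perfect agreement and 0 is no better than chance; the point is that single-observer grading of the kind OFSTED practised was never checked against any statistic like this.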

OFSTED observations likely suffered from problems of both validity and reliability. Work at the MET project, set up by the Bill and Melinda Gates Foundation, indicates that ‘gold-standard’ forms of lesson observation (more on this later) require multiple observers, multiple observations and rigorous behavioural criteria in order to produce even a moderate level of reliability (see Professor Robert Coe’s blog on this). OFSTED had none of these. Yet we cannot know how reliable the observations actually were because the necessary data was never collected.

In terms of validity, bloggers have documented OFSTED inspectors looking for group work or marking lessons down for being too teacher-led (e.g. here and here). Neither of these would seem to relate to strong research evidence around effective teaching. Collaborative learning can be highly effective but only when certain conditions are met. Simply having lots of group work (or ‘independent work’ in OFSTED’s strange jargon) is not sufficient.

Returning to the MET project, we can see what it takes to make lesson observation better than this. A number of lesson observation systems were evaluated and two of these were designed for observing a range of subjects and year groups: the Classroom Assessment Scoring System (CLASS™) and Charlotte Danielson’s Framework for Teaching (FFT).

The researchers found good evidence for reliability and validity; teachers who scored highly on these scales tended to be more effective on measures such as value-added scores (“achievement gains”). However, the conditions required to produce reliability were atypical of standard classroom observation and were still not the best predictor of future test gains:

“The composite that best indicated improvement on state tests heavily weighted teachers’ prior student achievement gains based on those same tests.”

As I mentioned earlier, to achieve the reliability and validity found in the MET project, lesson observation scores needed to be averaged over multiple lessons and multiple raters. The lessons were videoed and raters judged the videos. Teachers presumably did not know how they were to be judged. This also means there was little possibility of backwash from the observation instrument into the teaching. This is important.
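
The gain from averaging can be made concrete with the Spearman–Brown formula, which predicts the reliability of the mean of k comparable observations from the reliability of a single one. The starting figure below is purely illustrative; it is not a MET estimate:

```python
def spearman_brown(r_single, k):
    """Predicted reliability of the average of k comparable observations,
    given the reliability r_single of a single observation."""
    return k * r_single / (1 + (k - 1) * r_single)

# A purely illustrative single-lesson reliability, not a MET figure
r1 = 0.35
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(r1, k), 2))
```

Even from a modest starting point, each doubling of the number of scored lessons and raters raises the predicted reliability substantially, which is why the ‘gold-standard’ designs demand multiple observations rather than one.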

This relates to Goodhart’s Law: when a measure becomes a target, it ceases to function well as a measure. Let me give an example.

Early in my career in London, the notion of writing lesson objectives on the whiteboard at the start of the lesson became a big deal. Many schools had a policy requiring this. I assumed it was linked to good practice. In fact, I now realise that the idea probably arose out of process-product research on teacher effectiveness and the importance of teachers sharing their aims with the students, an aspect of explicit instruction. And I suspect it became popular because it is easy to observe. It is a behavioural measure.

However, through this kind of mandate, we had created a system where teachers were engaging in a particular behaviour whether they understood the point of it or not. It may well be the case that effective teachers would choose to communicate aims to their students, but we cannot be sure that the reverse is true. Simply making teachers share their aims will not necessarily make them effective teachers (recall Thomas Good’s note of caution). A measure that we were previously able to use as a possible proxy for effectiveness now tells us little.

All rubrics suffer from this problem, including those that we use to rate our students’ performance in complex tasks:

How rubrics fail - Greg Ashman

The implications of this effect for the assessment of things like essays are a topic for another post.

Does this mean that we should not observe lessons? Not in my view. Personally, I believe that lesson observations are excellent for telling you if there is something badly wrong in a classroom. They can also help refine practice when used as part of a qualitative dialogue between professionals.

I am deeply sceptical about using lesson observations to generate a score on an implied linear scale, particularly if these scores are then used as part of high-stakes teacher evaluation. I suspect that observers will quickly find that they get whatever they are looking for. And then we’ll all scratch our heads as to why the quality of learning didn’t improve.


20 thoughts on “Can teaching be given a score?”

  1. In answer to your question Greg, yes, I do think teaching quality can be measured, particularly for research purposes. It’s what those scores mean and how they are used that matters. I think the correct use of standardised observational tools, like CLASS, is fairer than judging teachers on student test scores because such observations can – believe it or not – highlight where teachers are doing a brilliant job in very challenging schools, when things like NAPLAN might indicate otherwise (and vice versa).

    I’ve heard dreadful things about Ofsted observations and agree with the points made by Coe, but we’re talking about a totally different thing here. From what I’ve read, teachers in England are being judged as satisfactory, in need of improvement or unsatisfactory, on the basis of a 15-minute observation by people who often aren’t trained! That’s absurd and, in my view, unethical. In our study, we’ve had three certified observers using the CLASS measure and the concordance between the three (sometimes observing the same lessons and sometimes observing different ones) was freakishly consistent. It takes (realistically) about 50 hours of training to achieve reliability and certification. I had a fairly long explanation of CLASS in my blog post but it was taken out (lest I rob readers of the will to live) because I anticipated some folks might find the concept worrisome. I’ve copied it in below FYI.

    “The observational measure that we are using is not well known in Australia but it is used in every Head Start classroom in the United States and is based on rigorous research involving 4,000 classrooms. To my knowledge, the E4Kids ARC Linkage project led by Professor Collette Tayler from the University of Melbourne is the only other Australian study to have used it.

    Developed by a team of researchers led by Professor Robert Pianta in the Centre for the Advanced Study of Teaching and Learning (CASTL) at the University of Virginia, the Classroom Assessment Scoring System (or CLASS) is a highly sensitive observational tool that attends to the many and varied aspects of quality teaching. Elements that are known to be strongly associated with children’s social development and academic achievement are assessed via 10 key dimensions which are split between three conceptual domains:
    – Emotional Support
      o Positive Climate, Negative Climate, Teacher Sensitivity, and Regard for Student Perspectives
    – Classroom Organisation
      o Behaviour Management, Productivity, and Instructional Learning Formats
    – Instructional Support
      o Concept Development, Quality of Feedback, and Language Modelling.

    Each dimension is, in turn, described by explicit categories or indicators. The dimension of Behaviour Management, for example, comprises four indicators: evidence of clear behaviour expectations, evidence of proactive management, evidence of redirection, and student mis/behaviour. For each of the 10 dimensions, ratings based on the presence or absence of key behavioural markers for each indicator are made on a 7-point scale, ranging from “Low” (1-2) through “Mid” (3-5) to “High” (6-7). Observation sets involve a minimum of two hours per classroom, comprising four ½-hour cycles that each incorporate 20 minutes of observation and 10 minutes of coding. This might not sound like much to ethnographers but, as teaching is so complex and the tool is so comprehensive, a great deal is captured during each observational cycle.”

    The other thing that I like about the comprehensiveness of the CLASS is that, regardless of how strong or weak the teacher may be, the measure picks up areas in which they would benefit from coaching or PD, and strengths which they may in turn be able to share with others. It certainly is not, and should never be, used to assign a single score or verdict like ‘unsatisfactory’. Together with other measures (like child measures and child report) it can tell you a lot about classroom climate and its effects on a whole range of things, not just (sigh!) test scores.

    • Thanks for your contribution. I really appreciate you taking the time to comment on my blog. Having researched it, I definitely think that CLASS and FFT represent the gold standard for lesson observation and I am happy with CLASS being used as a research tool. So there is much common ground here. I am a little more sceptical about it being used as part of teacher improvement for the reasons that I have outlined – although I will retain an open mind. I acknowledge that it is likely to be far better for this purpose than the OFSTED regime.

  2. PS: my son’s school does that thing of writing lesson objectives on the board etc. Whilst I can see where the idea for this came from (one CLASS indicator is ‘clarity of learning objectives’), things can become McDonaldised and lose their power as well as their meaning in the process. I completely agree with you on that one because once an action becomes perfunctory, or if the teacher doesn’t fully understand what they’re doing or why, then it means nothing in the most important way: which is whether the kids understand what the learning objectives are.

  3. Generally, I think the principle of setting targets focussed on apparent indicators of good practice is very suspect. In the commercial sphere these are often KPIs (key performance indicators), and they often work exactly as you describe. This also has a bearing on observing ‘outstanding’ practitioners and attempting to transfer those characteristics to others. I am not convinced that this is possible in any sphere. I agree that such things can highlight very poor practice, but I am not convinced that they can create excellence. Excellence is more than a narrow set of characteristics, and is often a result of a combination of practice and intention (motives matter). I am increasingly convinced that the pedagogy is not key and that it is possible to be a highly effective teacher via different pedagogies… but which one will depend on various factors for each individual teacher and within a specific educational context. Trying to measure it doesn’t appear to lead to improved practice. That should tell us something important.

  4. David says:

    Hi Greg,

    Two things for you to look at:

    David Berliner, “Exogenous Variables and Value-Added Assessments: A Fatal Flaw,” Teachers College Record, 2014.

    Joseph Murphy, Philip Hallinger and Ronald Heck, “Leading via Teacher Evaluation: The Case of the Missing Clothes?,” Educational Researcher, 2013.

    Both articles support your point in this post about the difficulties of developing a scoring system for teacher evals. Also, as we have been seeing in the US, the drive to provide metrics for a process involving human interactions is questionable at best.

  5. Hi Greg. Is it ok if I use the rubric image in slides for teacher training? I noticed the copyright and don’t want to use it if you are unhappy with such use.
