Three dogs, two cats and a rabbit

The following is a version of the argument that I advanced in yesterday’s panel discussion about meta-analysis and meta-meta-analysis at researchED 2018. It’s probably a little more coherent than the live version due to the ability to edit. It is worth reading Robert Coe’s posts on the same issue.

There are two interlinked problems with the kind of meta-meta-analysis represented by the Education Endowment Foundation’s Toolkit and John Hattie’s Visible Learning. The first is the way that the effect size metric is used (quoted as ‘months of additional progress’ in the case of the Education Endowment Foundation) and the second is the way that very different kinds of interventions are packaged together.

The effect size metric

There is, in my opinion, nothing intrinsically wrong with effect sizes. This is not an argument about whether effect sizes are good or bad; it is about the validity of the inferences that we may draw from them.
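
To make the metric concrete, here is a minimal sketch of how a standardised mean difference is commonly computed. It assumes that ‘d’ means Cohen’s d with a pooled standard deviation; individual studies and syntheses may use variants. The function name and the toy scores are mine, purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardised mean difference using a pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Toy scores, invented for illustration only.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```

Because the difference in means is divided by the spread of scores, the same raw gain gives a different d in a sample with a narrower or wider spread. That is one reason why the inferences need care.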

For example, in a 2011 study, Slavin, Lake, Davis and Madden use effect sizes to compare different kinds of reading intervention. They find that those based upon a reading recovery or similar approach tended to have lower effect sizes than those based upon explicit phonics teaching.

The studies being compared have a good deal in common:

  • The participants were of a similar age.
  • The interventions were all one-to-one, so they are not comparing a one-to-one approach with a whole-class or small-group approach.
  • The measures were all standardised reading tests.
  • The studies were all experiments or quasi-experiments.

In this case, it seems reasonable to compare effect sizes (even if I might query lumping experimental and quasi-experimental studies together). It’s still not definitive and I would prefer three-armed studies to be run that compare both interventions with a control (I’ve argued that the Education Endowment Foundation should run such trials but their practice of randomising at the school level leaves many two-armed studies underpowered and so this is unlikely to happen soon).

In meta-meta-analysis, we take no such precautions about comparing like with like. Robert Slavin has argued that, because good quality studies tend to have effect sizes less than d=0.4*, if we do as Hattie suggests and apply this as a filter, we select in favour of the crappy studies. So it’s not just that poor studies are in the mix; they carry disproportionate weight.
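
Slavin’s argument concerns study quality in general. As a rough illustration of just one possible mechanism, the sketch below assumes that weaker studies are smaller studies and asks how often a study would clear a d=0.4 bar through sampling noise alone. The true effect of 0.2, the sample sizes and the function name are all hypothetical, and the calculation relies on the usual large-sample normal approximation for the variance of d.

```python
from math import sqrt
from statistics import NormalDist


def prob_clears_filter(true_d, n_per_arm, threshold=0.4):
    """Approximate chance that an observed d exceeds the threshold.

    Uses the large-sample approximation Var(d) = 2/n + d^2/(4n)
    for two equal arms of n participants each.
    """
    se = sqrt(2 / n_per_arm + true_d ** 2 / (4 * n_per_arm))
    return 1 - NormalDist(mu=true_d, sigma=se).cdf(threshold)


# Hypothetical scenario: every study has the same true effect of d=0.2.
small = prob_clears_filter(0.2, n_per_arm=20)
large = prob_clears_filter(0.2, n_per_arm=500)
print(f"small study clears the bar: {small:.1%}")
print(f"large study clears the bar: {large:.1%}")
```

Under these assumptions, the small study clears the bar far more often than the large one, even though the true effect is identical. A filter of this kind would therefore tilt the mix towards small studies.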

When it comes to the Education Endowment Foundation’s Toolkit, I have paid most of my attention to the ‘metacognition and self-regulation’ strand because I have started to notice that programs that would once have sold themselves as ‘constructivist’ or ‘reform maths’ now tend to make claims that they encourage metacognition. A recent trial of a maths programme in South Australia is a current example.

If you investigate this strand, you find that experimental studies conducted by the Education Endowment Foundation tend to have much lower effect sizes than the ones quoted in the literature review (with a notable exception that we will visit later). In addition, these studies assess very different outcomes such as:

  • Reading
  • Writing
  • Maths
  • Science
  • Critical thinking

What valid inferences may we draw from an average effect size over such a range of outcomes, even if we set aside the ways that sampling, age of subjects, quality of study etc vary the effect size? Which leads to the next point.


What does an explicit writing programme, where students are taught how to plan their writing, have in common with a generic thinking programme where students discuss such matters as whether it is acceptable to hit a teddy bear? Both sit within the Education Endowment Foundation’s strand of ‘metacognition and self-regulation’.

The first programme, Improving Writing Quality, is an example of what Barak Rosenshine would describe as ‘strategy instruction’ and he uses strategy instruction as one of his three main sources of evidence for the effectiveness of direct instruction. It is therefore hardly surprising that this intervention had by far the largest effect size of any of the experimental studies in this strand (d=0.74).

The second programme, Philosophy for Children, does not have such a pedigree. The effect sizes for reading and maths are d=0.14 and d=0.13; that is, if you really believe there was any effect at all.

So why are we averaging all these very different things? It’s like measuring the heights of three dogs, two cats and a rabbit and then working out the average. What valid inferences may we draw from such a figure?
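
To put numbers on the analogy, the sketch below takes a simple unweighted mean of the three effect sizes quoted above. This is only an illustration: it is not how the Toolkit arrives at its headline figure, and the variable names are mine.

```python
from statistics import mean

# Effect sizes quoted in this post.
effect_sizes = {
    "Improving Writing Quality": 0.74,
    "Philosophy for Children (reading)": 0.14,
    "Philosophy for Children (maths)": 0.13,
}

average = mean(list(effect_sizes.values()))
print(f"unweighted mean d = {average:.2f}")

# How far each individual result sits from the 'typical' figure.
for name, d in effect_sizes.items():
    print(f"{name}: d = {d:.2f}, distance from mean = {abs(d - average):.2f}")
```

The mean comes out at about 0.34, a figure that describes neither programme: it is less than half the writing result and more than double either Philosophy for Children result.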

A way forward

Professor Robert Coe has suggested that critics of meta-meta-analysis need to propose a better alternative. I don’t think we do actually bear this burden but here’s my view: Why not produce reviews, stating all the available evidence for a particular kind of intervention, but without computing an overall effect size? [The amorphous nature of the categories would then be less of an issue – this point wasn’t in my talk but was a clarification in response to a challenge from Professor Higgins].

Professor Coe has asked how we could, for instance, provide teachers with evidence about learning styles without meta-meta-analysis. The last place I would send anyone to look for this evidence is the Toolkit. Although the commentary is sound, right at the top is the spurious figure of ‘two months of additional progress’.

*Steve Higgins disputed this in the discussion, claiming that in the studies he has analysed, study quality does not affect effect sizes.


12 thoughts on “Three dogs, two cats and a rabbit”

  1. I like your analogy of averaging the height of different animals.

    Gene Glass, who originally promoted the method of meta-analysis, said,

    “The result of a meta-analysis should never be an average; it should be a graph.”

  2. Ted Lynch says:

    Professor Terry Wrigley has another good analogy:

    ‘Its method is based on stirring together hundreds of meta-analyses reporting on many thousands of pieces of research to measure the effectiveness of interventions.

    This is like claiming that a hammer is the best way to crack a nut, but without distinguishing between coconuts and peanuts, or saying whether the experiment used a sledgehammer or the inflatable plastic one that you won at the fair’.

  3. Sbeari says:

    Really interesting discussion. Have you listened to the Ollie Lovell podcast on comparing effect sizes? A recent paper by Simpson uses a funny analogy with comparing pictures of elephants and princesses. I do not agree that Slavin’s paper fairly compares reading interventions using effect sizes: having standardised tests is not sufficient for comparison, as even tests intended to measure the same outcome will give different effect sizes depending on their length and the form of the questions. The Slavin paper also looked at studies with different ranges of ability (some highly narrow, some a bit wider) and this changes the effect size. The studies also used different comparison treatments, which does not make for a fair comparison of effect sizes.

    We have to consider comparing interventions by comparing effect sizes in studies as just as discredited in education as learning styles.
