The following is a version of the argument that I advanced in yesterday’s panel discussion about meta-analysis and meta-meta-analysis at researchED 2018. It’s probably a little more coherent than the live version due to the ability to edit. It is worth reading Robert Coe’s posts on the same issue.
There are two interlinked problems with the kind of meta-meta-analysis represented by the Education Endowment Foundation’s Toolkit and John Hattie’s Visible Learning. The first is the way that the effect size metric is used (quoted as ‘months of additional progress’ in the case of the Education Endowment Foundation) and the second is the way that very different kinds of interventions are packaged together.
The effect size metric
There is, in my opinion, nothing intrinsically wrong with effect sizes. This is not an argument about whether effect sizes are good or bad, it is about the validity of the inferences that we may draw from them.
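For readers unfamiliar with the metric: the effect sizes in question are standardised mean differences (Cohen's d), the gap between the treatment and control group means divided by a pooled standard deviation. A minimal sketch, with entirely hypothetical test scores:

```python
import statistics

def cohens_d(treatment, control):
    """Standardised mean difference: (mean_t - mean_c) / pooled SD."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    var_c = statistics.variance(control)
    pooled_sd = (((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Hypothetical standardised reading-test scores
treatment = [102, 98, 110, 105, 99, 107]
control = [95, 101, 97, 100, 94, 96]
print(f"d = {cohens_d(treatment, control):.2f}")
```

Note that d is expressed in units of the spread of scores, so its meaning depends on who was sampled and what was measured, which is exactly why the comparability conditions below matter.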
For example, in a 2011 study, Slavin, Lake, Davis and Madden used effect sizes to compare different kinds of reading intervention. They found that interventions based upon Reading Recovery or a similar approach tended to have lower effect sizes than those based upon explicit phonics teaching. Crucially, the studies they compared were alike in several respects:
- The participants were of a similar age.
- The interventions were all one-to-one, so they are not comparing a one-to-one approach with a whole-class or small-group approach.
- The measures were all standardised reading tests.
- The studies were all experiments or quasi-experiments.
In this case, it seems reasonable to compare effect sizes (even if I might query lumping experimental and quasi-experimental studies together). It’s still not definitive and I would prefer three-armed studies to be run that compare both interventions with a control (I’ve argued that the Education Endowment Foundation should run such trials but their practice of randomising at the school level leaves many two-armed studies underpowered and so this is unlikely to happen soon).
In meta-meta-analysis, we take no such precautions about comparing like with like. Robert Slavin has argued that, because good quality studies tend to have effect sizes less than d=0.4*, if we do as Hattie suggests and apply this as a filter, we select in favour of the crappy studies. So it’s not just that poor studies are in the mix; they carry disproportionate weight.
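Slavin’s point can be illustrated with a toy simulation (all numbers hypothetical). Suppose every study estimates the same true effect of d=0.2, but small studies are noisier than large ones. The noisy studies are far more likely to clear a d≥0.4 filter, and the filtered average lands well above the true effect:

```python
import random

random.seed(1)

TRUE_D = 0.2       # hypothetical true effect shared by every study
N_STUDIES = 10_000

def simulate(n_per_group):
    # Rough sampling error of d for two equal groups of size n: sqrt(2/n)
    se = (2 / n_per_group) ** 0.5
    return [random.gauss(TRUE_D, se) for _ in range(N_STUDIES)]

small = simulate(20)    # noisy studies, se ≈ 0.32
large = simulate(500)   # precise studies, se ≈ 0.06

kept = [d for d in small + large if d >= 0.4]  # the d >= 0.4 filter
small_kept = sum(1 for d in small if d >= 0.4)
large_kept = sum(1 for d in large if d >= 0.4)

print(f"small studies passing the filter: {small_kept}")
print(f"large studies passing the filter: {large_kept}")
print(f"mean d after filtering: {sum(kept) / len(kept):.2f}")
```

Almost everything that survives the filter is a small, imprecise study, and the average of the survivors overstates the true effect by a wide margin, even though no study here is biased at all.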
When it comes to the Education Endowment Foundation’s Toolkit, I have paid most of my attention to the ‘metacognition and self-regulation’ strand because I have started to notice that programmes that would once have sold themselves as ‘constructivist’ or ‘reform maths’ now tend to claim that they encourage metacognition. A recent trial of a maths programme in South Australia is a current example.
If you investigate this strand, you find that experimental studies conducted by the Education Endowment Foundation tend to have much lower effect sizes than the ones quoted in the literature review (with a notable exception that we will visit later). In addition, these studies assess very different outcomes such as:
- Critical thinking
What valid inferences may we draw from an average effect size over such a range of outcomes, even if we set aside the ways that sampling, the age of subjects, study quality and so on vary the effect size? Which leads to the next point.
What does an explicit writing programme, where students are taught how to plan their writing, have in common with a generic thinking programme where students discuss such matters as whether it is acceptable to hit a teddy bear? Both sit within the Education Endowment Foundation’s strand of ‘metacognition and self-regulation’.
The first programme, Improving Writing Quality, is an example of what Barak Rosenshine would describe as ‘strategy instruction’ and he uses strategy instruction as one of his three main sources of evidence for the effectiveness of direct instruction. It is therefore hardly surprising that this intervention had by far the largest effect size of any of the experimental studies in this strand (d=0.74).
The second programme, Philosophy for Children, does not have such a pedigree. The effect sizes for reading and maths are d=0.14 and d=0.13 respectively; that is, if you believe there was any effect at all.
So why are we averaging all these very different things? It’s like measuring the heights of three dogs, two cats and a rabbit and then working out the average. What valid inferences may we draw from such a figure?
A way forward
Professor Robert Coe has suggested that critics of meta-meta-analysis need to propose a better alternative. I don’t think we actually bear this burden, but here’s my view: why not produce reviews that set out all the available evidence for a particular kind of intervention, but without computing an overall effect size? [The amorphous nature of the categories would then be less of an issue – this point wasn’t in my talk but was a clarification in response to a challenge from Professor Higgins].
Professor Coe has asked how we could, for instance, provide teachers with evidence about learning styles without meta-meta-analysis. The last place I would send anyone to look for this evidence is the Toolkit. Although the commentary is sound, right at the top is the spurious figure of ‘two months of additional progress’.
*Steve Higgins disputed this in the discussion, claiming that in the studies he has analysed, study quality does not affect effect sizes.