The following is a version of the argument that I advanced in yesterday’s panel discussion about meta-analysis and meta-meta-analysis at researchED 2018. It’s probably a little more coherent than the live version due to the ability to edit. It is worth reading Robert Coe’s posts on the same issue.
There are two interlinked problems with the kind of meta-meta-analysis represented by the Education Endowment Foundation’s Toolkit and John Hattie’s Visible Learning. The first is the way that the effect size metric is used (quoted as ‘months of additional progress’ in the case of the Education Endowment Foundation) and the second is the way that very different kinds of interventions are packaged together.
The effect size metric
There is, in my opinion, nothing intrinsically wrong with effect sizes. This is not an argument about whether effect sizes are good or bad, it is about the validity of the inferences that we may draw from them.
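(For readers unfamiliar with the metric: an effect size of the kind used in these meta-analyses is typically a standardised mean difference such as Cohen's d, that is, the difference between the treatment and control group means divided by a pooled standard deviation. Here is a minimal sketch; the function and the sample scores are mine, purely for illustration.)

```python
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Standardised mean difference: (mean_t - mean_c) / pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_sd = (((n_t - 1) * stdev(treatment) ** 2 +
                  (n_c - 1) * stdev(control) ** 2) / (n_t + n_c - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Made-up post-test scores for two small groups, for illustration only
print(round(cohens_d([68, 72, 75, 80, 77], [65, 70, 71, 74, 69]), 2))  # 1.15
```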
For example, in a 2011 study, Slavin, Lake, Davis and Madden used effect sizes to compare different kinds of reading intervention. They found that those based upon Reading Recovery or similar approaches tended to have lower effect sizes than those based upon explicit phonics teaching.
Crucially:
- The participants were of a similar age.
- The interventions were all one-to-one, so they are not comparing a one-to-one approach with a whole-class or small-group approach.
- The measures were all standardised reading tests.
- The studies were all experiments or quasi-experiments.
In this case, it seems reasonable to compare effect sizes (even if I might query lumping experimental and quasi-experimental studies together). It is still not definitive, and I would prefer three-armed studies that compare both interventions with a control (I have argued that the Education Endowment Foundation should run such trials, but its practice of randomising at the school level leaves many two-armed studies underpowered, so this is unlikely to happen soon).
In meta-meta-analysis, we take no such precautions about comparing like with like. Robert Slavin has argued that, because good quality studies tend to have effect sizes less than d=0.4*, if we do as Hattie suggests and apply this as a filter, we select in favour of the crappy studies. So it is not just that poor studies are in the mix; they carry disproportionate weight.
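To see why such a filter favours weaker studies, here is a toy simulation of my own; all of the numbers are assumed for illustration, not drawn from Slavin or Hattie. Suppose high-quality studies of an intervention cluster around a modest true effect while low-quality studies report inflated effects. Keeping only results at or above d = 0.4 then leaves a pool dominated by the weaker studies.

```python
import random

random.seed(0)

# Assumed, illustrative distributions: high-quality studies centred on d = 0.2,
# low-quality studies inflated towards d = 0.6 and more variable.
high_quality = [random.gauss(0.2, 0.15) for _ in range(500)]
low_quality = [random.gauss(0.6, 0.30) for _ in range(500)]

survivors_high = sum(d >= 0.4 for d in high_quality)
survivors_low = sum(d >= 0.4 for d in low_quality)

share_low = survivors_low / (survivors_high + survivors_low)
print(f"Low-quality share after the d >= 0.4 filter: {share_low:.0%}")  # roughly 90%
```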
When it comes to the Education Endowment Foundation’s Toolkit, I have paid most of my attention to the ‘metacognition and self-regulation’ strand because I have started to notice that programmes that would once have sold themselves as ‘constructivist’ or ‘reform maths’ now tend to claim that they encourage metacognition. A recent trial of a maths programme in South Australia is one example.
If you investigate this strand, you find that experimental studies conducted by the Education Endowment Foundation tend to have much lower effect sizes than the ones quoted in the literature review (with a notable exception that we will visit later). In addition, these studies assess very different outcomes such as:
- Reading
- Writing
- Maths
- Science
- Critical thinking
What valid inferences may we draw from an average effect size over such a range of outcomes, even if we set aside the ways that sampling, the age of subjects, the quality of the study and so on alter the effect size? Which leads to the next point.
Packaging
What does an explicit writing programme, where students are taught how to plan their writing, have in common with a generic thinking programme where students discuss such matters as whether it is acceptable to hit a teddy bear? Both sit within the Education Endowment Foundation’s strand of ‘metacognition and self-regulation’.
The first programme, Improving Writing Quality, is an example of what Barak Rosenshine would describe as ‘strategy instruction’ and he uses strategy instruction as one of his three main sources of evidence for the effectiveness of direct instruction. It is therefore hardly surprising that this intervention had by far the largest effect size of any of the experimental studies in this strand (d=0.74).
The second programme, Philosophy for Children, does not have such a pedigree. The effect sizes for reading and maths were d=0.14 and d=0.13 respectively; that is, if you really believe there was any effect at all.
So why are we averaging all these very different things? It’s like measuring the heights of three dogs, two cats and a rabbit and then working out the average. What valid inferences may we draw from such a figure?
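To make the point concrete with the figures above, here is what a naive average of those published effect sizes looks like. This is only a sketch (the Toolkit's actual pooling is weighted and more involved), but the resulting number describes neither programme.

```python
effect_sizes = {
    "Improving Writing Quality (writing)": 0.74,
    "Philosophy for Children (reading)": 0.14,
    "Philosophy for Children (maths)": 0.13,
}
naive_average = sum(effect_sizes.values()) / len(effect_sizes)
print(f"Naive average: d = {naive_average:.2f}")  # d = 0.34, true of neither programme
```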
A way forward
Professor Robert Coe has suggested that critics of meta-meta-analysis need to propose a better alternative. I don’t think we actually bear this burden, but here is my view: why not produce reviews that set out all the available evidence for a particular kind of intervention, without computing an overall effect size? [The amorphous nature of the categories would then be less of an issue; this point wasn’t in my talk but was a clarification in response to a challenge from Professor Higgins.]
Professor Coe has asked how we could, for instance, provide teachers with evidence about learning styles without meta-meta-analysis. The last place I would send anyone to look for this evidence is the Toolkit. Although the commentary is sound, right at the top is the spurious figure of ‘two months of additional progress’.
*Steve Higgins disputed this in the discussion, claiming that in the studies he has analysed, study quality does not affect effect sizes.
I like your analogy of averaging the height of different animals.
Gene Glass, who originally promoted the method of meta-analysis, said:
“The result of a meta-analysis should never be an average; it should be a graph.”
Steve Higgins claimed that effect sizes did not differ between randomised and non-randomised studies. He may be correct, but randomised trials are rarely of high quality in this field. In the medical literature there is plenty of evidence that lower-quality randomised trials tend to report larger treatment effects than high-quality ones, e.g. https://jamanetwork.com/journals/jama/article-abstract/386770
This is important. I think many in education are still at the stage of thinking RCTs are the gold standard, and that doesn’t make us critical enough of RCT design. Two big issues with them (or maybe it’s just one issue in different guises) are poor-quality controls and the varying of more than one factor at a time.
I spoke to Steve Higgins at the end of the session and asked why he didn’t do systematic education reviews using Cochrane methodology. The software’s freely available and I’m sure many medical Cochrane reviewers would help. He wasn’t hostile to the idea. Just thought it would be very expensive.
I need to know more about Cochrane reviews. Have you blogged about them?
Yes. https://ripe-tomato.org/2011/09/08/my-first-cochrane-review/ But we need to talk properly, Greg. Email me via jim.thornton@nottingham.ac.uk
Professor Terry Wrigley has another good analogy:
‘Its method is based on stirring together hundreds of meta-analyses reporting on many thousands of pieces of research to measure the effectiveness of interventions.
This is like claiming that a hammer is the best way to crack a nut, but without distinguishing between coconuts and peanuts, or saying whether the experiment used a sledgehammer or the inflatable plastic one that you won at the fair’.
I really enjoy your articles but typically have to reload them 4-5 times to reach the end. The WordPress web page seems very unstable. Lee McCulloch
Really interesting discussion. Have you listened to the Ollie Lovell podcast on comparing effect sizes (https://goo.gl/jFLaeN)? A recent paper by Simpson (https://goo.gl/Txnk23) uses an amusing analogy involving comparing pictures of elephants and princesses. I do not agree that Slavin’s paper fairly compares reading interventions using effect sizes: having standardised tests is not sufficient, because even tests intended to measure the same outcome will yield different effect sizes depending on their length and the form of their questions. The Slavin paper also looked at studies with different ranges of ability (some highly restricted, some a bit wider), and this changes effect sizes. The studies also used different comparison treatments, which again makes comparing effect sizes unfair.
We should regard comparing interventions by comparing effect sizes across studies as just as discredited in education as learning styles.