Can meta-analysis be saved?


My thoughts on meta-analysis have changed over the years. My first encounter was through John Hattie’s 2008 book, Visible Learning, and I was aware of some key objections from the outset because Hattie outlines them himself, before going on to rebut them. The Hattie hypothesis is that even though the studies he uses have different subjects, designs, durations, assessment instruments and so on, this all washes out if you look for effect sizes above the typical effect size of about 0.4 standard deviations. To a teacher new to research, this seemed like a reasonable argument.

In the intervening years, I have realised just how noisy education data is and how much effect sizes depend on particular conditions. If you want a large effect size, run a quasi-experiment with a narrow ability range of very young children and make sure that you design the outcome assessment. On the other hand, if you run a properly conducted randomised controlled trial (RCT) with older teenagers and assess them with a standardised test, you are doing well to get any positive effect at all. You can’t just mush all these different kinds of studies together as if they are equivalent.
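The point about ability range can be made concrete with a quick sketch. All the scores below are invented for illustration: the same five-point raw gain produces a far larger Cohen’s d when the sample’s ability range, and hence its standard deviation, is narrow, because the standard deviation sits in the denominator.

```python
import statistics

def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = statistics.stdev(treatment), statistics.stdev(control)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Both scenarios show the same 5-point mean gain (scores are invented).
narrow_control = [48, 50, 52, 49, 51]   # narrow ability range
narrow_treat   = [53, 55, 57, 54, 56]
broad_control  = [30, 50, 70, 40, 60]   # broad ability range
broad_treat    = [35, 55, 75, 45, 65]

print(cohens_d(narrow_treat, narrow_control))  # large d: small pooled SD
print(cohens_d(broad_treat, broad_control))    # small d: large pooled SD
```

Identical teaching, identical raw gain, yet the narrow-range study reports an effect size roughly ten times larger, which is one reason study design has to be taken into account before effect sizes are compared.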

An evolution of Hattie’s approach might be to find typical effect sizes for all of the different kinds of studies, subjects and assessments. We could then modify the effect size threshold that we look for. So 0.4 might be appropriate for quasi-experiments in primary schools, for instance, and 0.1 might be the threshold for RCTs with high school students. We could draw up a table. I don’t see anyone working on this and I think it would be challenging. In many areas, we would lack enough studies to establish these typical levels and, even once we had categorised study types, a bad study could still generate a larger effect than a good study of the same kind.
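To see why a single grand-average threshold misleads, here is a sketch with invented effect sizes, assuming only the rough baselines mentioned above (quasi-experiments running hot at about 0.4, RCTs running cold at about 0.1). Averaging across the two categories yields a figure that describes neither.

```python
# Invented effect sizes for illustration only.
quasi_effects = [0.45, 0.38, 0.52, 0.41]   # quasi-experiments, primary schools
rct_effects   = [0.08, 0.12, -0.02, 0.10]  # RCTs, high-school students

typical_quasi = sum(quasi_effects) / len(quasi_effects)          # ~0.44
typical_rct   = sum(rct_effects) / len(rct_effects)              # ~0.07
grand_mean    = sum(quasi_effects + rct_effects) / (len(quasi_effects) + len(rct_effects))

# A single grand-mean 'hinge point' of ~0.26 would flag almost every
# quasi-experiment as effective and almost every RCT as ineffective,
# regardless of the merit of the intervention being tested.
thresholds = {
    "quasi_primary":   typical_quasi,
    "rct_high_school": typical_rct,
}
print(grand_mean, thresholds)
```

The per-category table at the end is the kind of thing the evolved approach would need, and the difficulty is exactly as stated: in most cells of that table, there would not be enough studies to pin down the typical level.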

An alternative to the Hattie approach is to be very selective about the studies that you look at by excluding studies that don’t meet set criteria. We might call this the What Works Clearinghouse method. And yet this poses its own problems. You can end up making statements that ‘there is no evidence to support’ something when what you actually mean is ‘there is no evidence that fits our selection criteria’. It therefore leaves out a lot of evidence that may be indicative, if not the final say on the matter. As a consumer of research, I would rather know about these inferior studies alongside their limitations than be led into thinking that no such studies exist.

I also want to know about correlations, single-subject research or any other kind of research that might have a bearing on my decision-making. I was blown away a few years ago to read about research conducted by Kevin Wheldall and colleagues on seating arrangements. Briefly, they changed the usual arrangement of rows or groups in a classroom to the alternative arrangement before changing them back again, all the while monitoring on-task behaviour. It makes quite a compelling case for the value of seating students in rows but it is not a traditional RCT or quasi-experiment. Do we want to filter out all such research and fool ourselves into thinking it’s never been conducted? It is such exclusion/inclusion criteria that may have led to the strangely ahistorical nature of educational research that I once saw Thomas Good, veteran of process-product research, complain about at the ICSEI conference in Cincinnati.

I am fairly clear on what we should be doing to resolve these issues in the medium term. If we want to know whether teaching calculus through dance is better or worse than teaching it on horseback, then we should run an RCT with three conditions: dancing, horseback and the standard approach. This is far superior to comparing the effect size of dancing versus a control with the effect size of horseback versus a control because we know that all the other conditions are the same. There are still dangers and we still need to look at the detail to ensure that we have compared the approaches fairly, but this is something that we should be able to do.
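A sketch with invented numbers shows what goes wrong in the two-separate-trials version. If dancing and horseback are each trialled against their own control, on their own outcome test, the ranking by effect size can reverse the ranking by raw learning gain purely because the tests have different spreads. In a single three-arm RCT, every arm sits on the same test, so the two rankings must agree.

```python
# Invented numbers: two separate trials, each reporting a raw mean gain
# over its own control, measured on its own outcome test.
dancing   = {"gain": 8.0, "test_sd": 16.0}  # bigger gain, high-spread test
horseback = {"gain": 5.0, "test_sd": 5.0}   # smaller gain, low-spread test

d_dance = dancing["gain"] / dancing["test_sd"]      # 0.50
d_horse = horseback["gain"] / horseback["test_sd"]  # 1.00
# Horseback 'wins' on effect size only because its test had less spread.

# Single three-arm RCT: one shared test, one standard deviation.
shared_sd = 10.0
d_dance_rct = dancing["gain"] / shared_sd    # 0.80
d_horse_rct = horseback["gain"] / shared_sd  # 0.50
# Now the effect-size ranking matches the raw-gain ranking.
```

This is the sense in which the three-arm design holds ‘all the other conditions’ constant: the denominator, among other things, is the same for every comparison.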

When it comes to summaries of existing research, these must be entirely qualitative. There should be no attempt to compute an overall effect size or ‘months of additional progress’ in the manner of England’s Education Endowment Foundation. It is meaningless, particularly when we might dispute whether different studies represent examples of the same broad strategy or not. Instead, we need what is effectively a literature review of all the relevant evidence, its strengths and weaknesses. This should be open to continual review and discussion and it will be quite an art to make it pithy enough while still accurately capturing the state, and history, of evidence in a particular area. Nevertheless, it is a venture worth embarking upon. There is no valid alternative.


17 thoughts on “Can meta-analysis be saved?”

  1. Tom Burkard says:

    RCTs have a massive limitation that is very seldom noted: any idea which runs strongly against the prior training of teachers will seldom, if ever, be implemented faithfully. This is especially significant if it is conducted in primary education, where teachers have been fully alerted to the dangers of drill and kill and are much less likely to have a degree-level qualification in an academic subject, or when the intervention is complex.

    In a Michigan study of early reading interventions, Standerford found that such changes as teachers actually implemented “were seldom inconsistent with their prior preferences or beliefs about the best way to teach reading”. In an evaluation of the National Numeracy Strategy in England, Anghileri found that there was “…little uniformity in the methods that students use despite the apparent widespread implementation of a national Framework. Teachers appear to be selective in implementing the guidelines for calculating methods”.

    However, I think we can take it as read that the failure to find support for ideas that mirror education fashion is indeed significant. When the CfBT reviewed the massive literature promoting AfL, they noted that

    “There is only one quantitative study that has been conducted which was clearly and completely centred on studying the effect of AfL on student outcomes. This produced a significant, but modest, mean effect size of 0.32 in favour of AfL as being responsible for improving students’ results in externally mandated examinations. It must be mentioned, however, that this study has some methodological problems, explicitly recognised by their authors. These are related to the diversity of control groups they considered and the variety of tests included for measuring students’ achievement. All this affects the robustness of comparisons within the study.”

    When one considers that AfL was an international phenomenon promoted extensively in many nations, this is indeed a valuable study. The pity is that few paid any notice; AfL has died more from ennui than disproof.

  2. ijstock says:

    I think there are more basic, less formal objections too – not least the fact that you need to pre-define ‘success’ before you can measure it. As Hattie accepts, this is largely test outcomes, but there are many more desirable learning outcomes than that. Then there is the problem that the effect size of any strategy will differ with competence; in other words, novices make more apparent progress than experts, where further progress is a matter of refinement. Finally, macro effect sizes simply do not have the fine resolution to be useful in the individual classroom, let alone with individual pupils, because, as you say, there are too many other factors at play – in other words, causal density is too high, as is inevitably the case. Might I humbly inform you that I discussed this at length in my recent book ‘The Great Exception’ – and offered an alternative which I believe has far more traction for the individual teacher and learner?

  3. Totally agree with you, Greg. Ollie Lovell interviews Adrian Simpson about his paper – The misdirection of public policy: comparing and combining standardised effect sizes. The interview goes into detail about many of the points you’ve made. I think it will be on Ollie’s website in a couple of weeks –

    The Victorian Government has released its 10 High Impact Teaching Strategies (HITS), largely based on Hattie and Marzano’s meta-analyses. It looks like all our performance reviews will have to be tied to those 10 HITS. Also, in applying for jobs we apparently will now have to produce evidence we’ve used those HITS. I think this is very short-sighted and restrictive.

    One of the alternatives might be to find out from the teachers themselves what works. We have around 50,000 teachers in this state and their opinions are rarely canvassed.

    IJSTOCK can you give us more clues about your book – The Great Exception?

  4. Sorry to play devil’s advocate…do we not think that meta-analyses can provide good principles for effective teaching in a very general way?

    For example, would we, as a profession, have paid as much attention to ‘direct’ or ‘guided’ instruction if it wasn’t for Hattie’s work outlining that it had a large effect size?

    Would love your thoughts on this.


    • For the last 40+ years, direct and guided instruction has always been promoted in teacher training and textbooks, so I’m not sure this is Hattie’s doing. In my experience, over that time students have been bored and disengaged by that approach, so there have been many attempts to make the maths classroom more engaging via other strategies – problem-solving, open-ended problems, visualisation, student choice, ability grouping. But all these strategies, according to Hattie, have low effect sizes and, worse, he has used polemic like ‘disasters’, ‘going backwards’ and ‘also-rans’ to describe them.

      But now we are learning that the whole methodology is flawed and, worse, that within that methodology are mistakes, bias, cherry-picking and misrepresentation. Greg’s recent post on the EEF and ability grouping is one example, but the peer reviews contain a litany of examples in Hattie’s work. I’ve put a list of these here –

      Then there are all the contradictions. Hattie promotes direct instruction, but he also says the teacher talks too much!

      In the Australian TV doco ‘Revolution School’, Hattie promotes his software (conflict of interest?), which measures how long a teacher speaks compared to students, and I think the magic number is that a teacher should speak less than 40% of the time.

      So I agree with Greg: the effect sizes calculated by Hattie and the EEF are meaningless (for specific and general strategies) and we need to look to qualitative studies – ‘There is no valid alternative’.

  5. To a medical researcher, it seems bonkers that Hattie combines all studies of the same intervention into a single effect size. Why should “sitting in rows”, for example, have the same effect on primary children as on university students, on maths as on art teaching, on behaviour outcomes as on knowledge outcomes? In medicine, it would be like combining trials of steroids to treat rheumatoid arthritis (effective) with trials of steroids to treat pneumonia (harmful) and concluding that steroids have no effect! I keep expecting someone to tell me I’ve misread Hattie.

    • That’s a great example, Jim. I saw someone else start to do a similar thing with ranking influences on health. Firstly, the notion of ‘influence’ is quite vague, and so is ‘health’. So the notion of an influence on academic achievement is equally vague and can be measured in totally different ways.

      The rankings on health might go something like this:
      Viagra = 10
      Prozac = 9.9 (if studies were done pre-1990)
      Surgery = 5.1
      Vitamins = 3.66
      Self-reported health (expectation) = 3.1
      Feedback = 0.73
      Doctor/patient relationship = 0.72
      Home environment = 0.57
      Socio-economic status = 0.57
      Number of beds in ward = 0.30
      Home visitation = 0.29
      Doctor/patient ratio = 0.21
      Doctor training = 0.11
      Government versus private hospital = 0.03
      Steroids = -0.05
      Physiotherapy = -0.06
      Osteopathy = -0.06
      Chiropractic = -0.59 (the medical profession does not like chiros!!)
      Holistic = -0.65
      Chinese acupuncture = -1.08
      Intensive care = -2.99

      The outlier, intensive care, was from the odd spot in The Age:
      A Russian hospital found all the patients in an intensive care unit were dead every Monday morning!

      After six months they realised it was the Sunday night cleaner turning off the heart/lung machine etc so he could plug in his vacuum cleaner.

      As a result, the intensive care effect size was largely negative.

      • Love it! So why do I keep seeing some version of Hattie presented as a model to follow at “evidence-based education” talks? Are educationalists stupid? Or wilfully misunderstanding?

  6. Pingback: Can You Rely on Meta-analysis? Can You Doubt It? |Education & Teacher Conferences

  7. That’s a good question.

    Professor Scott Eacott, in his excellent analysis – School Leadership and the cult of the guru: the neo-Taylorism of Hattie – says:

    ‘Hattie’s work has provided school leaders with data that appeal to their administrative pursuits.’

    So when leaders push his stuff, the hierarchical nature of education means that any lower-ranking teacher who questions leadership is in a bit of trouble.

    Most teachers don’t have the time to think deeply about this or read the meta-analyses behind Hattie’s work, let alone look at peer reviews. Then the peer reviews are often behind paywalls anyway.

    I’ve tried to collect the peer reviews and meta-analyses and summarise them and put them in a place that’s easily accessible like this one on class size –

    But I’m finding even that’s too much for the average teacher to read.

    • Ted Lynch says:

      Hattie does have drugs on his list with an effect size of 0.33, while homework is only 0.29!

      Does this mean we should give drugs rather than homework?

      Although Hattie does not specify which particular drugs, nor the dose.

      • When I said educationalists, I didn’t mean teachers. I meant the hundreds, maybe thousands, of professors of education in the UK alone who’ve never run an RCT, recruited a participant or done a meta-analysis of RCTs, but still tell teachers how to teach.

  8. Pingback: Is the Education Endowment Foundation chicken? – Filling the pail

  9. Susan Bearing says:

    The podcast on meta-analysis (and the accompanying paper) make for fascinating listening (and reading) and really examines closely the approach adopted by Hattie – but Hattie gets to come back in the next episode it appears.

  10. Pingback: An investigation of the evidence John Hattie presents in Visible Learning – Site Title
