A key flaw in the idea of comparing effect sizes

I was alerted* to a new paper by Hans Luyten, Christine Merrell and Peter Tymms about effect sizes. Effect sizes get bandied around a lot in education and many will have heard of John Hattie’s figure of d=.40 (0.4 of a standard deviation) as an effect size worth having. However, this is pretty blunt. We know, for instance, that effect sizes may be impacted by a number of factors, including the age of children.
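
For readers unfamiliar with the statistic, here is a minimal sketch of how a standardised effect size of this kind can be computed, assuming Cohen's d with a pooled standard deviation. The scores are invented purely for illustration and the function name `cohens_d` is hypothetical; individual studies may calculate d in slightly different ways.

```python
from statistics import mean, variance

def cohens_d(treatment, control):
    """Difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5

# Invented scores: the treatment group is 2 points ahead of the control group
control = [48, 50, 52, 54, 56]
treatment = [50, 52, 54, 56, 58]
d = cohens_d(treatment, control)

# Same 2-point gain, but the scores are twice as spread out
wide_control = [44, 48, 52, 56, 60]
wide_treatment = [46, 50, 54, 58, 62]
d_wide = cohens_d(wide_treatment, wide_control)

print(round(d, 2), round(d_wide, 2))
```

With these invented numbers, the two-point gain gives d of about 0.63 for the tighter group and about 0.32 for the more spread-out group. The same raw gain can therefore look very different once it is expressed as an effect size.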

Luyten and colleagues decided to try measuring some baseline effect sizes. In other words, they decided to determine the effect for normal teaching in different subject areas and for different groups of students. The students they chose were primary students in England who mostly attended independent schools. Effect sizes were calculated for reading, mental arithmetic, general mathematics and a construct called ‘developed ability’ that included picture vocabulary and pattern recognition. They used a regression-discontinuity approach, which compares very similar students in two different year groups, i.e. children falling either side of an arbitrary cut-off date for entry into a particular year group.
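
The logic of a regression-discontinuity comparison can be sketched with simulated data. This is a simplified illustration and not the model used in the paper: the true schooling effect, the age slope, the noise level and the helper `fit_line` are all assumptions made up for the example. A straight line is fitted to scores on each side of the cut-off, and the gap between the two lines at the cut-off is taken as the estimate of the extra year of schooling, separate from the effect of simply being older.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(1)
TRUE_JUMP = 6.0   # assumed effect of one extra year of schooling (made up)
AGE_SLOPE = 0.5   # assumed gain per month of age alone (made up)

# x = months older than the cut-off; x >= 0 puts the child in the higher year group
xs = [random.uniform(-12, 12) for _ in range(600)]
ys = [20 + AGE_SLOPE * x + (TRUE_JUMP if x >= 0 else 0) + random.gauss(0, 2)
      for x in xs]

lower = [(x, y) for x, y in zip(xs, ys) if x < 0]
upper = [(x, y) for x, y in zip(xs, ys) if x >= 0]
intercept_lower, _ = fit_line(*zip(*lower))
intercept_upper, _ = fit_line(*zip(*upper))

# Both lines are evaluated at x = 0, so the gap estimates the schooling effect
estimated_jump = intercept_upper - intercept_lower
print(round(estimated_jump, 2))
```

Dividing a jump like this by a standard deviation of the scores would turn it into a d-style effect size.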

The good news is that schooling seems to have an effect. However, the effect sizes varied. They generally became smaller as children grew older. For reading, the effect size shrank from d=.55 when comparing Year 1 and Year 2 to d=.08 when comparing Years 5 and 6. Is this an indication that, once children can decode, reading cannot really be taught? See this Hirsch article for a discussion relevant to this idea.

It is therefore tempting to suggest that, if this is representative of students more generally, a reading intervention in Year 5 with an effect size of d=.30 would be worth having. However, we tend to intervene with struggling students who may in fact look more like the younger students in this study. So what effect size should we be seeking?

I think that it is impossible to say.

The changes in effect size for general maths are less severe but still significant, ranging from d=.47 to d=.27. And this only goes as far as the end of primary school. What would these look like at secondary?

The evidence is now building. We need to move away from tables of effects such as those produced by Hattie and the Education Endowment Foundation toolkit. Effect sizes will depend on student age and level of advancement in their learning. Claiming that amorphous intervention X will deliver four months’ extra progress is invalid and unhelpful. Instead of comparing effect sizes and hoping they cross an arbitrary threshold, we need to do more studies that compare Intervention A with Intervention B and a control. In the long term, this is the only way we will remove the junk from our data: by comparing like with like.

*Best Evidence in Brief is an excellent email sent out by the Institute for Effective Education in York, England. It’s well worth subscribing to.


12 thoughts on “A key flaw in the idea of comparing effect sizes”

  1. kesheck says:

    “However, we tend to intervene with struggling students who may in fact look more like the younger students in this study.”

    Forgive me, but I’m not sure what you’re saying here. I think I understand the previous sentence – that Hattie’s threshold of .4 might be setting the bar too high in judging a treatment used on students in Years 5 and 6. Do I have that right?

    I think my confusion is because I’ve never taken a methods class. But I really want to understand your point here.

  2. Wouldn’t you expect effect sizes to decrease as students get older due to the increasing spread (SD) of ability within student populations as they get older? Not sure if this is relevant to the study you mention but it is seen in large data sets from US on academic measures of 5 yr old compared to 15 yr old.

    Robert Slavin has some good stuff on effect sizes, blogs in Huffington Post. Main takeaway for me is you cannot use effect sizes like Hattie does to compare interventions, the actual effect size gained depends on the type and size of the study, among other things.

  3. Greg, Thank you for your kind words about Best Evidence in Brief. I was keen to include this study in the latest issue because I hoped it would advance readers’ (and my own) understanding of what quoted effect sizes might really mean. I think Hattie’s work was helpful in encouraging educators to consider the relative impact of different approaches, but he has set an effect size “bar” that isn’t realistic. The notion that something must have an effect size of at least +0.4 to be educationally important isn’t borne out in more robust, real-world studies, or by the context. When I’m looking at studies for possible inclusion in BEiB, if it’s above +0.5 I start to get suspicious, rather than being impressed, and delve into the design, etc.
    I think there will always be a place for broad overviews like the EEF toolkit as a starting point for those engaging with the evidence. People have to start somewhere, and it has to be accessible. That simplification will inevitably sometimes lead to over-simplification, but that’s an opportunity for further engagement, discussion, etc. We’re aware of this even with BEiB, that we shouldn’t assume a certain level of knowledge, and we should be willing to cover issues (eg, learning styles, or that great recent guidance on cognitive load theory) that we might think are “sorted”. As Bob Slavin is fond of quoting, “It’s not one damn thing after another. It’s the same damn thing over and over.”

  4. I’ve never thought the effect sizes to be more than a (useful) guideline. Anecdotally I hear about people misrepresenting the data, which I find entirely plausible, to further a particular pedagogy although not necessarily with any malicious intent. Is this an issue for better understanding of what this type of research can and can’t provide rather than removal of effect sizes from discussion?

  5. Pingback: Can teaching cause learning? | Filling the pail

  6. Arthur Pendrill says:

    Effect sizes aren’t even a useful guideline. They are not measures of educational importance, but measures of how clear a difference there was between two groups (which depends on what you measure, who the two groups are and what the control group does). You simply can’t say that if one experiment gave an effect size of 0.5 and another gave an effect size of 0.2, the first intervention is educationally better, even if it is similar age pupils, in a similar subject. Simpson gives a very clear discussion which explains why you can’t compare effect sizes, let alone average them as in these large scale syntheses (see http://bit.ly/2xplAG4)

  7. Pingback: Alphabetical Signposts To Teacher Excellence – E – Teach innovate reflect

  8. Pingback: What are the implications of John Hattie being wrong? – Filling the pail
