Question: You often quote effect sizes / p-values. Are you not aware of the problems with these measures?
Answer: Yes, I am aware of the problems, but they don’t make these measures useless.
P-values are used in null-hypothesis testing. They are a traditional approach to analysing psychological studies (education research is a subset of this field). A p-value answers variations of the following question: If the null hypothesis were true, how likely is it that we would have collected data at least as extreme as this?
What’s the null hypothesis? If you are doing a study to see the effect that singing a song about pirates has on children’s maths performance then you would need to set up two groups. The first group would sing the song – this is the intervention group. The second group – the control group – would not. Crucially, you would need to randomly allocate children to each group or the p-value – which is a probability – would make no sense. The null hypothesis is then the hypothesis that singing a song about pirates has no effect on children’s maths performance.
So you collect the data, do your sums and the p-value gives you the probability that you would have generated data at least this extreme in a world where the null hypothesis was true. What it does not tell you is the probability that the null hypothesis is true given the data.
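To make "a world where the null hypothesis was true" concrete, here is a minimal sketch of one way to estimate a p-value: a permutation test. This is my illustration rather than the method of any particular study, and the scores are made up. If the song does nothing, the group labels are arbitrary, so we can reshuffle them many times and count how often the reshuffled gap between group averages is at least as big as the one we observed.

```python
import random


def permutation_p_value(intervention, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Under the null hypothesis the group labels are arbitrary, so we reshuffle
    them many times and count how often the reshuffled difference is at least
    as large as the one actually observed.
    """
    rng = random.Random(seed)
    n = len(intervention)
    pooled = list(intervention) + list(control)

    def mean_gap(scores):
        # Absolute difference between the averages of the two groups.
        first, second = scores[:n], scores[n:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = mean_gap(pooled)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        if mean_gap(pooled) >= observed:
            hits += 1
    return hits / n_shuffles


# Made-up maths scores, purely for illustration.
pirate_song = [14, 15, 13, 16, 15, 17, 14, 16]
no_song = [13, 14, 12, 15, 13, 14, 12, 15]
p = permutation_p_value(pirate_song, no_song)
print(f"p = {p:.3f}")
```

Note that the number printed is the proportion of shuffled, no-effect worlds that look at least as extreme as our data. It is not the probability that the song has no effect.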
Why does this matter? It matters because the null hypothesis could be very unlikely. For instance, imagine if our null hypothesis was that teaching children facts about animals had no effect on them learning facts about animals. The p-value would be an odd thing to try to interpret because we can be fairly certain that the null hypothesis is false.
It gets worse. Imagine I do 20 studies and only one of these generates a p-value less than 0.05. I might be tempted to send this study in to a journal and the journal might be tempted to publish it. And yet we would expect, on average, 1 in 20 studies to generate such a result even if the null hypothesis was true. Publishing only the hits in this way is usually called publication bias or the “file-drawer problem”. A closely related practice, known as “p-hacking”, is to measure 20 different factors and only report on the one that reaches significance.
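The arithmetic here is easy to check. The sketch below assumes 20 independent studies of an intervention that truly does nothing, and relies on the fact that, for a well-behaved test, the p-value in such a world is a uniform random number between 0 and 1. The batch count and seed are arbitrary choices of mine.

```python
import random

ALPHA = 0.05      # the conventional significance threshold
N_STUDIES = 20    # studies run, all with a true null hypothesis

# Analytic answer: chance that at least one of the 20 independent null
# studies comes out 'significant'.
analytic = 1 - (1 - ALPHA) ** N_STUDIES


def simulate_at_least_one_hit(n_batches=20_000, seed=1):
    """Fraction of 20-study batches containing at least one p < ALPHA."""
    rng = random.Random(seed)
    batches_with_hit = 0
    for _ in range(n_batches):
        # Assumption: under a true null each p-value is a uniform draw.
        p_values = [rng.random() for _ in range(N_STUDIES)]
        if min(p_values) < ALPHA:
            batches_with_hit += 1
    return batches_with_hit / n_batches


simulated = simulate_at_least_one_hit()
print(f"expected 'significant' studies per batch: {ALPHA * N_STUDIES:.1f}")
print(f"chance of at least one: analytic {analytic:.2f}, simulated {simulated:.2f}")
```

So on average one study in the batch clears the bar by luck alone, and the chance that at least one does is roughly 64 per cent.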
Finally, due to the way that p-values are calculated, if we have a large enough number of subjects in our study (a large enough N) we are likely to get a ‘significant’ value for any difference that is not exactly zero, however trivially small it may be.
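A rough illustration of the large-N point, under assumptions that are mine rather than drawn from any real study: two equal-sized groups, an observed difference fixed at a tiny 0.02 standard deviations, and a simple z-test using the normal approximation. The only thing that changes is the number of subjects.

```python
import math


def two_sided_p(effect_in_sd, n_per_group):
    """Two-sided p-value from a z-test on two equal-sized groups.

    Assumes the observed difference in means is exactly `effect_in_sd`
    standard deviations, and uses the normal approximation.
    """
    # The standard error of the difference is sd * sqrt(2 / n), so the
    # z-score is the effect (in sd units) times sqrt(n / 2).
    z = effect_in_sd * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


TINY_EFFECT = 0.02  # hypothetical: 2% of a standard deviation
for n in (100, 10_000, 100_000):
    print(f"N per group = {n:>7}: p = {two_sided_p(TINY_EFFECT, n):.4f}")
```

The same negligible difference goes from nowhere near significant to comfortably below 0.05 purely because N has grown.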
John Hattie has probably done the most to popularise the use of effect sizes in education. Many teachers will be aware that in his 2009 book, “Visible Learning,” he made the claim that an effect size greater than d=0.40 is an effect worth having.
The effect size is a simple measure of the difference between the two averages of the intervention and control group divided by the ‘standard deviation’. The standard deviation is a measure of how spread-out the data is: if you had a test where scores were tightly bunched around 70 out of 100 then the standard deviation would be low compared to one where they were spread out with lots of scores of 30, 90 and so on.
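As a sketch of the calculation, here is one common version, usually called Cohen's d. The description above does not say which standard deviation to divide by; I am assuming the pooled standard deviation of the two groups, which is one convention among several. The scores are invented.

```python
import math
from statistics import mean, variance


def cohens_d(intervention, control):
    """Difference in group means divided by the pooled standard deviation.

    Assumption: the pooled SD is used as the divisor. Some studies divide
    by the control group's SD instead, which gives a different number.
    """
    n1, n2 = len(intervention), len(control)
    pooled_var = (
        (n1 - 1) * variance(intervention) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(intervention) - mean(control)) / math.sqrt(pooled_var)


# Invented test scores out of 100, purely for illustration.
intervention_scores = [72, 68, 75, 70, 74, 69]
control_scores = [70, 66, 71, 69, 72, 66]
print(f"d = {cohens_d(intervention_scores, control_scores):.2f}")
```

Because d is expressed in standard deviation units, it depends on how spread-out the scores are just as much as on the gap between the averages.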
Hattie’s error, in my view, is to treat all effect sizes equally, regardless of the study design. For instance, sometimes he compares before-treatment with after-treatment scores rather than a control group with an intervention group. We expect teaching to generate an effect over time and so if our study is long then a before-and-after effect of 0.4 is not very impressive.
On the other hand, if we have a randomised controlled trial (RCT) comparing a control with an intervention then any effect size above zero might be interesting. This is one reason why we should not hold the Education Endowment Foundation RCTs to Hattie’s d=0.40 standard.
Returning to the maths, the effect size is a simple quotient: difference in averages divided by standard deviation. There are therefore two ways you can increase it. You can increase the difference in averages or reduce the standard deviation.
Increasing the difference in averages can be done by writing a test that is particularly sensitive to the subject of your intervention. When experimenters design their own tests they tend to do this. This is why effect sizes on experimenter-designed tests tend to be greater than if standardised tests are used. Similarly, when you are teaching key threshold concepts to very young children, you tend to get large differences in average scores quite quickly. We should therefore not compare effect sizes from early primary with those from secondary school.
Finally, you can reduce the standard deviation of a group by being selective about the population. For instance, the standard deviation for an experiment conducted in a selective school might be expected to be lower than one conducted in a school with a comprehensive intake. The selective school study would therefore generate a larger effect size if everything else is kept the same.
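Here is a small simulation of that last point. Everything in it is hypothetical: a population of scores with mean 100 and standard deviation 15, a selective school that admits only those scoring above 115, and an intervention that adds the same 5 points on average in either setting.

```python
import random
from statistics import stdev

rng = random.Random(42)

# Hypothetical comprehensive intake: scores roughly normal, mean 100, SD 15.
comprehensive = [rng.gauss(100, 15) for _ in range(10_000)]

# Hypothetical selective intake: only those above an entrance cutoff of 115.
selective = [score for score in comprehensive if score > 115]

GAIN = 5.0  # assumed identical average gain in both schools

comprehensive_sd = stdev(comprehensive)
selective_sd = stdev(selective)

# Same difference in averages, different divisor.
d_comprehensive = GAIN / comprehensive_sd
d_selective = GAIN / selective_sd

print(f"comprehensive: SD = {comprehensive_sd:.1f}, d = {d_comprehensive:.2f}")
print(f"selective:     SD = {selective_sd:.1f}, d = {d_selective:.2f}")
```

The intervention is no more effective in the selective school. The effect size is larger only because the scores there are bunched more tightly.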
A counsel of despair?
Should we therefore declare these statistical tests a nonsense? There are alternatives and these alternatives have their advocates. However, I suspect that if we are looking for a magic number that will tell us everything we need to know about a study – or worse, a group of studies – then we are asking the wrong question.
Ultimately, we need to read the details of the study itself. An experiment might generate both a low p-value and a high effect size but if it is confounded – if it varies more than one thing at a time – then it might not tell us much at all.
Statistical tests can be useful. They can give us a guide and if properly understood can help us make tentative comparisons between studies. But they must be understood with the detail of the studies in mind. Blindly applying d=0.40 or p=0.05 may be quick but it’s dirty. There’s no real alternative to reading that boring bit of the paper between the introduction and the discussion.