Question: You often quote effect sizes / p-values. Are you not aware of the problems with these measures?

Answer: Yes, I am aware of the problems, but that doesn’t make them useless.

**P-values**

P-values are used in null-hypothesis testing. They are a traditional approach to analysing psychological studies (education research is a subset of this field). A p-value answers variations of the following question: If the null hypothesis were true, how likely is it that we would have collected data at least as extreme as this?

What’s the null hypothesis? If you are doing a study to see the effect that singing a song about pirates has on children’s maths performance then you would need to set up two groups. The first group would sing the song – this is the intervention group. The second group – the control group – would not. Crucially, you would need to randomly allocate children to each group or the p-value – which is a probability – would make no sense. The null hypothesis is then the hypothesis that singing a song about pirates has *no* effect on children’s maths performance.

So you collect the data, do your sums and the p-value gives you the probability that you would have generated data at least this extreme in a world where the null hypothesis was true. What it *does not* tell you is the probability that the null hypothesis is true given the data.
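To make “do your sums” a little more concrete, here is a minimal sketch of one way a p-value could be estimated: a permutation test, which repeatedly re-shuffles children between the two groups and counts how often the shuffled difference is at least as large as the one observed. The scores and the function name `permutation_p_value` are made up for illustration, and this is only one of several tests a real study might use.

```python
import random
from statistics import mean

def permutation_p_value(intervention, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value by re-shuffling group membership."""
    rng = random.Random(seed)
    observed = abs(mean(intervention) - mean(control))
    pooled = list(intervention) + list(control)
    n = len(intervention)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis, group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical maths scores, invented for this example.
pirate_song = [12, 15, 14, 16, 13, 17, 15, 14]
no_song = [11, 13, 12, 14, 12, 13, 15, 11]

p = permutation_p_value(pirate_song, no_song)
print(f"p = {p:.3f}")
```

Notice that the shuffle only makes sense because children were randomly allocated in the first place, and that the result is the probability of the data given the null, not the probability of the null given the data.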

Why does this matter? It matters because the null hypothesis could be very unlikely. For instance, imagine if our null hypothesis was that teaching children facts about animals had no effect on them learning facts about animals. The p-value would be an odd thing to try to interpret because we can be fairly certain that the null hypothesis is false.

It gets worse. Imagine I do 20 studies and only one of these generates a p-value less than 0.05. I might be tempted to send this study in to a journal and the journal might be tempted to publish it. And yet we would expect 1 in 20 studies to generate such a result even if the null hypothesis were true in every case. Selective reporting of this kind is usually called publication bias or the “file-drawer problem”, and it is closely related to “p-hacking” (one form of which is to measure 20 different factors and only report on one of them).
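The arithmetic behind the 1-in-20 figure can be sketched as follows. This assumes the 20 studies are independent and that the null hypothesis is true in all of them, in which case each p-value behaves like a random number drawn uniformly between 0 and 1. The names `ALPHA`, `N_STUDIES` and `simulate_batches` are my own.

```python
import random

ALPHA = 0.05      # conventional significance threshold
N_STUDIES = 20    # studies run, all with a true null (assumption)

# Expected number of 'significant' results among the 20 studies.
expected_false_positives = ALPHA * N_STUDIES

# Chance that at least one of the 20 studies comes out 'significant'.
p_at_least_one = 1 - (1 - ALPHA) ** N_STUDIES

def simulate_batches(n_batches=10_000, seed=1):
    """Fraction of 20-study batches containing at least one p < ALPHA."""
    rng = random.Random(seed)
    batches_with_hit = 0
    for _ in range(n_batches):
        # Under a true null, p-values are uniform on [0, 1].
        p_values = [rng.random() for _ in range(N_STUDIES)]
        if min(p_values) < ALPHA:
            batches_with_hit += 1
    return batches_with_hit / n_batches

print(expected_false_positives)   # 1.0
print(round(p_at_least_one, 3))   # 0.642
print(simulate_batches())         # should land close to 0.642
```

So even with nothing going on at all, roughly two batches in three would contain something publishable.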

Finally, due to the way that p-values are calculated, if we have a large enough number of subjects in our study (a large enough N) we are likely to get a ‘significant’ value regardless.
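A rough sketch of why sample size matters so much: suppose the observed difference between groups is a fixed, tiny 0.05 standard deviations, and suppose we use a simple two-sample z-test with equal group sizes and a known standard deviation (both simplifying assumptions of mine). The p-value then depends only on N. The function name `p_value_for` and the value of `TINY_EFFECT` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def p_value_for(effect_size, n_per_group):
    """Two-sided z-test p-value when the observed difference,
    in standard deviation units, equals effect_size."""
    # Standard error of the difference in means is sqrt(2 / n) in SD units.
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

TINY_EFFECT = 0.05  # hypothetical, educationally trivial difference

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(p_value_for(TINY_EFFECT, n), 4))
```

The same trivial difference moves from nowhere near significant to highly significant as N grows. Note that this relies on the difference being something other than exactly zero: with an effect of exactly zero, the calculation above returns p = 1 whatever the sample size.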

**Effect sizes**

John Hattie has probably done the most to popularise the use of effect sizes in education. Many teachers will be aware that in his 2009 book, “Visible Learning,” he made the claim that an effect size greater than d=0.40 is an effect worth having.

The effect size is a simple measure: the difference between the averages of the intervention and control groups divided by the ‘standard deviation’. The standard deviation is a measure of how spread-out the data is: if you had a test where scores were tightly bunched around 70 out of 100 then the standard deviation would be low compared to one where they were spread out with lots of scores of 30, 90 and so on.
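As a minimal sketch, the calculation might look like this. I have assumed the pooled standard deviation of the two groups as the divisor, which is one common choice (Cohen's d); other conventions exist and give somewhat different numbers. The scores are invented.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(intervention, control):
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(intervention), len(control)
    # Pool the two sample variances, weighting by degrees of freedom.
    pooled_variance = (
        (n1 - 1) * variance(intervention) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(intervention) - mean(control)) / sqrt(pooled_variance)

# Hypothetical test scores out of 100.
intervention_scores = [72, 68, 75, 71, 69, 74]
control_scores = [70, 66, 71, 69, 67, 72]

print(round(cohens_d(intervention_scores, control_scores), 2))
```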

Hattie’s error, in my view, is to treat all effect sizes equally, regardless of the study design. For instance, sometimes he compares before-treatment with after-treatment scores rather than a control group with an intervention. We expect teaching to generate an effect over time and so if our study is long then an effect of 0.4 is not very impressive.

On the other hand, if we have a randomised controlled trial (RCT) comparing a control with an intervention then any effect size above zero might be interesting. This is one reason why we should not hold the Education Endowment Foundation RCTs to Hattie’s d=0.40 standard.

Returning to the maths, the effect size is a simple quotient: difference in averages divided by standard deviation. There are therefore two ways you can increase it. You can increase the difference in averages or reduce the standard deviation.

Increasing the difference in averages can be done by writing a test that is particularly sensitive to the subject of your intervention. When experimenters design their own tests they tend to do this. This is why effect sizes on experimenter-designed tests tend to be greater than if standardised tests are used. Similarly, when you are teaching key threshold concepts to very young children, you tend to get large differences in average scores quite quickly. We should therefore not compare effect sizes from early primary with those from secondary school.

Finally, you can reduce the standard deviation of a group by being selective about the population. For instance, the standard deviation for an experiment conducted in a selective school might be expected to be lower than one conducted in a school with a comprehensive intake. The selective school study would therefore generate a larger effect size if everything else is kept the same.
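To put some made-up numbers on this: suppose an intervention raises average scores by 5 marks in both schools, but the spread of scores is 15 marks in the comprehensive school and 10 marks in the selective one. All three figures are hypothetical, chosen only to show the arithmetic.

```python
def effect_size(mean_difference, standard_deviation):
    """Effect size as the simple quotient described in the text."""
    return mean_difference / standard_deviation

gain = 5.0               # same improvement in marks in both schools
comprehensive_sd = 15.0  # wider spread of scores
selective_sd = 10.0      # narrower spread of scores

print(round(effect_size(gain, comprehensive_sd), 2))  # 0.33
print(round(effect_size(gain, selective_sd), 2))      # 0.5
```

The same 5-mark gain falls below d=0.40 in one school and above it in the other.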

**A counsel of despair?**

Should we therefore declare these statistical tests a nonsense? There are alternatives and these alternatives have their advocates. However, I suspect that if we are looking for a magic number that will tell us everything we need to know about a study – or worse, a group of studies – then we are asking the wrong question.

Ultimately, we need to read the details of the study itself. An experiment might generate both a low p-value and a high effect size but if it is confounded – if it varies more than one thing at a time – then it might not tell us much at all.

Statistical tests can be useful. They can give us a guide and if properly understood can help us make tentative comparisons between studies. But they must be understood with the detail of the studies in mind. Blindly applying d=0.40 or p=0.05 may be quick but it’s dirty. There’s no real alternative to reading that boring bit of the paper between the introduction and the discussion.

And if the p value is not small the effect size may be ignored.

Hi Greg

Really appreciated your post.

Effect sizes, by themselves without the context, sample and design, are a fairly blunt instrument. The differences you describe between within-group and between-group studies, and their impact on the subsequent conclusions, are rarely discussed. Also, the type of effect size used (and I don’t include the Hattie equation in this field), such as Cohen’s d (in its various forms, depending on design) or Hedges’ g, can have a significant impact.

From my perspective, it is unfortunate that a one-size-fits-all perception (and sell) of the application of effect sizes has turned it into a fairly blunt measure.

Thanks again

You say: “due to the way that p-values are calculated, if we have a large enough number of subjects in our study (a large enough N) we are likely to get a ‘significant’ value regardless.”

But this is not completely correct. If you have a large N compared to a small N, then you are more likely to get a significant value when the null hypothesis is false. But when the null is true this is not the case.

Some people, following Cohen, like to claim that “in the real world” null hypotheses are never true. But this isn’t the case: in true experiments the null can be (and often is) true because of randomisation into groups. These cases can easily be simulated with random numbers in Excel or R, and by doing this you can see that significant p values aren’t more likely here for large N compared to small N when the null is true.
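For anyone who would rather try this in Python than Excel or R, here is a rough sketch of such a simulation. Both groups are drawn from the same normal distribution, so the null is true by construction, and a simple two-sample z-test with known standard deviation is used for convenience (an assumption of this sketch, not a recommendation). The function name `false_positive_rate` and the sample sizes are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(n_per_group, n_sims=1000, alpha=0.05, seed=2):
    """Share of simulated experiments with p < alpha when the null is true."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: no real effect.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        z = diff / sqrt(2 / n_per_group)
        p = 2 * (1 - standard_normal.cdf(abs(z)))
        if p < alpha:
            hits += 1
    return hits / n_sims

small_n_rate = false_positive_rate(20)
large_n_rate = false_positive_rate(500)
print(small_n_rate, large_n_rate)  # both should sit near 0.05
```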

Daniel Lakens has a nice discussion:

http://daniellakens.blogspot.co.uk/2014/06/the-null-is-always-false-except-when-it.html

I also disagree with your suggestion that p values make no sense in the case where participants aren’t randomised into groups. The interpretation is trickier, and the conclusions you can draw are weaker, but they’re still useful: you can say something like “this analysis is testing the model where we assume that the value of the characteristic I’m measuring has been randomly assigned to people independently of the variable I’m basing my groups on”. If you get a low p value you can conclude that that model is untenable. You can’t conclude causality, but nevertheless I think this is a useful way of stopping yourself from over-interpreting descriptive statistics.