To many readers, it probably seems arcane to argue about statistical tests. And yet such tests, through their role in research, have a material impact on the work of teachers.
A p-value is a probability. In fact it’s a special kind of probability known as a ‘conditional probability’ that depends on certain assumptions. Imagine, for instance, that we run a study where we randomly allocate students to one of two groups. The first group is taught maths in the standard way and the second group is given an abacus to use, while everything else about the teaching is kept the same. After the teaching phase of the study, we test the students and find that the ones using the abacus get a higher average score. It is possible that the abacus has not actually made any difference (this is known as the ‘null hypothesis’) and we just happened to allocate a slightly more able mix of students to the abacus group, or perhaps the test is multiple choice and the students in the abacus group just got lucky when selecting answers.
In this case, we can ask ‘if the null hypothesis were true, what would be the probability of gathering data as extreme as, or more extreme than, the data we have gathered?’ This calculation yields a ‘p-value’: a measure of statistical significance that is the cause of much controversy in some research circles.
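That definition can be made concrete with a small simulation. The sketch below runs a permutation test on some made-up scores for the two groups (the numbers are illustrative, not real data): under the null hypothesis the group labels are arbitrary, so shuffling the labels many times tells us how often a difference in means at least as extreme as the observed one arises by chance alone.

```python
import random
import statistics

random.seed(0)

# Hypothetical test scores (illustrative numbers only):
# the abacus group happens to average a little higher.
standard = [62, 58, 71, 65, 60, 68, 63, 59, 66, 64]
abacus = [70, 61, 74, 66, 69, 72, 65, 63, 71, 68]

observed_diff = statistics.mean(abacus) - statistics.mean(standard)

# Under the null hypothesis, the labels are interchangeable: shuffle
# all scores into two random groups and count how often the difference
# in means is at least as extreme as the one we observed.
pooled = standard + abacus
n = len(standard)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[n:]) - statistics.mean(pooled[:n])
    if abs(diff) >= abs(observed_diff):
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed_diff:.1f}, p is roughly {p_value:.3f}")
```

The resulting proportion is the p-value: the chance of data this extreme or more so, *given* that the null hypothesis holds, which is exactly the conditional probability described above.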
What the p-value does not tell us is the probability that the null hypothesis is true, because we have already assumed that it is. It also does not tell us the overall probability of obtaining the data that we have obtained, because we have restricted our calculation to a condition where the null hypothesis is true. So we cannot assign probabilities in a general sense (an alternative approach attempts to plug this gap by essentially guessing some of the factors needed to calculate these probabilities but I’m not convinced that this is much of an improvement).
When I talk about p-values, I am often told by p-value sceptics that I am mistaken. I don’t understand conditional probability, they may claim. They may repeat the point I have made above: that the p-value does not tell us the probability that the null hypothesis is true. They may suggest I read more so that I will understand better.
This is a classic ‘straw man’ argument in that it opposes claims that I have not made and, as such, it becomes increasingly surreal to attempt to address it.
Ultimately, I think that sceptics simply believe that p-values don’t tell us very much and, given that they are prone to misinterpretation, we should do away with them. I disagree.
Used with appropriate caution, they can be helpful. And I will expand on that shortly. But first, let’s look at what they cannot do.
Returning to the abacus study, p-values will not tell us whether our experiment was well designed. If there is an effect, it is possible that the abacus acted as a placebo rather than imparting anything instructionally useful. If we assigned the two groups to two different teachers, then any effect might be due to the teachers rather than the abacus. If there was no effect, this might be because students never even picked up the abacus in class or because teachers didn’t use it effectively. There are lots of potential issues that p-values cannot help us with.
Moreover, imagine a world in which abacuses definitely have no impact on maths teaching, however well they are used. Imagine that lots of different research groups run studies in this world to assess the effect of abacuses. Roughly 1 in every 20 such studies should result in a p-value less than .05 purely by chance. If only those apparently significant studies get published then we have a problem.
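The 1-in-20 figure can be checked directly. This sketch simulates many studies in a world where the null hypothesis is true by construction (both groups drawn from the same distribution; the mean of 65 and standard deviation of 10 are arbitrary choices) and counts how often a simple test comes out ‘significant’ anyway.

```python
import random
import statistics

random.seed(1)

def run_study(n=50):
    """One study in a world where the abacus truly has no effect:
    both groups' scores come from the same distribution."""
    a = [random.gauss(65, 10) for _ in range(n)]
    b = [random.gauss(65, 10) for _ in range(n)]
    diff = statistics.mean(b) - statistics.mean(a)
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    # A z-statistic beyond +/-1.96 corresponds roughly to p < .05.
    return abs(diff / se) > 1.96

studies = 2_000
false_positives = sum(run_study() for _ in range(studies))
rate = false_positives / studies
print(f"{false_positives} of {studies} null studies were 'significant'")
print(f"rate is roughly {rate:.3f}")
```

The rate comes out close to 0.05, as expected. If a journal publishes only the ‘significant’ studies, readers see a shelf of positive results drawn entirely from this noise.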
There are two possible solutions. Firstly, we could publicly register each study in advance. That way, we would not be relying on only the published studies. Instead of seeing just the one statistically significant result, we would see it alongside 19 that were not. Secondly, we could try to replicate the published study. In a world where abacuses have no effect, our replication is not likely to produce a statistically significant result.
Statistical significance matters in many educational interventions, particularly if we are doubtful about any likely effect. Take the example of the Philosophy for Children (P4C) study run by the Education Endowment Foundation. The authors claimed this showed that ‘philosophy’ lessons for primary school children, where they discussed issues such as whether it was okay to hit a teddy bear, had a positive impact on standardised reading and maths scores. In this case, the null hypothesis is highly plausible. I would suggest it is even likely. How could such lessons affect reading and maths scores? A p-value would tell us how likely our data would be if the null hypothesis were true. If the answer is ‘reasonably likely’ then I see no reason to reject the null hypothesis. No, we haven’t calculated a probability that the null hypothesis is true but we have made a reasonable inference based on the data. And yet the researchers did not calculate a p-value.
In the particular case of P4C there is a further problem. The measure used to determine the effect was not one that was registered in advance so, even if the researchers had calculated a p-value, we would need to bear that in mind. It is possible to look at data multiple different ways and then only report the measures that seem to show an effect; a phenomenon known as ‘p-hacking’. There are ways of adjusting p-values to take account of the use of multiple measures in this way and so these would need to be applied in the case of P4C.
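One common such adjustment is the Bonferroni correction, which simply multiplies each raw p-value by the number of measures examined. The p-values below are invented for illustration and have nothing to do with the actual P4C data; the point is that a result which looks significant on its own can fail to survive the correction.

```python
# Hypothetical raw p-values from five different outcome measures
# in a single study (illustrative numbers only).
raw_p = [0.03, 0.21, 0.48, 0.09, 0.62]

# Bonferroni correction: multiply each p-value by the number of
# comparisons, capping the result at 1.
adjusted = [min(round(p * len(raw_p), 3), 1.0) for p in raw_p]
print(adjusted)  # [0.15, 1.0, 1.0, 0.45, 1.0]
```

Note that the 0.03 which would have been reported as ‘significant’ becomes 0.15 once we account for the fact that five measures were inspected.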
P4C has now been massively scaled up on the basis of this one supposedly positive trial result. More schools will be introduced to it. Someone at your school may decide to introduce it in the belief that it is backed by strong evidence.
This could become a growing problem if p-value sceptics win the argument. Already, one quality journal has banned the use of p-values, and there are suggestions that this may be leading to more false positive results, i.e. we think there is an effect when there is none. If this trend continues, we could find more interventions making their way into the classroom based upon false positive results, and that would be a backward step.
Instead, we should keep calculating statistical significance. The p-value sceptics have probably done a good job of publicising misconceptions around the use of p-values and of raising researchers’ scepticism when confronted with dodgy conclusions based upon p-values alone. This heightened scepticism may actually make p-values more useful to us than similar measures that have attracted less publicity and that could therefore lead us into thinking we know things that we don’t.