# The plausibility test

As a teenager, I used to love the surrealist comedy of Vic Reeves and Bob Mortimer. In one of their sketches, Vic would demonstrate an invention such as the “home vibration cowboy”. Bob would ask, “How does it work?” and Vic would reply, “I don’t know but it does.”

This is how I feel when education interventions are tested in randomised controlled trials (RCTs) but without any plausible theoretical mechanism. Theory seems to be the aspect of research that the Education Endowment Foundation (EEF) are least interested in. Yet it is rarely possible to adopt any intervention wholesale and so teachers are interested in understanding the underlying mechanisms to help inform their own choices.

Plausible mechanisms also have implications for tests of statistical significance. Imagine we have an intervention that we want to test. Perhaps we ask students to write in red pen rather than blue pen and then measure their writing scores. The ‘null hypothesis’ is that this intervention has no effect – the colour of the pen makes no difference to the writing scores. So we conduct the study, randomly assign students to one of the two groups and compare the average writing scores. There will almost always be a difference between the two groups’ outcomes but how can we tell whether this difference is due to chance or whether it is because the pen colour caused it?

The conventional approach is to calculate a p-value. This is a probability. If the p-value is 5% then this means that, in a world where the null hypothesis is true, the chance of getting a difference in outcomes at least as large as the one we observed is 5%. This is not the same as saying that there is a 5% chance that the null hypothesis is true, although this is the way that many people wrongly interpret it. If you are not a mathematician then you might initially struggle to see the difference between these two statements.

But imagine if we already knew that the null hypothesis was very unlikely. Perhaps we teach one group of students about Rome and we compare them with a group who we have not taught about Rome on a test of knowledge of Rome. It seems very unlikely that this would not have an effect and this would make the p-value pretty meaningless – our world where the null hypothesis is true is already unlikely. Yet this would be a very strange thing to research.

Instead, we tend to research things because we are unsure of whether they are true. Imagine I conduct the pen test. Can you think of a plausible reason why it would affect writing scores? Maybe. Perhaps. But it is not certain. So I run the study and find that p is 20%. In other words, in a world where pen colour has no effect, there is a 20% chance of getting results at least as extreme as these. I would probably conclude that I haven’t really demonstrated any effect of pen colour.
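One way to make the idea of a p-value concrete is a permutation test: if pen colour really makes no difference, then the group labels are arbitrary, so we can shuffle them many times and count how often chance alone produces a gap at least as large as the one we saw. The sketch below is only an illustration. The function name `permutation_p_value` and the writing scores are made up by me, and a real trial would normally use a more carefully chosen test.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Approximate a two-sided p-value for a difference in mean scores.

    Under the null hypothesis the group labels carry no information,
    so we reshuffle them and see how often the shuffled gap is at
    least as large as the observed gap.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical writing scores, invented purely for illustration.
red_pen = [12, 15, 14, 10, 13, 16, 11, 14]
blue_pen = [11, 14, 13, 12, 10, 15, 12, 13]
print(permutation_p_value(red_pen, blue_pen))
```

The number printed is the share of shuffles that matched or beat the observed gap. It says nothing directly about how likely it is that pen colour has no effect.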

Now let’s turn to the Philosophy for Children trial conducted by the EEF. In this case, a literacy lesson was replaced with a ‘philosophy’ lesson where children debated things such as whether it is right to hit a teddy bear. These students were then compared with other students who carried on receiving a literacy lesson.

Let’s set aside for now the fact that this intervention showed no effect on the measures that were set out in advance by the researchers and accept the claim that students who received the intervention improved more in reading and maths than those who did not.

What is the possible mechanism? I think Dylan Wiliam has given the most plausible one, i.e. that it convinces children of the value of thinking hard about a problem. It may be the best candidate but it still seems pretty far-fetched to me. After all, it’s not clear how these lessons could increase knowledge of maths or of words and it is this kind of knowledge that is essential in maths and reading tests. Philosophy for Children is perhaps an educational “home vibration cowboy”.

If a test of statistical significance were run on these data and it came back with, say, a p-value of 20% then this would suggest that, in a world where Philosophy for Children had no effect on reading and maths, results at least this extreme would occur about a fifth of the time. I certainly would not be adopting the approach on this basis.
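The way plausibility and significance interact can be sketched with Bayes' rule. This is a simplification of my own rather than anything from the EEF report: it asks what a result that crosses a fixed significance threshold tells us, instead of working with an exact p-value, and it assumes a conventional threshold of 5% and a power of 80%. The function name and the prior probabilities are hypothetical, chosen only to show the pattern.

```python
def prob_null_given_significant(prior_effect, alpha=0.05, power=0.8):
    """Probability that the null is true given a 'significant' result.

    prior_effect: how plausible we thought a real effect was beforehand.
    alpha: chance of a significant result when the null is true.
    power: chance of a significant result when the effect is real.
    """
    if not 0.0 <= prior_effect <= 1.0:
        raise ValueError("prior_effect must be between 0 and 1")
    prior_null = 1.0 - prior_effect
    false_positive = alpha * prior_null
    true_positive = power * prior_effect
    return false_positive / (false_positive + true_positive)


# Illustrative priors only: a far-fetched mechanism, a coin flip,
# and something close to certain (teaching about Rome).
for prior in (0.1, 0.5, 0.99):
    print(prior, round(prob_null_given_significant(prior), 4))
```

With a far-fetched mechanism (a prior of 0.1) the null hypothesis still has a 36% chance of being true even after a significant result. When the effect was almost certain beforehand, the same result barely changes anything.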

However, as I have noted, the lead researcher argues that such a test would tell us nothing at all of value. There is an alternative – Bayesian analysis – that is worth discussing in its own right, but this isn’t offered either.

Another argument against stating p-values might be that the number of students who drop out of RCTs is so large that we cannot assume the final sample is random (something that the logic relies on). If, for instance, we found a statistically significant positive effect but we also knew that half of the trial schools dropped out because they were finding it difficult to implement, then we could not claim that the intervention had a positive effect. I still don’t see, however, why we would refuse to quote the p-value.
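A small simulation shows why non-random dropout is such a problem. Everything here is an assumption of mine for the sake of illustration: scores are drawn from a normal distribution with mean 100 and standard deviation 15, the intervention has no effect at all, and anyone in the intervention arm who scores below a threshold leaves the trial. The function `simulate_trial` is hypothetical and does not model any real study.

```python
import random
import statistics


def simulate_trial(n_per_arm=2000, dropout_threshold=None, seed=0):
    """Return the gap in mean scores (intervention minus control).

    Both arms are drawn from the same distribution, so any gap is
    due to chance or to who is left in the sample.
    """
    rng = random.Random(seed)
    control = [rng.gauss(100, 15) for _ in range(n_per_arm)]
    treated = [rng.gauss(100, 15) for _ in range(n_per_arm)]

    # Assumed dropout rule: low scorers leave the intervention arm only.
    if dropout_threshold is not None:
        treated = [score for score in treated if score >= dropout_threshold]

    return statistics.mean(treated) - statistics.mean(control)


print(simulate_trial())                       # everyone stays in
print(simulate_trial(dropout_threshold=100))  # roughly half drop out
```

In the second call the intervention appears to work, even though by construction it does nothing. The p-value for that comparison would look impressive, which is exactly why it needs to be read alongside what we know about how the trial was run.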

Stating the result of such a test only seems harmful if we think that a p-value is intended to replace human judgement about the trial and the way it was conducted. No number can replace human judgement but, used with caution, p-values can certainly inform it.