I am just starting out on my PhD study and one of the things that I need to come to understand is the use of statistics for analysing experimental data. I am therefore bound to make mistakes in this post as I try to tease out some of the issues. I would be most grateful if those more expert than me could take the time to point out these errors.
Also, probability more generally is famously fraught and counterintuitive so I won’t be offended if you point out any basic howlers, whether you know about p-values or not.
One of the things that I had intended to do once I had experimental results was to calculate a p-value. What I hadn’t been aware of is the controversy surrounding p-values and their validity. To understand this, we need to understand the ideas of the null and alternative hypotheses. If we were testing a drug, for instance, the null hypothesis would be that it has no effect. The alternative hypothesis might be that it has a positive effect.
If you have any familiarity with basic high-school science then you can probably see that there is an immediate problem. The value of any conclusion will depend on a well-designed experiment. In this instance, we would want to rule out the possibility of a placebo effect as another viable alternative hypothesis.
Setting this aside for now, imagine we run an experiment and get a set of data that perhaps shows a lower mean death rate or whatever for people who take the drug. The p-value tells us the probability of obtaining a data set at least as extreme as ours if the null hypothesis were true i.e. if the drug really had no effect. Different samples will always have different means and so our results could have come about by chance. In the language of probability, we have calculated the probability of our data (or something more extreme) given the null hypothesis, or Pr(data|null).
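To make Pr(data|null) a little more concrete, here is a minimal sketch of one way to estimate a p-value by simulation, a permutation test. It is not the only way to calculate a p-value, and the outcome scores below are invented purely for illustration. The idea is that if the drug had no effect, the group labels would be arbitrary, so we shuffle them many times and count how often chance alone produces a difference at least as large as the one we observed.

```python
import random
from statistics import mean

def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """One-sided p-value for 'treated has a lower mean than control'."""
    rng = random.Random(seed)
    observed = mean(control) - mean(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are interchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[n_treated:]) - mean(pooled[:n_treated])
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical outcome scores (lower is better); not real data.
treated = [12, 9, 11, 8, 10, 7, 9, 10]
control = [13, 11, 12, 10, 14, 9, 12, 11]
print(permutation_p_value(treated, control))
```

The number printed is the proportion of shuffles that look at least as favourable to the drug as the real data, which is an estimate of the p-value under these assumptions.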
There are many and various reasons why people dislike p-values but the most fundamental is that they are based upon a misconception about probability. Critics say that what we actually want is the probability of the null hypothesis given the data i.e. Pr(null|data). This is not the same thing as the probability of the data given the null hypothesis, Pr(data|null).
In fact, I teach this notion to my high school maths students. The common textbook trope is to consider a diagnostic test that is not 100% accurate and a disease with a low prevalence in the population. It can be quite simply demonstrated that the probability of having the disease given a positive result on the test is not the same as the probability of a positive result on the test given the fact that somebody has the disease.
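The textbook example can be worked through in a few lines. This is a sketch using Bayes' theorem with numbers I have made up: a disease with 1% prevalence and a test that is right 95% of the time for both sick and healthy people.

```python
def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """Bayes' theorem: Pr(disease | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Assumed, illustrative figures only.
posterior = prob_disease_given_positive(
    prevalence=0.01, sensitivity=0.95, specificity=0.95
)
print("Pr(positive | disease) = 0.95")
print(f"Pr(disease | positive) = {posterior:.3f}")
```

With these figures, Pr(disease|positive) comes out at roughly 16%, even though Pr(positive|disease) is 95%. Most positive results come from the much larger pool of healthy people.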
So this is something of a coup de grace for p-values.
How do people use p-values?
Or maybe not.
It might be worth conducting a few little thought experiments here. I had never actually considered the p-value as an estimate of the probability of the null hypothesis given the data. I had tended to think of it as something that would inform my judgement about that, alongside other factors.
For instance, imagine a typical cognitive load type experiment. We take a sample of students and randomise them into two groups. The first group receives written instruction – perhaps a worked example or something – whilst the second group gets the same written instruction but also a simultaneous audio track of someone reading it out. The students subsequently complete some kind of post-test.
The null hypothesis here would be that there is no effect of being assigned to either condition. Which is pretty plausible. An alternative hypothesis might be that both reading and hearing simultaneously will overload the students in some way and so the performance of the second group will be worse than the first. Imagine that we do indeed find that this is the case and we calculate a p-value of 0.05.
This tells us that, in a world where the null hypothesis is true, the likelihood of getting a data set like ours is 5%. If we replicated this experiment a number of times and got similar results then these results would not seem very consistent with such a world. Provided the conditions are the same, replication and simply conducting a larger experiment are much the same mathematically and so we could, if we wished, normalise and pool our data and try to estimate an overall p-value which would be shrinking in this case. This ‘increasingly inconsistent’ finding might affect our more qualitative view of the likelihood of the null hypothesis being true.
Of course, our original 5% value is not the probability of the null hypothesis being true. This would be committing the fallacy described above.
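The way pooling replications shrinks the p-value can be sketched as follows. This is a deliberately simplified illustration and rests on assumptions of my own: a one-sided z-test, a known standard deviation of 10 marks, 30 students per group in each experiment, and every replication showing the same 4.25-mark difference. Real replications would not be this tidy.

```python
from math import erfc, sqrt

def one_sided_p(mean_diff, sd, n_per_group):
    """Upper-tail p-value for a difference in means with known sd (z-test)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    # Pr(Z > z) for a standard normal variable.
    return 0.5 * erfc(z / sqrt(2))

# Pool 1, 2, 3 and 4 identical experiments of 30 students per group.
for replications in (1, 2, 3, 4):
    n = 30 * replications
    p = one_sided_p(mean_diff=4.25, sd=10.0, n_per_group=n)
    print(f"{replications} experiment(s), n = {n} per group, p = {p:.4f}")
```

A single experiment gives a p-value of roughly 0.05 under these assumptions, and the pooled p-value falls with each added replication because the standard error shrinks while the observed difference stays put.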
Learning what you teach them
Now let’s consider a stranger experiment. We randomise students who have never learnt about the ancient Egyptians into two groups. We teach one group about the ancient Egyptians whilst teaching the other group nothing at all. We then test them on knowledge of the ancient Egyptians.
The null hypothesis would be that teaching students about the ancient Egyptians has no effect on their knowledge of the ancient Egyptians. Imagine that we find that the students taught about the ancient Egyptians knew more about them with a p-value of 0.05. This means that in a world where teaching children about something does not affect what they know about it the chance of getting our data set is 5%. However, we might conclude that such a world is highly unlikely and so this 5% value doesn’t really tell us much at all.
The thing is, it would be an odd experiment to conduct because we already kind-of know the answer.
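The role that prior plausibility plays here can be put into rough numbers. The sketch below is my own simplification: it asks how likely the null hypothesis is given only that a result was significant at the 5% level (not given the exact p-value), and it assumes the experiment has an 80% chance of detecting a real effect. The prior probabilities are guesses chosen for illustration.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.8):
    """Pr(null | significant result), by Bayes' theorem."""
    false_alarm = alpha * prior_null          # null true, but significant anyway
    true_detection = power * (1 - prior_null)  # real effect, and detected
    return false_alarm / (false_alarm + true_detection)

# Assumed priors: one null we hardly believe, one we find quite plausible.
for label, prior_null in [("implausible null", 0.001), ("plausible null", 0.9)]:
    posterior = prob_null_given_significant(prior_null)
    print(f"{label}: prior = {prior_null}, posterior = {posterior:.4f}")
```

When the null starts out wildly implausible, a significant result barely changes anything because we hardly believed the null to begin with. When the null starts out at 90%, the same significant result still leaves it at about 36% under these assumptions, which is a long way from 5%.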
Philosophy for children
Finally, let’s consider a real experiment which has shaped my thinking on this. The Education Endowment Foundation (EEF) conducted a randomised controlled trial (RCT) of a course known as ‘philosophy for children’. There has been much comment on this with some people claiming that it demonstrates regression to the mean. However, let’s set this aside and take the experiment at face value.
Children were randomised into two groups, one of which received a philosophy course. Students in the philosophy group saw greater gains on their mathematics performance (and other measures) than students in the control group.
No p-value was calculated because the lead researcher, Stephen Gorard, is a prominent critic of them.
The null hypothesis here would be that the philosophy course has no effect on children’s mathematics performance. To me, this seems highly plausible. It is hard enough to achieve transfer within educational domains let alone across them. Therefore, I think that a p-value would have added information here. If we knew, for instance, that in a world where philosophy courses do not affect mathematics achievement the likelihood of getting this set of data is, say, 20%, then I would be inclined to put it down to chance and ask for more replications.