Randomised controlled trials (RCTs) are a scientific approach to researching cause and effect in medicine and, less commonly, education. The idea is to randomly assign test subjects to one of two or more conditions. In education, our test subjects would usually be students, although they could be randomised at a number of levels. If you have enough schools, you could randomise some schools into one condition and other schools into a different condition. These are the kinds of large-scale RCTs that the Education Endowment Foundation (EEF) in England carries out and that Evidence for Learning (E4L) conducts in Australia. In my own research, I randomise individual students into one of two conditions. This enables me to run far smaller RCTs.
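To make the student-level version concrete, here is a rough sketch in Python of how such an allocation might be done. The class list and the condition labels are invented for the example; a real trial would also involve consent, baseline measures and so on.

```python
import random

# Hypothetical class list; in a real trial this would be the enrolled students.
students = ["Ava", "Ben", "Chloe", "Dev", "Elif", "Finn", "Grace", "Hugo"]

random.seed(42)           # fix the seed so the allocation can be reproduced
random.shuffle(students)  # shuffle, then split the list in half

half = len(students) // 2
allocation = {
    "intervention": students[:half],
    "control": students[half:],
}

for condition, names in allocation.items():
    print(condition, names)
```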
The conditions should represent the thing you’re interested in testing. For instance, you might want to give your experimental group a differentiated instruction programme that you have developed.
One of the conditions should be a control group and this is where the devil often hides. If your control condition is rubbish – for instance, if it doesn’t consist of teaching children anything relevant to your outcome assessment – then you are likely to get a positive result for your experiment, regardless. Even a ‘business-as-usual’ control condition may fall short because, by contrast, the intervention is a novelty, and novelty alone can generate a placebo effect.
So the control needs to be ‘active’, representing a good alternative. Few education studies use an active control and many introduce additional factors that confound the result. That’s why, although it is not a magic bullet, I like the idea of having two experimental conditions: each then serves as an active comparison for the other. If they both outperform the control then we can look at which one outperforms it the most.
But this leads to a crucial, almost philosophical, question: How do we decide if an experimental condition has outperformed a control condition (or any alternative experimental condition)? At first sight, the answer seems pretty simple – check to see which group gets the best results on the outcome assessment. And it is perhaps that simple, provided that you repeat your experiment plenty of times and always get the same result. Why would we need to repeat? The difference could have arisen by chance. It is, after all, unlikely that any two mean scores from different groups of students will be exactly the same. However, it’s not always practical or desirable to keep running the same experiment, even if a few more replications of this kind would be a good thing for the field.
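To see how easily a difference between two means can arise by chance, consider a quick simulation: both ‘groups’ below are drawn from exactly the same population of test scores, so there is no real effect, yet the means almost never match. The numbers are purely illustrative.

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

# Both groups are sampled from the same population (mean 50, SD 10),
# so any difference between their means is pure chance.
group_a = [random.gauss(50, 10) for _ in range(30)]
group_b = [random.gauss(50, 10) for _ in range(30)]

print(f"Group A mean: {mean(group_a):.1f}")
print(f"Group B mean: {mean(group_b):.1f}")
print(f"Difference:   {mean(group_a) - mean(group_b):.1f}")
```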
This is where p-values come in. A p-value is a probability. In null hypothesis testing, it is the probability of obtaining a difference between the means as large as the one the experimenters observed, or even larger, if the intervention really had no effect compared with the control. If that sounds like quite a useful thing to calculate then that’s because it is. It can help inform your decision as to whether there is an effect.
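One way to make that definition concrete is a permutation test, which estimates the p-value directly: if the intervention has no effect, the group labels are arbitrary, so we can shuffle them many times and ask how often a shuffled difference is at least as large as the observed one. This is only one of several ways to get a p-value (a t-test is more common) and the scores below are invented.

```python
import random

random.seed(7)

def mean(xs):
    return sum(xs) / len(xs)

# Invented outcome-assessment scores for the two groups.
intervention = [62, 58, 71, 65, 60, 68, 74, 63, 59, 70]
control      = [55, 61, 57, 64, 52, 60, 66, 58, 54, 62]

observed_diff = mean(intervention) - mean(control)

# Shuffle the labels many times; count how often the shuffled difference
# is at least as large (in either direction) as the one we observed.
pooled = intervention + control
n = len(intervention)
trials = 10_000
at_least_as_large = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:n]) - mean(pooled[n:])
    if abs(diff) >= abs(observed_diff):
        at_least_as_large += 1

p_value = at_least_as_large / trials
print(f"Observed difference: {observed_diff:.2f}")
print(f"Estimated p-value:   {p_value:.3f}")
```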
Another way to think about it is as an error rate. If you set your acceptable p-value at 0.05 (i.e. 1 in 20) then this means that you will end up falsely declaring a positive result in one in 20 of your experiments in which there really was no effect. Add in replication and this seems like a reasonable way to proceed. If you want to, you can set a lower threshold such as 0.01 or 0.005. However, finding effects at these stricter thresholds requires very large groups of subjects or very large differences between the groups, both of which present their own problems.
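That ‘one in 20’ error rate can be checked by simulation: run lots of experiments in which the null hypothesis is true by construction, test each at p < 0.05, and count how often a ‘positive’ result appears anyway. The sketch below uses a standard two-sample t-test from SciPy; the group sizes and score distribution are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same population, so the null hypothesis is true.
    group_a = rng.normal(50, 10, size=30)
    group_b = rng.normal(50, 10, size=30)
    if ttest_ind(group_a, group_b).pvalue < alpha:
        false_positives += 1

# Should come out close to 0.05, i.e. roughly one experiment in 20.
print(f"False positive rate: {false_positives / n_experiments:.3f}")
```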
Of course, a p-value is not everything. It won’t fix the problem of a crappy control. And if you run 20 experiments or, more commonly, measure 20 different things in one experiment and then only report one p-value for one measure, you could well be reporting a false positive. This is why many researchers are now calling for trials, and the measures they intend to report at the end of those trials, to be registered in advance.
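The arithmetic behind that multiple-measures problem is unforgiving. With 20 independent measures and no real effects anywhere, the chance that at least one of them comes out ‘significant’ at p < 0.05 is about 1 - 0.95^20, or roughly 64%. The simulation below illustrates this; it treats the measures as independent, which real outcome measures often are not, and all the numbers are invented.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_measures = 20
alpha = 0.05
n_trials = 2000
trials_with_false_positive = 0

for _ in range(n_trials):
    # One trial with 20 outcome measures, none of which has a real effect.
    any_significant = False
    for _ in range(n_measures):
        a = rng.normal(50, 10, size=30)
        b = rng.normal(50, 10, size=30)
        if ttest_ind(a, b).pvalue < alpha:
            any_significant = True
    if any_significant:
        trials_with_false_positive += 1

# Expect roughly 64% of trials to report at least one spurious 'effect'.
print(f"Trials with a false positive: {trials_with_false_positive / n_trials:.2f}")
```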
There is also a group of researchers who suggest that there is a general misunderstanding of what p-values represent. They argue that p-values are widely read as the probability that the null hypothesis is true, when that is not what they are: a p-value tells you how probable the data are if the null hypothesis is true, not how probable the null hypothesis is given the data.
This has caused some journals to go as far as banning p-values altogether. But this raises the question of what to replace them with. Without a good answer, the likely result is more false positives. That hardly seems like an advance.