I am just starting out on my PhD study and one of the things that I need to come to understand is the use of statistics for analysing experimental data. I am therefore bound to make mistakes in this post as I try to tease out some of the issues. I would be most grateful if those more expert than me could take the time to point out these errors.

Also, probability more generally is famously fraught and counterintuitive so I won’t be offended if you point out any basic howlers, whether you know about p-values or not.

**p-values**

One of the things that I had intended to do once I had experimental results was to calculate a p-value. What I hadn’t been aware of is the controversy surrounding p-values and their validity. To understand this, we need to understand the ideas of the null and alternative hypotheses. If we were testing a drug, for instance, the null hypothesis would be that it has no effect. The alternative hypothesis might be that it has a positive effect.

If you have any familiarity with basic high-school science then you can probably see that there is an immediate problem. The value of any conclusion will depend on a well-designed experiment. In this instance, we would want to rule out the possibility of a placebo effect as another viable alternative hypothesis.

Setting this aside for now, imagine we run an experiment and get a set of data that shows, say, a lower mean death rate for people who take the drug. The p-value tells us the probability of obtaining a data set at least as extreme as ours if the null hypothesis were true, i.e. if the drug really had no effect. Different samples will always have different means and so our results could have come about by chance. In the language of probability, we have calculated the probability of our data *given* the null hypothesis, or Pr(data|null).
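To make this concrete, here is a minimal sketch of one way to estimate such a p-value: a permutation test. The outcome data are invented for illustration, and `permutation_p_value` is a hypothetical helper name, not something from a statistics library. The idea is that if the drug does nothing, the group labels are arbitrary, so we shuffle them many times and count how often chance alone produces a gap at least as large as the one we saw.

```python
import random

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """One-sided p-value: the share of label shuffles whose gap in means
    (control minus treated) is at least as large as the observed gap."""
    rng = random.Random(seed)
    observed = sum(control) / len(control) - sum(treated) / len(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # under the null, group labels carry no information
        fake_treated = pooled[:n_treated]
        fake_control = pooled[n_treated:]
        gap = (sum(fake_control) / len(fake_control)
               - sum(fake_treated) / len(fake_treated))
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Invented data: 1 = died, 0 = survived (20 people per group).
treated = [0] * 18 + [1] * 2   # 10% death rate on the drug
control = [0] * 13 + [1] * 7   # 35% death rate without it

print(permutation_p_value(treated, control))
```

Note that the function never asks how likely the null hypothesis is. It only asks how often a world with no drug effect would produce data like ours, which is exactly Pr(data|null).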

**Controversy**

There are many and various reasons why people dislike p-values but the most fundamental is that they are based upon a misconception about probability. Critics say that what we *actually* want is the probability of the null hypothesis *given* the data i.e. Pr(null|data). This is *not* the same thing as the probability of the data *given* the null hypothesis, Pr(data|null).

In fact, I teach this notion to my high school maths students. The common textbook trope is to consider a diagnostic test that is not 100% accurate and a disease with a low prevalence in the population. It can be quite simply demonstrated that the probability of having the disease given a positive result on the test is not the same as the probability of a positive result on the test given the fact that somebody has the disease.
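The textbook calculation can be sketched in a few lines. The prevalence and accuracy figures below are made up for illustration, and `prob_disease_given_positive` is just a name I have chosen; the arithmetic is Bayes' theorem.

```python
def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """Pr(disease | positive test) via Bayes' theorem.

    sensitivity = Pr(positive | disease)
    specificity = Pr(negative | no disease)
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Invented numbers: 1% of people have the disease and the test is
# right 95% of the time for both sick and healthy people.
print(prob_disease_given_positive(0.01, 0.95, 0.95))
```

With these numbers, Pr(positive|disease) is 95% but Pr(disease|positive) comes out at roughly 16%, because the healthy majority generates far more false positives than the sick minority generates true ones.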

So this is something of a coup de grace for p-values.

**How do people use p-values?**

Or maybe not.

It might be worth conducting a few little thought experiments here. I had never actually considered the p-value as an estimate of the probability of the null hypothesis given the data. I had tended to think of it as something that would inform my judgement about that, alongside other factors.

For instance, imagine a typical cognitive load type experiment. We take a sample of students and randomise them into two groups. The first group receives written instruction – perhaps a worked example or something – whilst the second group gets the same written instruction but also a simultaneous audio track of someone reading it out. The students subsequently complete some kind of post-test.

The null hypothesis here would be that there is no effect of being assigned to either condition. Which is pretty plausible. An alternative hypothesis might be that both reading and hearing simultaneously will overload the students in some way and so the performance of the second group will be worse than the first. Imagine that we do indeed find that this is the case and we calculate a p-value of 0.05.

This tells us that, in a world where the null hypothesis is true, the likelihood of getting a data set like ours is 5%. If we replicated this experiment a number of times and got similar results then these results would not seem very consistent with such a world. Replication and simply conducting a larger experiment are mathematically the same and so we could, if we wished, normalise our data and try to estimate an overall p-value which would be shrinking in this case. This ‘increasingly inconsistent’ finding might affect our more qualitative view of the likelihood of the null hypothesis being true.
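A rough sketch of that shrinking, under some strong assumptions that are mine rather than anything from a real study: every replication has the same group size and shows the same standardised effect (an invented 0.37 standard deviations), the spread of scores is known, and a normal approximation holds. `pooled_p_value` is a hypothetical helper.

```python
import math

def pooled_p_value(effect_size, n_per_group, n_replications=1):
    """One-sided p-value after pooling identical replications.

    effect_size is the gap between group means in standard deviations.
    Pooling k replications is treated as one experiment k times as big.
    """
    total_per_group = n_per_group * n_replications
    # z-statistic for a difference of two means with known spread
    z = effect_size * math.sqrt(total_per_group / 2)
    # upper-tail area of the standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(k, pooled_p_value(0.37, 40, k))
```

A single experiment with 40 students per group lands near p = 0.05, and each additional replication showing the same effect pushes the pooled p-value lower. Real replications will not agree this neatly, so treat this as an illustration of the direction of travel only.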

Of course, our original 5% value is not the probability of the null hypothesis being true. This would be committing the fallacy described above.

**Learning what you teach them**

Now let’s consider a stranger experiment. We randomise students who have never learnt about the ancient Egyptians into two groups. We teach one group about the ancient Egyptians whilst teaching the other group nothing at all. We then test them on knowledge of the ancient Egyptians.

The null hypothesis would be that teaching students about the ancient Egyptians has no effect on their knowledge of the ancient Egyptians. Imagine that we find that the students taught about the ancient Egyptians knew more about them with a p-value of 0.05. This means that *in a world where teaching children about something does not affect what they know about it* the chance of getting our data set is 5%. However, we might conclude that such a world is highly unlikely and so this 5% value doesn’t really tell us much at all.

The thing is, it would be an odd experiment to conduct because we already kind-of know the answer.

**Philosophy for children**

Finally, let’s consider a real experiment which has shaped my thinking on this. EEF conducted an RCT of a course known as ‘philosophy for children’. There has been much comment on this with some people claiming that it demonstrates regression to the mean. However, let’s set this aside and take the experiment at face value.

Children were randomised into two groups, one of which received a philosophy course. Students in the philosophy group saw greater gains in their mathematics performance (and other measures) than students in the control group.

No p-value was calculated because the lead researcher, Stephen Gorard, is a prominent critic of them.

The null hypothesis here would be that the philosophy course has no effect on children’s mathematics performance. To me, this seems highly plausible. It is hard enough to achieve transfer *within* educational domains let alone *across* them. Therefore, I think that a p-value would have added information here. If we knew, for instance, that in a world where philosophy courses *do not* affect mathematics achievement the likelihood of getting this set of data is, say, 20%, then I would be inclined to put it down to chance and ask for more replications.

I’m really glad you have brought this up as not only does the issue of the p value bother me in terms of how it is used or not used, there is also a serious issue following on from your analysis, namely the way data is presented.

I have seen a lot of mixed p value data tables (number of stars indicating the different p-values), which is something I have always been told is a ‘no, no’ as it’s a dishonest way of presenting data (on top of any other research design flaws, etc which may affect it). Ultimately, it is about encouraging people, especially lay people, to make comparisons which are not true.

I showed a couple of these tables to the other half, who is a physicist, to get his take on it, as it has been a while since I studied or analysed a dataset. He agreed that there is no way that he could present results in that manner and get published!!

So the question is: why is it that I was taught this during A-Levels in psychology and during my quantitative methods course for my Master’s, and the other half was told this pretty much from the age of 18 onwards, yet we are seeing this in education data? More to the point, how is it that researchers are not pulled up for this?

It is yet another reason that I feel that some education research is deeply flawed and those conducting it seem to have a very different idea and possibly a lower standard in terms of how they go about the research and presenting it.

I think your interpretation is right. Low p-values are “necessary but not sufficient” evidence for the alternate hypothesis (the intervention working). It’s a sniff test.

If you have a high p-value, as you say, we are suspicious that the result might be entirely due to chance. If you have a low p-value, this doesn’t tell us that the alternate is true – it might still be chance, there might be flaws in the experimental design including confounders or measurement errors, there might have been p-hacking.

I’ve been wading through Gorard’s work this week (including that response by Neale that *does* argue you can use low p-values as evidence for the alternate, which I haven’t got my head around). Regarding that EEF study, I think he’s worried that if you use a canary to test the air in a coal mine, people feel safe when the canary is fine but they shouldn’t be (or worse, they’re hanging canaries in office buildings to “test” air quality there). If I understand you correctly, you’re saying “yeah but if you take out the canary to stop overconfidence or abuse, we now miss the information the canary used to give us”.

And this is a completely open debate, waged in far higher circles than educational science, though the literature that I’ve read this week so far (on both sides) seems to be vague on outside references, e.g. https://www.sciencenews.org/blog/context/p-value-ban-small-step-journal-giant-leap-science

You’re writing as though there is a generally accepted controversy with the use of p values. There is no such controversy in the world of Maths or Science. The use of p values is nearly a century old and is widely accepted, taught in every Maths undergraduate course, written in all Statistics text-books. It was recently used to prove the existence of the Higgs Boson particle. There is no discussion in Mathematical circles about whether it is correct or not.

The only place this discussion is taking place is among a few Educationalists and Psychologists. They usually have Psychology degrees and no training in Statistics; Stephen Gorard is a classic example.

The closest analogy I can get to in Science is if a group of Educationalists with Psychology degrees started to criticise Einstein’s Theory of Relativity and say it was wrong. We would be bent double laughing at their arrogance and stupidity. This is what should be occurring here.

Agree with larrylemonmaths, there’s no serious controversy here. As criticalnumeracy says “low p-values are “necessary but not sufficient” evidence for the alternate hypothesis (the intervention working)”.

We believe the Higgs boson exists because the p-value was low, AND there was a sensible underlying theory AND the CERN experiment was well designed.

If a low p-value had been obtained for the effect of “philosophy for children” on maths achievement, sensible people would remain sceptical because the effect is implausible and the experiment was badly designed and analysed (regression to mean, loss to follow-up, multiple outcomes etc.).

More here. http://ripe-tomato.org/2015/07/14/teaching-philosophy-in-primary-schools/ and here http://ripe-tomato.org/2015/07/21/more-philosophy-for-children/

Surely there’s a better argument for something to be established and uncontroversial. I can think of a number of educational theories that are “nearly a century old and [are] widely accepted, taught in every … undergraduate course, written in all [the] text-books” that I wouldn’t give the time of day to.

I think it’s fair to say p-values are controversial due to the recent p-hacking scandals. It’s clear that through deliberate or accidental experimental design and the process of how studies are published or not (“file drawer effect”), reported p-values are an underestimate for the probability that results are due to chance under the null hypothesis. We’re still at “necessary but not sufficient” – if your p-values aren’t significant, don’t expect me to take your results seriously, but even if you do have significant p-values, I’m still skeptical. I personally wouldn’t throw the canary out with the bathwater, I still want to be able to reject your results if they aren’t significant, but I sympathise with the insurgent campaign to uproot the common fallacy that a low p-value means you can place faith in a study’s results not being the result of randomness.
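Here is a sketch of the crudest version of the problem: measure several outcomes and report whichever one comes out best. It assumes the outcomes are independent and that each p-value is uniformly distributed when the null is true; both are simplifications, and the forking-paths argument is subtler than this. The function names are my own.

```python
import random

def chance_of_false_positive(n_outcomes, alpha=0.05):
    """Probability that at least one of n independent null tests
    comes out below alpha purely by chance."""
    return 1 - (1 - alpha) ** n_outcomes

def simulate_best_p(n_outcomes, n_studies=20_000, alpha=0.05, seed=0):
    """Share of simulated null studies whose smallest p-value beats alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under the null, each p-value is a uniform draw between 0 and 1.
        p_values = [rng.random() for _ in range(n_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies

for k in (1, 5, 10, 20):
    print(k, chance_of_false_positive(k), simulate_best_p(k))
```

With ten outcomes and no real effect anywhere, the chance of at least one "significant" result is about 40%, not 5%, which is why a headline p-value on its own says little unless you know how many comparisons sat behind it.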

(Greg – you should read Andrew Gelman’s piece on the garden of forking paths to see how well-intentioned and non-fraudulent studies can still p-hack http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf.)

The significance level of 95% is also well known to be completely arbitrary and potentially inappropriate. Note that the Higgs boson was found at 5 sigma, not 2 sigma (i.e. roughly 95%): 5 sigma corresponds to a result arising by chance about 1 time in 3.5 million, whereas 95% is 1 in 20. So there’s a debate about what constitutes an appropriate threshold, in light of the p-hacking which drives the actual chance up.
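The sigma figures can be checked with the standard normal tail formula. This sketch assumes the test statistic is normally distributed; the helper names are mine. Particle physics conventionally quotes 5 sigma as a one-sided tail, while the usual 95% level is two-sided.

```python
import math

def one_sided_tail(sigma):
    """Chance of a standard normal value landing above +sigma."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

def two_sided_tail(sigma):
    """Chance of landing more than sigma away from the mean, either side."""
    return math.erfc(sigma / math.sqrt(2))

print("5 sigma, one-sided: 1 in", round(1 / one_sided_tail(5)))
print("2 sigma, two-sided: 1 in", round(1 / two_sided_tail(2)))
print("1.96 sigma, two-sided:", two_sided_tail(1.96))
```

The one-sided 5 sigma tail works out at roughly 1 in 3.5 million, exactly 2 sigma two-sided is about 1 in 22, and it is 1.96 sigma that gives the familiar 1 in 20.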

There’s no controversy in how it’s calculated; the controversy is over its use, abuse and misinterpretation.

What you are trying to do here is reinvent Bayesian statistics. This is a useful introduction: http://www.yudkowsky.net/rational/bayes