To many readers, it probably seems arcane to argue about statistical tests. And yet such tests, through their role in research, have a material impact on the work of teachers.

A p-value is a probability. In fact it’s a special kind of probability known as a ‘conditional probability’ that depends on certain assumptions. Imagine, for instance, that we run a study where we randomly allocate students to one of two groups. The first group is taught maths in the standard way and the second group is given an abacus to use, while everything else about the teaching is kept the same. After the teaching section of the study, we test the students and find that the ones using the abacus get a higher average score. It is possible that the abacus has not actually made any difference (this is known as the ‘null hypothesis’) and we just happened to allocate a slightly more able mix of students to the abacus group, or perhaps the test is multiple choice and the students in the abacus group just got lucky when selecting answers.

In this case, we can ask: ‘if the null hypothesis were true, what would be the probability of gathering data as extreme or more extreme than the data we have gathered?’ This calculation provides a ‘p-value’: a test of statistical significance that is the cause of much controversy in some research circles.
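To make the ‘as extreme or more extreme’ calculation concrete, it can be carried out directly with a permutation test. The following sketch uses made-up scores for the two groups; all numbers are hypothetical and purely for illustration:

```python
import random

# Hypothetical test scores for the two groups in the abacus study
# (made-up numbers, purely for illustration).
standard = [52, 61, 48, 57, 55, 60, 49, 58]
abacus = [59, 63, 54, 66, 58, 62, 57, 65]

observed_diff = sum(abacus) / len(abacus) - sum(standard) / len(standard)

# Under the null hypothesis the group labels are arbitrary, so we can
# shuffle them repeatedly and see how often a difference at least as
# large as the observed one arises by chance alone.
pooled = standard + abacus
random.seed(1)
n_extreme, n_trials = 0, 10_000
for _ in range(n_trials):
    random.shuffle(pooled)
    fake_abacus = pooled[:len(abacus)]
    fake_standard = pooled[len(abacus):]
    diff = sum(fake_abacus) / len(fake_abacus) - sum(fake_standard) / len(fake_standard)
    if diff >= observed_diff:  # one-sided: 'as extreme or more extreme'
        n_extreme += 1

p_value = n_extreme / n_trials
print(f"observed difference: {observed_diff:.2f}, p = {p_value:.3f}")
```

A small p-value here says only that a gap this large would rarely arise from random allocation alone; it says nothing about why the gap exists.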

What the p-value does *not* tell us is the probability that the null hypothesis is true, because we have already assumed that it is. It also does not tell us the overall probability of obtaining the data that we have obtained, because we have restricted our calculation to a condition where the null hypothesis is true. So we cannot assign probabilities in a general sense (an alternative approach attempts to plug this gap by essentially guessing some of the factors needed to calculate these probabilities but I’m not convinced that this is much of an improvement).

When I talk about p-values, I am often told by p-value sceptics that I am mistaken. I don’t understand conditional probability, they may claim. They may repeat the point that I have made above: that the p-value does not tell us the probability that the null hypothesis is true. They may suggest I read more so that I will understand better.

This is a classic ‘straw man’ argument in that it opposes claims that I have not made and, as such, it becomes increasingly surreal to attempt to address it.

Ultimately, I think that sceptics simply believe that p-values don’t tell us very much and, given that they are prone to misinterpretation, we should do away with them. I disagree.

Used with appropriate caution, they can be helpful. And I will expand on that shortly. But first, let’s look at what they cannot do.

Returning to the abacus study, p-values will not tell us whether our experiment was well designed. If there is an effect, it is possible that the abacus acted as a placebo rather than imparting anything instructionally useful. If we assigned the two groups to two different teachers then any effect might be due to the teachers rather than the abacus. If there was no effect, this might be because students never even picked up the abacus in class or because teachers didn’t use it effectively. There are lots of potential issues that p-values cannot help us with.

Moreover, imagine a world in which abacuses definitely have no impact on maths teaching, however well they are used. Imagine that lots of different research groups run studies in this world to assess the effect of abacuses. Roughly 1 in every 20 such studies should result in a p-value less than .05. If these are the only studies that get published then we have a problem.

There are two possible solutions. Firstly, we could publicly register each study in advance. That way, we would not be relying on only the published studies. Instead of seeing just the one statistically significant result, we would see it alongside 19 that were not. Secondly, we could try to replicate the published study. In a world where abacuses have no effect, our replication is not likely to produce a statistically significant result.
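The 1-in-20 figure is easy to check by simulation. A sketch of a world in which the abacus truly has no effect; the score distribution, sample sizes and crude z-test are all my own assumptions for illustration:

```python
import random

random.seed(0)

def null_study(n_per_group=30, sd=10.0):
    """Simulate one study in a world where the abacus has no effect:
    both groups are drawn from the same score distribution."""
    a = [random.gauss(50, sd) for _ in range(n_per_group)]
    b = [random.gauss(50, sd) for _ in range(n_per_group)]
    mean_a = sum(a) / n_per_group
    mean_b = sum(b) / n_per_group
    # Crude two-sided z-test, treating the standard deviation as known.
    se = (sd**2 / n_per_group + sd**2 / n_per_group) ** 0.5
    z = abs(mean_a - mean_b) / se
    return z > 1.96  # roughly p < .05

# Run many such null studies and count how often 'significance' appears anyway.
n_studies = 20_000
false_positives = sum(null_study() for _ in range(n_studies))
print(f"fraction 'significant': {false_positives / n_studies:.3f}")  # close to 1 in 20
```

If only the ‘significant’ runs were ever published, a reader would see a literature full of apparent abacus effects in a world where there are none.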

Statistical significance matters in many educational interventions, particularly if we are doubtful about any likely effect. Take the example of the Philosophy for Children (P4C) study run by the Education Endowment Foundation. The authors claimed this showed that ‘philosophy’ lessons for primary school children, where they discussed issues such as whether it was okay to hit a teddy bear, had a positive impact on standardised reading and maths scores. In this case, the null hypothesis is *highly plausible*. I would suggest it is even likely. How could such lessons affect reading and maths scores? A p-value would tell us how likely our data would be if the null hypothesis were true. If the answer is ‘reasonably likely’ then I see no reason to reject the null hypothesis. No, we haven’t calculated a probability that the null hypothesis is true but we have made a reasonable inference based on the data. And yet the researchers did not calculate a p-value.

In the particular case of P4C there is a further problem. The measure used to determine the effect was not one that was registered in advance so, even if the researchers had calculated a p-value, we would need to bear that in mind. It is possible to look at data multiple different ways and then only report the measures that seem to show an effect; a phenomenon known as ‘p-hacking’. There are ways of adjusting p-values to take account of the use of multiple measures in this way and so these would need to be applied in the case of P4C.
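One standard adjustment of the kind mentioned above is the Bonferroni correction: multiply each raw p-value by the number of measures examined. A minimal sketch with five hypothetical raw p-values (the numbers are invented):

```python
# Five outcome measures were examined, but perhaps only the 'best' ones
# reported. The Bonferroni correction multiplies each raw p-value by the
# number of comparisons (capped at 1), restoring the intended false
# positive rate across the whole family of tests.
raw_p_values = [0.04, 0.20, 0.62, 0.03, 0.51]  # hypothetical raw values
k = len(raw_p_values)
adjusted = [min(1.0, p * k) for p in raw_p_values]

for raw, adj in zip(raw_p_values, adjusted):
    verdict = "significant" if adj < 0.05 else "not significant"
    print(f"raw p = {raw:.2f} -> adjusted p = {adj:.2f} ({verdict})")
```

Note that the two measures that looked significant on their own (p = .04 and p = .03) no longer survive once the correction is applied.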

P4C has now been massively scaled up on the basis of this one supposedly positive trial result. More schools will be introduced to it. Someone at your school may decide to introduce it in the belief that it is backed by strong evidence.

This could become a growing problem if p-value sceptics win the argument. Already, one quality journal has banned the use of p-values and there are suggestions that this may be leading to more false positive results, i.e. concluding there is an effect when there is not. If this trend continues, we could find more interventions making their way into the classroom based upon false positive results, and that would be a backward step.

Instead, we should keep calculating statistical significance. The p-value sceptics have probably done a good job in publicising misconceptions around their use and raising researchers’ scepticism when confronted with dodgy conclusions based upon p-values alone. This heightened scepticism may actually make them more useful to us than similar measures that have attracted less publicity and that could therefore lead us into thinking we know things that we don’t.

Many of the ideas and techniques in inferential statistics, such as the use of p-values, are contested, e.g. by educational researcher and sociologist Stephen Gorard (2014).

I have two ‘identities’ here: as an educational researcher I am concerned about the way in which statistics are used, but it also concerns the material I actually teach, namely Mathematics and Statistics for Business School students. What I teach students is that the p-value is not wrong in itself: it is the conditional probability of getting the result obtained (or one ‘more extreme’) in a test under the assumption of the null hypothesis. There are many difficulties here: how is ‘more extreme’ defined and measured (obvious for mean scores, less so for a chi-squared goodness-of-fit test)?

Why is a 5% level of significance used, rather than 1% or 10%?

What about my ‘a priori’ belief that the null hypothesis is likely to be true?

Reference

Gorard, S. (2014) ‘The widespread abuse of statistics by researchers: what is the problem and what is the ethical way forward?’, Psychology of Education Review, 38(1), 3-10. Available online at: http://dro.dur.ac.uk/11981/1/11981.pdf?DDD29+ded4ss+d700tmt

I think your a priori belief has to be justified. The p-value is a tool for inference; it does not replace inference. Yes, 0.05 is arbitrary and it’s reasonable to argue for a different value. The key question for me is: if not p-values, then what? If the alternative means more false positives then I’m not keen.

As long as we’re doing frequentist statistics (not ideal, but probably what most people are going to use for at least the next ten years), I would argue for confidence intervals, preferably bootstrap ones. You can still do NHST with them, but they explicitly focus on the effect size and the uncertainty about the effect size. People don’t seem to misinterpret them as often.
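A percentile bootstrap interval of the kind suggested here can be computed with nothing more than resampling. A minimal sketch on made-up scores; group sizes, data and resample count are all illustrative assumptions:

```python
import random

# Hypothetical scores for a control and a treatment group (made-up data).
control = [55, 48, 61, 52, 57, 60, 49, 58, 53, 56]
treatment = [59, 54, 63, 58, 60, 64, 52, 61, 57, 62]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treatment) - mean(control)

# Percentile bootstrap: resample each group with replacement many times
# and take the middle 95% of the resulting differences in means.
random.seed(42)
boot_diffs = []
for _ in range(10_000):
    t = [random.choice(treatment) for _ in treatment]
    c = [random.choice(control) for _ in control]
    boot_diffs.append(mean(t) - mean(c))

boot_diffs.sort()
low, high = boot_diffs[249], boot_diffs[9749]  # 2.5th and 97.5th percentiles
print(f"difference in means: {observed:.1f}, 95% bootstrap CI: ({low:.1f}, {high:.1f})")
```

The interval puts the effect size and its uncertainty on display; whether zero falls inside it answers the significance question as a by-product.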

Gorard is correct in avoiding misleading p-values for observational studies using biased samples and inadequate controls. But he’s throwing the baby out with the bathwater when he rejects them for randomised trials, where a decent effort has been made to control for bias. Readers and referees should check the trial design carefully, and be sceptical if there is an obvious source of bias. But if the design is good they need a p-value, or confidence interval, to avoid assuming that every small chance difference between groups is a real effect of the intervention.

Non-parametric statistics uses the observed data in a random arrangement, repeatedly, to find the cutoff super-p-value, in particular avoiding the problem of small sample sizes. This zero-math post tells all:

https://towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50

Greg,

You should do a piece on statistics once a month. The Gorard paper is odd because it goes from ‘sample statistics can be problematic’ to ‘they should be avoided completely in favor of significance tests’. The example of the sampling of balls from a jar seems to be the basic idea behind all sampling: sure, it only gives us probabilities, but that was the point. Unless you do all experiments on entire populations, you will need some application of sample stats.

A quick search turned up this

https://www.leeds.ac.uk/educol/documents/00002182.htm

which is a useful explanation of the topic from twelve years earlier. It is a shame editors are not doing their job and policing the use of statistics properly. It would also be good to highlight good examples along with the bad.

That didn’t come out right. My point about the sample of balls from the jar was that it is just fine as long as you keep in mind you are generating probabilities not certainties.

For me, the biggest misuses of null-hypothesis significance testing (NHST) involve failure to control for multiple comparisons, either from outcome-switching or from conducting multiple tests. If people pre-register studies, determine ahead of time what comparisons they are going to make, and decide ahead of time how the p-values will be adjusted, NHST can be useful, although I would still prefer results to be reported as point estimates with confidence intervals.

Great points. I would like to share a little gem of a replication/extension by Martens et al. (2010): http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0015280&type=printable

“Attentional blink” (AB) is a perceptual phenomenon afflicting all of us to varying degrees. In brief, when two targets T1 and T2 are presented very nearly together in time (within half a second) in a stream of distracting non-targets, folks typically have difficulty reporting that T2 occurred.

This study models an old-fashioned value that investing time and creative effort in research design is time and creative effort well-spent. The authors not only systematically varied target identities AND T1 duration AND T1-T2 lag time, but used a repeated measures design to cleverly “exploit” the well-established finding that robust individual differences exist in AB. In this way, the authors worked around or altogether purged potential sources of confound to resolve a longstanding cross-modal AB controversy. Their discussion/explanation of the above is masterful.

The authors also reported use of the Greenhouse-Geisser technique (which I’ve never before encountered) to correct some repeated-measures ANOVA p-values. As I understand it, this tool adjusts the degrees of freedom when the ‘sphericity’ assumption of repeated-measures data is violated (i.e. when the variances of the differences between conditions are not all equal), which raises the p-values and thus reduces the chance of Type I error.

Good clarity on what a ‘p’ value is. Its misuse is probably the problem, and separating the technical term ‘statistical significance’ from the effect size and confidence band is part of the solution…even so, I’m a little sceptical of any 1/20 p value…much happier with 1/100.

“(an alternative approach attempts to plug this gap by essentially guessing some of the factors needed to calculate these probabilities but I’m not convinced that this is much of an improvement).”

Except you start doing exactly that when discussing the P4C study:

“In this case, the null hypothesis is highly plausible. I would suggest it is even likely.”

As for p-values, my view on the matter is this:

– Throwing out p-values sounds like a weak attempt by bad researchers to justify throwing insignificant results out into the world. Bad idea.

– Lowering the threshold of 0.05 sounds like a good idea, but it will only encourage bad researchers to p-hack more.

More focus needs to be put (as you rightfully do!) on proper study design and methodology. A p-value is only meaningful if the study has been performed right. (This tells you a lot about people who claim p-values are meaningless…)

Even then, once you get a significant effect, the next question should be “How large is the effect?” Many people confuse a significant effect (a low p-value) with a large effect. This is not the case: if your study is large enough, even tiny differences will be statistically significant. But do you really want to overhaul an entire school system just for a 0.0001% improvement in outcomes?
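The point about large samples is easy to demonstrate: hold a trivially small true difference fixed and watch the test statistic grow with the sample size. A sketch with invented numbers (a true gain of 0.1 points on a test with standard deviation 10):

```python
# With a fixed, practically negligible true difference, the z statistic
# grows with the square root of the sample size, so a big enough study
# makes the difference 'statistically significant'.
def z_statistic(diff, sd, n_per_group):
    se = (sd**2 / n_per_group + sd**2 / n_per_group) ** 0.5
    return diff / se

for n in (100, 10_000, 1_000_000):
    z = z_statistic(0.1, 10.0, n)
    verdict = "significant" if z > 1.96 else "not significant"
    print(f"n = {n:>9,} per group: z = {z:.2f} ({verdict})")
```

The effect is identical in every row; only the sample size changes, which is why significance alone tells you nothing about whether an intervention is worth adopting.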

Crucially, my subjective judgement about the likelihood of P4C affecting reading and maths scores is available for all to see as a subjective judgement. It is not something that I have buried in a formula and given a mathematical sounding name. So there is less chance of thinking we know something that we don’t.
