RCTs and p’s


Randomised controlled trials (RCTs) are a scientific approach to researching cause and effect in medicine and, less commonly, education. The idea is to randomly assign test subjects to one of two or more conditions. In education, our test subjects would usually be students, although they could be randomised at a number of levels. If you have enough schools, you could randomise some schools into one condition and other schools into a different condition. These are the kinds of large-scale RCTs that the Education Endowment Foundation (EEF) in England carries out and that Evidence for Learning (E4L) conducts in Australia. In my own research, I randomise individual students into one of two conditions. This enables me to run far smaller RCTs.
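
The idea of randomising at different levels is easier to see with a toy example. The following Python sketch (my own, with invented student and school names) assigns a hypothetical roster of students to conditions individually, and then assigns whole schools to conditions as a large-scale trial would:

    import random

    random.seed(1)  # fixed seed so the illustration is reproducible

    # Individual-level randomisation: each hypothetical student gets a condition
    students = [f"student_{i:02d}" for i in range(1, 21)]
    shuffled = random.sample(students, k=len(students))
    half = len(shuffled) // 2
    assignment = {name: "intervention" for name in shuffled[:half]}
    assignment.update({name: "control" for name in shuffled[half:]})

    # Cluster-level randomisation: whole schools (or classes) receive one condition
    schools = ["school_A", "school_B", "school_C", "school_D"]
    shuffled_schools = random.sample(schools, k=len(schools))
    cluster_assignment = {
        name: ("intervention" if i < len(schools) // 2 else "control")
        for i, name in enumerate(shuffled_schools)
    }

    print(assignment)
    print(cluster_assignment)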

The conditions should represent the thing you’re interested in testing. For instance, you might want to give your experimental group a differentiated instruction programme that you have developed.

One of the conditions should be a control group and this is where the devil often hides. If your control condition is rubbish – for instance, if it doesn’t consist of teaching children anything relevant to your outcome assessment – then you are likely to get a positive result for your experiment, regardless. Even a ‘business-as-usual’ control condition may fall short because, by contrast, the intervention is a novelty, and novelty alone can generate a placebo effect.

So the control needs to be ‘active’, representing a good alternative. Few education studies use an active control and many introduce additional factors that confound the result. That’s why, although not a magic bullet, I like the idea of having two experimental conditions. If they both outperform the control then we can look at which one outperforms it the most.

But this leads to a crucial, almost philosophical, question: How do we decide if an experimental condition has outperformed a control condition (or any alternative experimental condition)? At first sight, the answer seems pretty simple – check to see which group gets the best results on the outcome assessment. And it is perhaps that simple, provided that you repeat your experiment plenty of times and always get the same result. Why would we need to repeat? The difference could have arisen by chance. It is, after all, unlikely that any two mean scores from different groups of students will be exactly the same. However, it’s not always practical or desirable to keep running the same experiment, even if a few more replications of this kind would be a good thing for the field.
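
To see how easily a difference between means can arise by chance alone, here is a minimal simulation (my own, with made-up score distributions): both ‘groups’ are drawn from exactly the same population, yet their mean scores will almost never match.

    import random
    import statistics

    random.seed(42)

    def simulated_scores(n=30, mean=60, sd=15):
        """Draw n test scores from one and the same normal distribution."""
        return [random.gauss(mean, sd) for _ in range(n)]

    # Both 'groups' come from an identical population, so the difference
    # in their mean scores below is pure sampling noise.
    group_a = simulated_scores()
    group_b = simulated_scores()

    difference = statistics.mean(group_a) - statistics.mean(group_b)
    print(f"Mean difference with no real effect present: {difference:.2f}")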

This is where p-values come in. A p-value is a probability. In null hypothesis testing, it is the probability that the difference between the means that the experimenters obtained would be as large as it is, or even larger, if there really was no effect of the intervention when compared to the control. If that sounds like quite a useful thing to calculate then that’s because it is. It can help inform your decision as to whether there is an effect.
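
One way to make that definition concrete is a simple permutation test. In the sketch below (a minimal illustration with invented scores, not data from any real trial), shuffling the group labels enforces ‘no effect’, and the p-value is the fraction of shuffles whose mean difference is at least as large as the one actually observed.

    import random
    import statistics

    random.seed(0)

    # Made-up outcome scores for an intervention group and a control group
    intervention = [68, 74, 71, 80, 65, 77, 73, 70, 79, 72]
    control = [64, 70, 66, 75, 61, 72, 69, 67, 71, 68]

    observed = statistics.mean(intervention) - statistics.mean(control)
    pooled = intervention + control
    n_int = len(intervention)

    n_perms = 10_000
    at_least_as_large = 0
    for _ in range(n_perms):
        random.shuffle(pooled)  # shuffling the labels simulates 'no effect'
        perm_diff = statistics.mean(pooled[:n_int]) - statistics.mean(pooled[n_int:])
        if abs(perm_diff) >= abs(observed):
            at_least_as_large += 1

    p_value = at_least_as_large / n_perms
    print(f"Observed difference: {observed:.2f}, permutation p-value: {p_value:.3f}")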

Another way to think about it is as an error rate. If you set your acceptable p-value at 0.05 (i.e. 1 in 20) then this means that you will end up falsely declaring a positive result in one in 20 of your experiments in which there really was no effect. Add in replication and this seems like a reasonable way to proceed. If you want to, you can set a lower p-value such as 0.01 or 0.005. However, finding effects at those thresholds requires very large groups of subjects or very large differences between the groups, both of which present their own problems.
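
You can check that ‘1 in 20’ figure for yourself by simulating many experiments in which the null hypothesis really is true and counting how often p falls below 0.05. The sketch below uses made-up normal scores and scipy’s independent-samples t-test; it is an illustration of the error-rate idea, not an analysis from any real study.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(123)

    n_experiments = 5_000
    false_positives = 0

    for _ in range(n_experiments):
        # Both groups are drawn from the same distribution, so the null is true by construction
        group_a = rng.normal(loc=60, scale=15, size=30)
        group_b = rng.normal(loc=60, scale=15, size=30)
        _, p = stats.ttest_ind(group_a, group_b)
        if p < 0.05:
            false_positives += 1

    # Should hover around 0.05, i.e. roughly 1 'significant' result in 20
    print(f"Proportion of significant results under a true null: {false_positives / n_experiments:.3f}")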

Of course, a p-value is not everything. It won’t fix the problem of a crappy control. And if you run 20 experiments or, more commonly, measure 20 different things in one experiment and then only report one p-value for one measure, you could well be reporting a false positive. This is why many researchers are now calling for trials, and the measures they intend to report at the end of those trials, to be registered in advance.
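
To put a rough number on that risk: if the 20 measures were independent and each were tested at p < 0.05, the chance of at least one false positive when every null hypothesis is true is 1 - 0.95^20, or roughly 0.64. Real measures are rarely fully independent, so treat this as an illustration rather than a precise figure.

    alpha = 0.05
    n_tests = 20

    # Chance of at least one false positive across 20 independent tests,
    # assuming the null hypothesis is true for every one of them
    p_at_least_one = 1 - (1 - alpha) ** n_tests
    print(f"{p_at_least_one:.2f}")  # roughly 0.64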

There is also a group of researchers who suggest that there is a general misunderstanding of what p-values represent. They argue that everyone else thinks a p-value gives the probability of the null hypothesis being true when that’s not what it actually is.

This has caused some journals to go as far as banning p-values altogether. But this raises the question of what to replace them with. Without a good answer, the likely result is more false positives. That hardly seems like an advance.


12 thoughts on “RCTs and p’s”

  1. As always there is much to like, but there would appear to be an issue with the use of p-values.

    In the post the following statement is made

    A p-value is a probability. In null hypothesis testing, it is the probability that the difference between the means that the experimenters obtained would be as large as it is, or even larger, if there really was no effect of the intervention when compared to the control. If that sounds like quite a useful thing to calculate then that’s because it is. It can help inform your decision as to whether there is an effect.

    Unfortunately, according to the American Statistical Association (https://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108?needAccess=true) the statement re p-values is incorrect, as a p-value measures the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

    The ASA go on to state that: P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Mistakenly, researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.

    Again according to the ASA: a p-value, or statistical significance, does not measure the size of an effect or the importance of a result. Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise. Similarly, identical estimated effects will have different p-values if the precision of the estimates differs.
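
    To illustrate that last point with some made-up numbers of my own (this sketch assumes numpy and scipy are available): once the samples are large enough, even a trivially small difference produces a very small p-value.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(7)

      # A 'true' difference of 0.5 marks on a test with SD 15, practically negligible
      intervention = rng.normal(loc=60.5, scale=15, size=200_000)
      control = rng.normal(loc=60.0, scale=15, size=200_000)

      t, p = stats.ttest_ind(intervention, control)
      print(f"t = {t:.2f}, p = {p:.2g}")  # p is tiny despite the trivial effect size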

    If we turn to Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S. and Altman, D. (2016). Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations. European Journal of Epidemiology, 31(4), 337-350, we can see some useful explanations of common misconceptions.

    Misconception No 1 The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true. No! The P value assumes the test hypothesis is true—it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation.

    Misconception No 7 Statistical significance indicates a scientifically or substantively important relation has been detected. No! Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis. Again, a small null P value simply flags the data as being unusual if all the assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.

    P values are tricky things and as Greenland et al state – there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature.

    As for what should replace p-values – my understanding is that we should be looking at effect sizes and practical significance – but that’s a completely different post.

    1. You said
      “Greg said
      “In null hypothesis testing, it is the probability that the difference between the means that the experimenters obtained would be as large as it is, or even larger, if there really was no effect of the intervention when compared to the control.”
      But the ASA says
      “a p-value measures the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.””

      I’m sure I’m not the only one that looked at this and said this is the same thing.

      1. I share your confusion. I think there are two issues here that we may be conflating. It is perfectly reasonable to disagree with me about the usefulness of p-values, as it would be to disagree about any other statistical test. However, that is not the same as suggesting that I have written something that is factually incorrect. If this is the case then I would like to know what it is so that I can learn from this. However, as you state, Mitch, it’s not clear that my statement is at odds with the ASA. This conflict needs to be explained in order to demonstrate my error.

        Are the statements about “Misconception No 1” and “Misconception No 7” supposed to relate to what I have written? I have not claimed that p gives the probability that the null hypothesis is true and I have not claimed that a low p-value “indicates a scientifically or substantively important relation has been detected”. So this feels a little like a straw man argument, unless I am missing something.

        It is genuinely baffling.

      2. My understanding is that these things are not the same.

        ‘it is the probability that the difference between the means that the experimenters obtained would be as large as it is, or even larger’ – my understanding is that a p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. The issue here is that the p-value is about the statistical model, not the relationship between the intervention and the so-called effect.

        ‘if there was really no effect of the intervention when compared to the control’ – again, as I understand it, a p-value cannot tell you whether there was really no effect, as all a p-value does is tell you about the compatibility between the data and what is predicted by a statistical model. As Greenland et al state, ‘it says nothing specifically related to that hypothesis unless we can be completely assured that every other assumption used for its computation is correct’, and most of the time we do not know this.

        I hope this makes sense

        Gary

      3. I have used the term ‘effect’ in the same way as the ASA when it states “the null hypothesis postulates the absence of an effect”.

  2. I have not written anything that is incorrect. I have not claimed that p measures the probability that the null hypothesis is true. What I have written is consistent with this statement from the ASA:

    “The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.”

    1. Hi

      I’ve been re-thinking the post and what you have written, and possibly the issue lies in the phrasing of two statements, which may lead to competing interpretations of what is meant. In other words, as Greenland argues, there are no interpretations of these terms which are simple, intuitive, correct and foolproof.

      So let’s take each of the statements in turn

      Statement one

      A p-value is a probability. In null hypothesis testing, it is the probability that the difference between the means that the experimenters obtained would be as large as it is, or even larger, if there really was no effect of the intervention when compared to the control. If that sounds like quite a useful thing to calculate then that’s because it is. It can help inform your decision as to whether there is an effect.

      For the reader to fully make sense of this paragraph, certain assumptions are made, e.g. that the reader

      has an awareness of what p-values are
      understands the notion of probability
      correctly understands null hypothesis testing
      understands how to interpret the results of null hypothesis testing
      understands the relationship between null hypothesis testing of a particular statistical model and the relationship between an intervention and its results
      is aware of the use of p-values to inform a decision as to whether there is an effect

      However, it would appear that the way in which the statement is phrased creates the impression that a ‘traditional’ notion of p-values is being used rather than the one used by both the ASA and Greenland et al.

      So let’s now explore individual sections of the paragraph

      It is the probability that the difference between the means that the experimenters obtained would be as large as it is, or even larger …

      This phrase as it stands would appear to be incomplete, as it does not make reference to a specified statistical model and underlying assumptions. As such, the phrase used creates the impression that what is being measured is the relationship between the intervention and the outcome – rather than the relationship between the data observed and the statistical model used.

      …. if there was really no effect of the intervention when compared to the control

      Again, from my reading of the ASA guidance, p-values cannot tell you whether there is no effect of the intervention compared to the control. All a p-value tells you is the compatibility of the results with the underlying statistical model. I suppose it’s a bit like confusing the map with the territory – with the calculation of the p-value being the map and the relationship between the intervention and the effect being the territory.

      ….it can help inform your decision as to whether there is an effect.

      Again, my reading of the ASA guidance would suggest that this statement is potentially misleading, as a p-value, or statistical significance, does not measure the size of an effect or the importance of a result, and by itself a p-value does not provide a good measure of evidence regarding a model or hypothesis.

      So maybe an additional sentence should have been added to the statement, stating that on its own the p-value tells you very little.

      Statement two

      Another way to think about it is as an error rate. If you set your acceptable p-value at 0.05 (i.e. 1 in 20) then this means that you will end up falsely declaring a positive result in one in 20 of your experiments in which there really was no effect.

      Again, from my reading of this statement – and bearing in mind it comes after my reservations about the statement regarding p-values – it creates the impression that you are arguing that the probability that you are in error when declaring a positive result is 0.05 or 5%. Whereas, if the null hypothesis is in fact true and you reject it, the chance you are in error is 1, or 100%.

      To conclude

      It seems to me that these statements are phrased in such a way that teachers who do not have a sound grounding in statistics may continue to ‘believe’ in some of the misconceptions identified by the ASA and Greenland. Another way of looking at it is that one of the roles of edu-bloggers in knowledge brokering is to make things as simple as we possibly can, but no simpler. On this occasion, it’s my view that you may have fallen on the wrong side of this divide.

      Regards

      Gary

      1. You seem to have shifted from claiming that I have made errors to stating that my explanations are misleading. I disagree.

        Your statement about error rates is false. How can I ‘create the impression’ of something I’ve not said? I have not claimed it is the absolute error rate; I have claimed it is the error rate for studies where the null hypothesis is true.

  3. Greg

    We could spend a lot of time arguing that we don’t understand what each other is claiming, or that you and I have shifted positions, etc. On the other hand, we could try to come up with statements which we both agree on. So with that in mind, I’ve taken your first statement and re-written it using extracts from the ASA guidance, Wasserstein’s guidance notes and Greenland et al.

    So here goes

    A p-value is a probability. It is the probability under a specified statistical model that a statistical summary of the data (e.g., the difference between the means that the experimenters obtained) would be equal to or more extreme than its observed value. The most common context in which p-values are used is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. However, an individual p-value does not measure the size of the effect or the importance of the result. Indeed, by itself an individual p-value does not provide a good measure of evidence regarding a model or hypothesis and should only be used in conjunction with other evidence when making a decision.

    What do you think?

    Gary
    PS I haven’t had time to do the same with the second statement I have referred to – hope to do that later today

  4. Greg, there is a lot of discussion to be had on the subject of p-values but, stepping back a little: when you look at actual results from a control group and a test group (or between two test groups if you are, say, comparing two teachers), do you ever produce ‘significant’ p-values? I ask because I have tried this numerous times with my AS/A2 classes (e.g. direct instruction vs. flipped classroom, CAST vs. graphs for solution of trig equations) and I have never found any significant difference between sets of results. The reason is that my results are always so ‘noisy’ (e.g. students’ results ranging from 2% to 100%) that any differences that may be there are hidden in the variation.
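
    As a rough sketch of why I think this happens (invented numbers, not my actual classes, and assuming numpy and scipy): with small groups and marks spread that widely, even a genuine difference of a few percentage points only rarely clears p < 0.05.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(2024)

      n_trials = 5_000
      significant = 0
      for _ in range(n_trials):
          # Two small classes, a genuine 5-point advantage, but very noisy marks
          flipped = np.clip(rng.normal(loc=60, scale=25, size=15), 0, 100)
          direct = np.clip(rng.normal(loc=55, scale=25, size=15), 0, 100)
          _, p = stats.ttest_ind(flipped, direct)
          if p < 0.05:
              significant += 1

      # With this much noise and so few students, the real difference is usually missed
      print(f"Proportion of runs reaching p < 0.05: {significant / n_trials:.2f}")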
