Filling the pail

Experiments aren’t everything

In my own PhD research, I run randomised controlled trials (RCTs). These involve setting up two or more experimental conditions, varying only one factor between them and then randomly assigning subjects – in this case, students – to each of the conditions. RCTs are considered the gold standard for working out if one thing causes another because you manipulate just that one thing and nothing else. By randomly assigning students, we know that there are no other systematic differences between the members of the groups that could account for any difference in outcomes.
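To make the point about random assignment concrete, here is a rough sketch in Python. The prior-attainment scores are invented purely for illustration, and the `randomise` helper is my own hypothetical function rather than anything from a real trial. It shuffles 1,000 imaginary students into two groups and checks how far apart the groups are on a characteristic that we did not manipulate.

```python
import random
import statistics

def randomise(students, rng):
    """Randomly split a list of students into two equal-sized groups."""
    shuffled = students[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(0)              # fixed seed so the sketch is repeatable

# Hypothetical prior-attainment scores for 1,000 students (made-up numbers).
prior = [rng.gauss(50, 10) for _ in range(1000)]

treatment, control = randomise(prior, rng)
gap = statistics.mean(treatment) - statistics.mean(control)
print(f"difference in mean prior attainment: {gap:.2f}")
```

With groups of this size, the gap in prior attainment should be small relative to the spread of scores, and any gap that remains is down to chance rather than to the way the groups were formed.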

You may therefore expect me to be an evangelist for experiments. You might expect me to take a dim view of other ways of trying to establish cause and effect. But that’s not quite right.

I am also impressed by correlations, and the evidence they provide adds to what we already know in education. It is true that correlation does not equal causation, but this is the starting point for a discussion rather than the end point. I mentioned correlational evidence on Twitter recently and Dylan Wiliam responded with a link to this paper by Austin Bradford Hill, written in 1965 and addressing correlations in medicine, a field commonly known as ‘epidemiology’. It makes for an interesting read.

It’s worth outlining the key problem with correlations: we might find that as one thing changes, another thing also changes, rising or falling with it. However, this does not necessarily mean that the first thing caused the change in the second thing. Three possibilities are worth illustrating:

  1. There is no relationship and it’s just chance that the two things correlate. I may, for instance, note a pattern where shorter teachers tend to have more pens in their pockets than taller teachers, but this may just be due to the particular teachers I have sampled. If we repeated the observation with a different group of teachers we may find no pattern or a reversal of the pattern.
  2. Changes in both things are actually caused by a third factor. An example of this might be the discovery that living in Florida correlates with an increased risk of dementia when compared to the U.S. average. In this case, living in Florida does not cause dementia. Instead, the likely cause of Florida’s increased rate of dementia is that Florida has a larger proportion of senior citizens than other states, and senior citizens are more likely to get dementia.
  3. The arrow of causation may point in a different direction to the one we assume. For instance, we might see a correlation between students’ motivation for mathematics and their maths ability. Perhaps motivation causes students to work harder and this increases their ability. Alternatively, being more able at mathematics might cause students to be more motivated. Both sound plausible and there may even be a causal arrow pointing in both directions, a virtuous circle.
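The second possibility, a third factor, is easy to demonstrate with a small simulation. What follows is only a sketch with invented rates: the proportion of seniors, the chance of living in the state and the dementia risks are all assumptions I have picked for illustration, not real figures for Florida. In this imaginary population, dementia risk depends only on being a senior and never on where someone lives.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def simulate_person(rng):
    """Return (senior, in_state, dementia) for one imaginary person."""
    senior = rng.random() < 0.20                      # assumed: 20% are seniors
    # Assumed: seniors are more likely to live in the sunny state.
    in_state = rng.random() < (0.30 if senior else 0.10)
    # Assumed: dementia risk depends on age only, not on location.
    dementia = rng.random() < (0.10 if senior else 0.005)
    return senior, in_state, dementia

people = [simulate_person(rng) for _ in range(200_000)]

def rate(rows):
    """Proportion of the given people who have dementia."""
    rows = list(rows)
    return sum(d for _, _, d in rows) / len(rows)

# Raw comparison: residents of the state versus everyone else.
overall_in = rate(p for p in people if p[1])
overall_out = rate(p for p in people if not p[1])

# Like-for-like comparison: seniors only, inside and outside the state.
senior_in = rate(p for p in people if p[0] and p[1])
senior_out = rate(p for p in people if p[0] and not p[1])

print(f"all residents: {overall_in:.3f} vs elsewhere: {overall_out:.3f}")
print(f"seniors only:  {senior_in:.3f} vs elsewhere: {senior_out:.3f}")
```

The raw comparison should show a clearly higher dementia rate in the state, even though location has no effect in the simulation. Once we compare seniors with seniors, the gap should all but vanish, which is what we would expect if age is the third factor doing the work.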

Given such issues, why not toss out correlational research altogether and simply conduct experiments?

The answer is that experiments are hard to do. Big, long experiments are particularly hard to do, so if we want to know the effect of a policy change in a state education system then running a true experiment is virtually impossible. This would not be such a problem if experiments were perfectly scalable: if small, short experiments just generated little versions of the results we get with bigger, longer ones. But there is much reason to doubt this. Lab-based findings rarely have a smooth path towards implementation as a long-term policy change.

In contrast to the difficulty of running large experiments, it’s pretty easy to amass correlational data and it’s getting easier all the time in this data-rich age. Correlations can also circumvent some of the ethical issues with experiments, such as when one group of students has to miss out on a promising intervention in order to act as a control. You can also often have a sort-of control group for correlational data: a ‘quasi-experimental’ design. For instance, regression discontinuity is a technique that exploits a threshold where a small change causes an individual to flip from one category to another. Imagine two children, the first of whom is born on the 31st of December and the second on the 1st of January. If the cut-off date for school entry is the 1st of January then, although the two children are almost exactly the same age, the December child will start school a year earlier and so, at any given age, will have had a whole year more of schooling. A different kind of quasi-experiment might involve two neighbouring districts adopting the same policy change at different times, with the late adopter then acting as a control. These two examples are drawn from a paper by Stuart Ritchie and Elliot Tucker-Drob that analyses the effect of education on general intelligence.

Correlations also have the advantage of testing real-world examples. In education, we are plagued by bad experiments where a gold-plated version of the favoured intervention is tested against a do-nothing or bog-standard control. It is probable that an inferior teaching method, delivered with lots of thought and plenty of commitment, will fare better than a mediocre enactment of a technically superior teaching method. Correlations can tell us something about everyday, ordinary examples of the two approaches under investigation e.g. this study of science teaching methods.

However, we are still left with the cause and effect problem. Bradford Hill offers some useful suggestions for evaluating correlational data but some of this is clearly most relevant to medicine and public health. I would like to focus on just a few things that I would look for when assessing the validity of inferring cause and effect from a correlation.

Key is what Bradford Hill refers to as ‘consistency’ and what we might also term ‘replication’. If we see this correlation in a range of different situations then we can probably rule out the idea that it’s a chance finding. For instance, if three quite different states adopt the same education policy at different times and, subsequently, maths scores rise in each of these states then that would seem to be telling us something. It is particularly convincing if we can take a correlation and replicate it in an experiment.

An example of this would be the process-product research of the 1960s and 1970s that sought to correlate various teacher behaviours with test score rises. A number of behaviours were identified that we might broadly term ‘explicit teaching’. However, these could just have been proxies: a particular teacher personality type, for instance, might have caused teachers to teach in a particular way and also have caused the test score gains. To try to figure this out, we could and should apply a plausibility test: is it plausible that teacher behaviours cause student learning? Of course it is. However, that still doesn’t rule out a third factor.

Which is why a number of researchers set up experiments (e.g. here) where they taught teachers these behaviours and then looked to see if these teachers’ students performed better than a control group. We still have a problem if these experiments are badly designed but if we have a large number of correlations and reasonably well-designed experiments all pointing the same way then I think it is reasonable to infer a cause and effect relationship.

Ultimately, our inferences should depend upon triangulation. It is about more than exactly replicating an experimental finding. To be reasonably sure of a cause and effect relationship, we need to see similar effects in a range of different correlations of different designs, sizes and durations, ideally supported by experimental evidence. It’s a lot to ask for but I think we have the tools at our disposal to amass such evidence in a way that is relevant to common debates about teaching approaches.