My two favourite kinds of evidence

One of the things that supposedly makes me a despicable ‘neo’ traditionalist as opposed to a cuddly and buffoonish ‘traditionalist’ is my interest in empirical evidence. The traditionalist needs only the philosophy of Edmund Burke and a fetish for musty leather armchairs. No evidence required. I, on the other hand, want to slice and dice the human endeavour of education like a vivisectionist who kills the object of his inquiry.

Yet I don’t quite see it this way. I think good evidence is helpful to the advancement of an argument, particularly when we are being assaulted with bad evidence all of the time. And that’s the rub: I am not an evidence glutton, furiously feasting upon all the data that is out there; I am an evidence gourmet, seeking the best quality that I can find.

Currently, I think there are two kinds of empirical evidence that have the most to offer. It is not that other kinds of research can’t do the business, it’s more that they don’t tend to.

Small-scale randomised controlled trials

Randomised controlled trials (RCTs) are meant to be the gold standard of education research. Randomly assign students, classes or schools to either an experimental group or a control group and you should be able to see the relative effect of the experimental condition.

Yet there are a number of issues with this model. In large-scale RCTs, the researchers typically compare an intervention package with business-as-usual, and the problem here is that more than one thing is varying at a time. For instance, imagine an intervention that involved additional maths problem-solving classes over a period of 12 weeks and that shows a positive effect compared to a do-nothing control. Can we conclude that it was the problem-solving? Perhaps it was the extra classes? It is also impossible to ‘blind’ such a trial, so teachers and students will know whether they are receiving the intervention, and we might just be witnessing a placebo effect.

There is a relatively easy solution to this: three conditions. One could be business-as-usual, another could be the problem-solving intervention and a third could be an intervention of equivalent duration based on a different approach, such as mastery. We would be likely to find positive effects for both interventions, but the relative effects would tell us something about the specific approach. Despite there being nothing methodological to prevent this strategy, large-scale RCTs don’t tend to adopt it.
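To make the logic of the three-condition design concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the effect sizes (`extra_time_effect`, `approach_effect`), the score scale and the function name are invented, and no real trial data is involved. It simply shows why a comparison against business-as-usual bundles two things together, while a comparison against an equal-duration alternative isolates the approach.

```python
import random
from statistics import mean

ARMS = ("business_as_usual", "problem_solving", "mastery")

def simulate_three_arm_trial(n_per_arm=5000, extra_time_effect=3.0,
                             approach_effect=2.0, seed=0):
    """Return the mean score per arm for a made-up three-arm trial.

    The effect sizes are hypothetical numbers, not estimates from any study.
    """
    rng = random.Random(seed)
    # Each student has a baseline score, unrelated to the arm they end up in.
    baselines = [rng.gauss(50, 10) for _ in range(n_per_arm * len(ARMS))]
    rng.shuffle(baselines)  # random assignment: shuffle, then deal into arms
    scores = {arm: [] for arm in ARMS}
    for i, baseline in enumerate(baselines):
        arm = ARMS[i % len(ARMS)]
        score = baseline
        if arm != "business_as_usual":
            score += extra_time_effect  # both interventions add class time
        if arm == "problem_solving":
            score += approach_effect    # assumed edge of this approach
        scores[arm].append(score)
    return {arm: mean(values) for arm, values in scores.items()}

means = simulate_three_arm_trial()
# Two-arm comparison: extra time and approach are mixed together.
vs_control = means["problem_solving"] - means["business_as_usual"]
# Three-arm comparison: same duration, so only the approach differs.
vs_active = means["problem_solving"] - means["mastery"]
print(f"vs business-as-usual: {vs_control:.2f}")
print(f"vs equal-duration alternative: {vs_active:.2f}")
```

In this toy set-up the first difference reflects the sum of the two assumed effects, whereas the second reflects only the assumed effect of the approach itself, which is the quantity the three-condition design is meant to expose.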

By comparison, small-scale RCTs are usually more scientific. These are conducted by psychologists and the more science-loving end of the education research spectrum. There is a tendency to assume that all test subjects in such RCTs are undergraduates, and psychology students are clearly well represented. However, many such studies have also taken place in schools. These studies tend to be less ambitious in scope, but this means that we can generally home in on cause and effect rather more convincingly. It is also less obvious to a student who has to read a passage and answer a question five times that they are in the experimental group, whereas their friend who has to read five passages and then answer five questions is in the control group.

Yes, you could argue that small-scale RCTs are a little artificial, or that long-term effects are different to the shorter-term effects that are typically captured in such studies, but I find the proposed mechanisms for this to be unconvincing and in need of some robust empirical evidence of their own.

Natural experiments

This probably seems like a lurch in a completely different direction but I also think that natural experiments have a lot to offer. The beauty of collecting data from the wild is that you know that it is ‘ecologically valid’. In other words, you are measuring things that happen in real classrooms rather than an idealised, researcher-driven version.

The big flaw, of course, is that the evidence is a correlation. The fact that we have not randomly assigned participants to different conditions means that there could be some other factor that selectively acts on one group but not the other and that is not captured by the data. Yet when these large-scale correlational studies seem to triangulate with the results of small-scale RCTs, which they often do, we should sit up and take some notice.

An example of such a natural experiment is the TIMSS study carried out by Schwerdt and Wuppermann. They took advantage of the nature of TIMSS assessments. These are international tests that are similar to the better-known PISA tests, but they only test maths and science whilst also collecting lots of questionnaire data about the students’ school experiences. The researchers were able to use the data set to find students who had experienced a different teaching style in their maths lessons compared to their science lessons. They found that when a ‘lecture style’ was used in one class but not the other, the student performed better in the subject taught in this way.
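The appeal of comparing the same student across two subjects can be illustrated with a small simulation. This is only a sketch under loud assumptions: the data are synthetic, the numbers (`true_effect`, the ability scale, the probabilities of meeting a lecture style) are invented, and the code is not the authors' actual analysis. It assumes that more able students are more likely to encounter a lecture style, which biases a naive comparison, and then shows that looking only at differences within each student removes that particular bias.

```python
import random
from statistics import mean

def simulate_students(n_students=20000, true_effect=5.0, seed=1):
    """Build synthetic maths and science records; all numbers are made up."""
    rng = random.Random(seed)
    students = []
    for _ in range(n_students):
        ability = rng.gauss(0, 1)  # unobserved by the researcher
        # Assumption: abler students meet a lecture style more often.
        p_lecture = 0.7 if ability > 0 else 0.3
        record = {}
        for subject in ("maths", "science"):
            lecture = rng.random() < p_lecture
            score = 500 + 50 * ability + true_effect * lecture + rng.gauss(0, 20)
            record[subject] = (lecture, score)
        students.append(record)
    return students

def naive_estimate(students):
    """Compare all lecture-taught scores with all other scores."""
    lectured, not_lectured = [], []
    for record in students:
        for lecture, score in record.values():
            (lectured if lecture else not_lectured).append(score)
    return mean(lectured) - mean(not_lectured)

def within_student_estimate(students):
    """Use only students whose two subjects were taught differently."""
    diffs = []
    for record in students:
        (m_lecture, m_score), (s_lecture, s_score) = record["maths"], record["science"]
        if m_lecture != s_lecture:
            # Lecture-taught subject minus the other subject, same student.
            diffs.append(m_score - s_score if m_lecture else s_score - m_score)
    return mean(diffs)

students = simulate_students()
naive = naive_estimate(students)
within = within_student_estimate(students)
print(f"naive comparison: {naive:.1f}")
print(f"within-student comparison: {within:.1f}")
```

Because each student's ability appears in both of their scores, it cancels in the within-student difference. What does not cancel is anything that differs between the two teachers, so a sketch like this says nothing about confounding at the teacher level.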

Of course, we can’t rule out the possibility that, for instance, better teachers were more likely to use a lecture style. You can wonder about that for a while if you like. This is the nature of correlational studies. However, it does seem to replicate what we would expect to find if we were to take small-scale RCTs of the worked example effect and extrapolate them.


There is some pretty poor research in education, sadly. There’s a lot of impenetrable, jargon-laden sociology and there are also empirical studies that seemingly offer us very little. For instance, a famous set of maths studies compared the teaching of maths in one school with the teaching of maths in another school and then drew conclusions about how the style of teaching caused differences between the results, as if there were no other factor that might vary between two schools. Go figure.


3 thoughts on “My two favourite kinds of evidence”

  1. geoffjames42 says:

    As in your last but one paragraph, replication is the devil in the detail. It seems that replication, necessary to confirm RCT evidence, is more often than not unachievable in psych research and correlation is king. Psych journals not taking stats as confirmation of causation is highly significant (I think that might be some kind of technical pun). What does this mean? Causality is too slippery to catch/can be inferred but not confirmed/it’s time people got a grasp of critical realism and ontology? All good stuff, keep wondering … Geoff

  2. I assume you’re referring in your last paragraph to Jo Boaler’s paper that has garnered over 500 citations. One of the most appalling pieces of “research” I’ve ever read. That study can simply tell you nothing of any worth about anything. In fact, an unbiased reading of the evidence leads you to the exact opposite of the conclusion that she drew: that discovery-based methods are harmful to learning.
