It can be hard to make sense of an education research paper and there are certainly those who think teachers shouldn’t worry their pretty little heads about them. Yet if we intend to claim our profession back from quixotic consultants then it’s something that more of us need to do.
There are plenty of dense papers written that aim to show that the curriculum is neoliberal or that tests are phallic. There are also many studies that seek to find correlations. I’m not going to discuss these here. Instead, I am going to look at experimental research.
Experiments offer us the chance to make meaningful inferences – although it’s only a chance. Experiments need to compare a particular approach with an alternative. For instance, one group of students might be taught maths in the morning and the other in the afternoon. You can then judge the difference that this does or does not make.
1. Read the abstract, methods then discussion
Once I have established that a paper has been peer-reviewed (ERIC can help determine this – I really wouldn’t bother with papers that haven’t passed this level of professional quality control), I read it in this order: abstract, methods, discussion. The results section often becomes pretty technical and hard to follow but it’s still worth a look in case something pops out – sometimes there are interesting non-effects hiding in there.
What claim is being made in the abstract? Key results like effect sizes and tests of significance should be in there. If not, we might wonder why. We’ll return to these later.
Is the claim that students simply enjoy a particular teaching method? If so, I wouldn’t bother too much. Anything can be made enjoyable or boring for the duration of most experiments. Focus more on claims that one method leads to more or better learning.
2. How are participants selected?
The best way to allocate participants to the groups in a study is to do so randomly. It is then a matter of chance whether students end up in the experimental or comparison groups. We can still make useful inferences from non-random studies but we do have to be aware of possible bias.
If Mrs Smith’s class get the experimental biology teaching condition and Mr Jones’s class do not then in what other ways do these classes vary?
We might try to control for this by doing a pre-test and showing that both groups start out with the same science knowledge. But what if it turns out that Mr Jones’s class is scheduled at the same time as Latin so all the Latin students are in Mrs Smith’s class? And the biology topic involves lots of Latin words…
It’s almost impossible to anticipate all of the ways in which groups may systematically vary, which is why randomly assigning participants to groups is better.
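As a rough illustration of what leaving group membership to chance looks like in practice, here is a minimal Python sketch. The function name `assign_randomly`, the student labels and the seed are all hypothetical, invented for this example; real studies often randomise whole classes or schools rather than individual students.

```python
import random

def assign_randomly(students, seed=None):
    """Shuffle the students and split them into two groups of (near) equal size."""
    rng = random.Random(seed)      # a fixed seed makes the allocation reproducible
    shuffled = list(students)      # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical roster of 30 students.
students = [f"student_{i}" for i in range(1, 31)]
experimental, comparison = assign_randomly(students, seed=42)
print(len(experimental), len(comparison))
```

Because nothing about a student influences which group they land in, differences such as who happens to study Latin should balance out on average, although any single allocation can still be uneven.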
3. What does the comparison condition look like?
Possibly the biggest problem that we have is poor quality comparisons. This links to point 4 on ensuring a fair test.
Imagine researchers go into a school and ask for teachers to volunteer to take part in a study. These teachers end up delivering the experimental condition and are compared against the other teachers who deliver the old course. Teachers who volunteer may well be more enthusiastic or interested in the process of teaching. They may simply be better teachers. This, rather than any detail of the experimental approach, might account for a positive result.
Many studies use comparison conditions that do not represent the best possible alternative. For instance, ‘active learning’ conditions involving college students are often compared to lectures, and the studies find a positive effect. This is fine as far as it goes, but we are not justified in inferring that active learning is more effective than direct instruction, because direct instruction is not lecturing.
4. Is only one thing being varied between the conditions?
A key principle of science is that we vary only one thing at a time. In the social sciences, this prompts the question: what is a thing? Often a whole package of measures is compared to not doing anything at all. This may still be of use but it can also lead us into error.
For instance, imagine our package involves giving teachers professional development (PD) on maths and then asking them to teach using an abacus. If we see an effect then it’s tempting to attribute this to the use of the abacus. Yet we know that simply giving teachers PD – making them think more deeply about maths teaching – is likely to have an effect. We might reason that this will be even more likely in elementary schools where teachers are often non-specialists. A fair test of the abacus would need to have a comparison group where teachers received the same amount of maths PD but didn’t use the abacus.
You see a similar problem when children in an intervention group simply get more of something e.g. additional reading tuition. Is it the form of the tuition or just the amount? I’ve also seen studies where the experimental and control groups are taught different topics in different ways. It’s hard to conclude anything from that.
5. What’s the outcome test?
My first principle of educational psychology is: students tend to learn the things you teach them and don’t tend to learn the things you don’t teach them.
Imagine you ran a study where one group of students were given experiments to conduct involving balls and ramps in order to learn physics concepts and the other group was taught these concepts through teacher exposition. The students are then set a test that is all about conducting experiments with balls and ramps. The first group perform better and so the researchers conclude that the first approach is better for learning scientific inquiry skills.
This is flawed because the comparison group have had no opportunity to learn the thing being tested.
Often, research studies will include examples of test questions in a description of the method or appendix so look out for these.
6. Interpreting results
Unless you have pretty good statistics knowledge, you are probably limited to looking for statistical significance and effect sizes.
Statistical significance is usually given by a p-value (or by confidence intervals, which carry equivalent information). If p<0.05, this means that if the experimental condition actually had no effect, then the chance of obtaining results at least as extreme as those this experiment obtained is less than 5%. This is useful. However, if researchers ran 20 experiments on an approach that had no effect, you would expect about one result like this by chance alone (which is why publication bias is such a problem). Similarly, if 20 outcomes were measured, you’d expect one to have p<0.05 by chance. Very large numbers of participants are also likely to generate a small p, even when the effect is too small to matter. Finally, if the chance of the experiment having no effect is very small to begin with then the p-value is pretty meaningless.
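The arithmetic behind two of these cautions can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a recipe for analysing a real study: the 20 tests are assumed to be independent, and the sample-size example assumes a simple two-group z-test with equal group sizes and a known standard deviation. The function names and the effect size of 0.05 standard deviations are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests gives p < alpha
    when there is no real effect at all."""
    return 1 - (1 - alpha) ** n_tests

def p_value_for_tiny_effect(n_per_group, effect_in_sd=0.05):
    """Two-sided p-value from a two-group z-test when the observed difference
    is exactly `effect_in_sd` standard deviations (assumed known, equal groups)."""
    z = effect_in_sd * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))

expected_hits = 20 * 0.05                    # about one 'significant' result in 20
at_least_one = chance_of_false_positive(20)  # chance of one or more such results
print(f"{at_least_one:.2f}")                 # prints 0.64

# The same tiny difference with 100 and with 10,000 students per group.
for n in (100, 10_000):
    print(n, p_value_for_tiny_effect(n))
```

With 100 students per group the tiny difference is nowhere near significant; with 10,000 per group it comfortably passes p<0.05, even though the difference itself has not changed.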
Effect sizes have been much debated. I don’t think we can treat them all the same in the way that John Hattie does – effect sizes from poorly designed experiments tell us little because the effect is probably a result of the poor design. We also need to be aware that effect sizes tend to be smaller with older students than with younger ones, and with standardised tests as opposed to tests created by the researchers. There is nothing shady about this last point – standardised tests often don’t focus as much on the topic studied as part of the experiment. We just need to bear these points in mind.
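For readers who have not met one before, an effect size in this kind of research is very often a standardised mean difference such as Cohen’s d: the gap between the two group averages divided by the pooled standard deviation. The sketch below assumes that this is the measure a paper reports, which will not always be true, and the test scores are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(experimental, comparison):
    """Standardised mean difference between two groups, using the pooled
    sample standard deviation."""
    n1, n2 = len(experimental), len(comparison)
    s1, s2 = stdev(experimental), stdev(comparison)
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(experimental) - mean(comparison)) / pooled_sd

# Invented post-test scores, not taken from any real study.
experimental_scores = [14, 16, 15, 17, 18]
comparison_scores = [12, 14, 13, 15, 16]
d = cohens_d(experimental_scores, comparison_scores)
print(round(d, 2))
```

Because d is expressed in standard deviations, a test with a narrow spread of scores (such as a researcher-made test tightly matched to the taught content) can produce a larger d than a broad standardised test for the same underlying gain.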
Potentially, we can use effect sizes to make up for the problem of poorly controlled studies. If reading intervention A generates a larger effect compared to doing nothing than reading intervention B then we might conclude that A is superior to B. This isn’t as strong as if we had run both interventions in the same experiment.
You can read more of my thoughts on statistical tests here.
7. A meta-test
Finally, if you can’t establish whether a paper has been peer-reviewed or you can’t answer my points 2-5 above then I would take this as a bad sign.
The practice field
Why not try my analysis on the papers below: