My thoughts on meta-analysis have changed over the years. My first encounter was through John Hattie’s 2008 book, Visible Learning, and I was aware of some key objections from the outset because Hattie outlines them himself before going on to rebut them. The Hattie hypothesis is that even though the studies he uses have different subjects, designs, durations, assessment instruments and so on, this all washes out if you focus on interventions with effect sizes above the typical value of about 0.4 standard deviations. To a teacher new to research, this seemed like a reasonable argument.
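For readers new to the metric, an effect size here is a standardised mean difference (Cohen’s d): the gap between the intervention and comparison group means, expressed in pooled standard deviation units. As a minimal sketch of the usual formula, assuming each study reports group means, standard deviations and sample sizes:

$$
d = \frac{\bar{x}_{\text{intervention}} - \bar{x}_{\text{control}}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
$$

On this scale, Hattie’s hinge point of 0.4 means the intervention group outscores the comparison group by roughly four-tenths of a pooled standard deviation.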
In the intervening years, I have realised just how noisy education data is and how much effect sizes depend on particular conditions. If you want a large effect size, run a quasi-experiment with a narrow ability range of very young children and make sure that you design the outcome assessment yourself. On the other hand, if you run a properly conducted randomised controlled trial (RCT) with older teenagers and assess them with a standardised test, you are doing well to get any positive effect at all. You can’t just mush all these different kinds of studies together as if they are equivalent.
An evolution of Hattie’s approach might be to find typical effect sizes for all of the different kinds of studies, subjects and assessments. We could then modify the effect size threshold that we look for. So 0.4 might be appropriate for quasi-experiments in primary schools, for instance, and 0.1 might be the threshold for RCTs with high school students. We could draw up a table. I don’t see anyone working on this and I think it would be challenging: in many areas we would lack enough studies to establish these typical levels, and even once we had categorised study types, a bad study may still generate a larger effect than a good study of the same kind.
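To make the idea concrete, here is a minimal sketch in Python of what consulting such a table might look like. The categories, function name and values are purely illustrative: the only thresholds included are the two hypothetical figures mentioned above, and no such benchmark table actually exists.

```python
# Hypothetical sketch: judge a study's effect size against a benchmark for
# studies of its kind, rather than against a single 0.4 hinge point.
# The two threshold values are the illustrative figures from the text above.
BENCHMARKS = {
    ("quasi-experiment", "primary"): 0.4,
    ("rct", "high school"): 0.1,
}

def exceeds_benchmark(effect_size: float, design: str, phase: str) -> bool:
    """Return True if the effect size beats the typical level for its study type."""
    key = (design.lower(), phase.lower())
    if key not in BENCHMARKS:
        raise ValueError(f"No benchmark available for {key}")
    return effect_size > BENCHMARKS[key]

# A 0.25 effect from an RCT with high school students clears its own benchmark,
# even though it falls well short of the 0.4 hinge.
print(exceeds_benchmark(0.25, "RCT", "high school"))  # True
```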
An alternative to the Hattie approach is to be very selective about the studies that you look at by excluding those that don’t meet set criteria. We might call this the What Works Clearinghouse method. And yet this poses its own problems. You can end up making statements that ‘there is no evidence to support’ something when what you actually mean is ‘there is no evidence that fits our selection criteria’. Such filtering leaves out a lot of evidence that may be indicative, even if it is not the final say on the matter. As a consumer of research, I would rather know about these inferior studies, alongside their limitations, than be led into thinking that no such studies exist.
I also want to know about correlational studies, single-subject research or any other kind of research that might have a bearing on my decision-making. I was blown away a few years ago to read about research conducted by Kevin Wheldall and colleagues on seating arrangements. Briefly, they changed a classroom’s usual arrangement of rows or groups to the alternative arrangement before changing it back again, all the while monitoring on-task behaviour. It makes quite a compelling case for the value of seating students in rows, but it is not a traditional RCT or quasi-experiment. Do we want to filter out all such research and fool ourselves into thinking it has never been conducted? Inclusion and exclusion criteria like these may be what led to the strangely ahistorical nature of educational research that I once saw Thomas Good, a veteran of process-product research, complain about at the ICSEI conference in Cincinnati.
I am fairly clear on what we should be doing to resolve these issues in the medium term. If we want to know whether teaching calculus through dance is better or worse than teaching it on horseback, then we should run an RCT with three conditions: dancing, horseback and the standard approach. This is far superior to comparing the effect size of dancing versus its own control with the effect size of horseback versus a separate control drawn from a different study, because within a single trial we know that all the other conditions are the same. There are still dangers, and we still need to look at the detail to ensure that we have compared the approaches fairly, but this is something that we should be able to do.
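As a rough illustration of why the shared control matters, here is a short simulation sketch in Python. All of the numbers are invented for the purpose of the example: within one three-arm trial, both effect sizes are computed against the same control group, whereas two separate trials can have comparison groups of different strengths, which distorts any comparison between their effect sizes.

```python
# Simulation sketch with invented numbers: compare two interventions within one
# three-arm trial (shared control) versus across separate trials (different controls).
import numpy as np

rng = np.random.default_rng(0)

def cohens_d(a, b):
    """Standardised mean difference between two groups, using the pooled SD."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# One trial, three arms: the control is common to both comparisons.
control   = rng.normal(50, 10, 200)
dancing   = rng.normal(52, 10, 200)   # assumed small benefit
horseback = rng.normal(51, 10, 200)   # assumed smaller benefit

print("dancing   vs shared control:", round(cohens_d(dancing, control), 2))
print("horseback vs shared control:", round(cohens_d(horseback, control), 2))

# Two separate trials: a weaker comparison condition in the horseback study
# inflates its apparent effect and wrecks the between-study comparison.
weak_control = rng.normal(47, 10, 200)
print("horseback vs weaker control:", round(cohens_d(horseback, weak_control), 2))
```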
When it comes to summaries of existing research, these must be entirely qualitative. There should be no attempt to compute an overall effect size or ‘months of additional progress’ in the manner of England’s Education Endowment Foundation. Such a figure is meaningless, particularly when we might dispute whether different studies represent examples of the same broad strategy or not. Instead, we need what is effectively a literature review of all the relevant evidence, with its strengths and weaknesses. This should be open to continual review and discussion, and it will be quite an art to keep it pithy while still accurately capturing both the history and the current state of evidence in a particular area. Nevertheless, it is a venture worth embarking upon. There is no valid alternative.