# How to do punk research

There’s a statistics website that enables anyone to do their own punk research: estimationstats.com. Its output is of good enough quality to be used in published scientific papers like mine, but it is simple enough for anyone with a basic command of a spreadsheet like Excel to conduct their own experiment. Here’s a worked example.

Firstly, we need to do an experiment that is small and self-contained. Perhaps you want to find the effect of drawing diagrams versus making notes on students’ learning. Next, you need to ensure that you gain any relevant ethics approvals. If you are conducting this research through a university then you would need to follow their ethics process (which can be quite laborious). Otherwise, check with your school.

The next step is to randomly allocate students to one of the two conditions. A rudimentary way to do this is to go down your register and alternate the conditions alphabetically. A more sophisticated method uses a random number generator. It doesn’t really matter. Researchers often use a quasi-experimental design where students are not randomly assigned. Instead, the teacher of Class A does something different to the teacher of Class B and so on. I would avoid such a design if you can, because it introduces multiple confounds and the statistics we are going to use assume random allocation. So I would suggest doing something where different students in the same class form the two conditions.
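If you want to try the random number generator route, a minimal sketch in Python might look like the following. The register names, the `allocate` function and the seed are all invented for illustration; the only idea taken from the text is that each student ends up in one of two conditions, here labelled ‘D’ and ‘N’.

```python
import random

def allocate(register, labels=("D", "N"), seed=None):
    """Randomly allocate each student to one of two conditions.

    The register is shuffled, then labels are alternated down the
    shuffled list, so the group sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(register)
    rng.shuffle(shuffled)
    return {student: labels[i % 2] for i, student in enumerate(shuffled)}

# Hypothetical register, made up for this example.
register = ["Asha", "Ben", "Chloe", "Dev", "Ella", "Finn", "Grace", "Hugo"]
allocation = allocate(register, seed=42)

for student in register:
    print(student, allocation[student])
```

Fixing the seed simply means you can reproduce the same allocation later if anyone asks how it was done.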

In the example we are going to use, some will draw diagrams as they listen to teacher explanations and others will take notes. Make sure you label your conditions in such a way that you will remember what they are. In this case, we will use ‘D’ for drawing diagrams and ‘N’ for taking notes. This is preferable to ‘Condition 1’ and ‘Condition 2’ because the latter will force you to try to remember which one is which every time you return to the data.

Finally, you will need to test students. Make sure your test assesses the things that were taught during the experiment, as clearly and as objectively as possible. Avoid trying to add in questions that stray outside of these confines. You are unlikely to find an effect even with focused questions, so adding irrelevant questions will not help. You may wish to deliberately add transfer questions that assess the same deep structure in a different context, but I would advise making it easy to separate the score on these questions from the score on questions that are more like the learning materials. That way, if there is no effect on transfer, which is likely, you can still look at the non-transfer questions alone.
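One way to keep the two kinds of score separable is to tag each question by type before you start marking. The sketch below assumes a hypothetical five-question test; the question numbers, the `QUESTION_TYPE` mapping and the marks are all invented for illustration.

```python
# Hypothetical test layout: question number -> question type.
QUESTION_TYPE = {1: "taught", 2: "taught", 3: "taught", 4: "transfer", 5: "transfer"}

def subscores(marks):
    """Sum one student's marks separately for each question type.

    `marks` maps question number -> mark awarded on that question.
    """
    totals = {"taught": 0, "transfer": 0}
    for question, mark in marks.items():
        totals[QUESTION_TYPE[question]] += mark
    return totals

# Made-up marks for a single student.
example = subscores({1: 2, 2: 1, 3: 2, 4: 0, 5: 1})
print(example)  # {'taught': 5, 'transfer': 1}
```

You can then analyse the ‘taught’ column on its own, the ‘transfer’ column on its own, or their sum.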

So imagine that we do the experiment and get the following data (I have made this data up for the purpose of this example):

Now it is time to use estimationstats. First, select “Two Groups”:

You need to enter your data into the online spreadsheet. This comes populated with some mock data and the headings “Control” and “Test”:

You need to rewrite the headings, delete all of the mock data (make sure of this) and paste your data across. This is done with a Ctrl-C and Ctrl-V – I can’t get it to work by right-clicking on my mouse.

I recommend selecting a “Cohen’s d” effect size because this is the effect size that is most commonly understood in education:

The only other alteration I would make to the default settings is to label the y-axis with something a little more meaningful than “value”:

Then click “Analyse” and you should get something like this:

Beneath this, you will see a value for the effect size which, in this case, is an extraordinary d=-1.39. In other words, the mean of “Taking Notes” is 1.39 standard deviations lower than the mean of “Drawing Diagrams”. This is statistically significant at the p<.05 level and we can see this visually because the horizontal line that represents the mean of “Drawing Diagrams” does not overlap with the thick vertical line that extends upwards and downwards from the mean of “Taking Notes”. That vertical line represents the 95% confidence interval for the difference between the two groups, drawn at the level of the “Taking Notes” mean.
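If you want to sanity-check the effect size by hand, here is a rough sketch of Cohen’s d with a simple percentile bootstrap interval. The scores are invented for illustration and are not the data from this post. The pooled standard deviation formula is the textbook one, and the plain percentile bootstrap is an assumption on my part: the website may compute its interval in a more refined way, so do not expect the numbers to match exactly.

```python
import random
import statistics

def cohens_d(control, test):
    """Cohen's d: (mean of test - mean of control) / pooled standard deviation."""
    n1, n2 = len(control), len(test)
    v1, v2 = statistics.variance(control), statistics.variance(test)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(test) - statistics.mean(control)) / pooled_sd

def bootstrap_ci(control, test, n_resamples=5000, seed=0):
    """Percentile bootstrap 95% interval for Cohen's d (a simple sketch)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        try:
            estimates.append(cohens_d(c, t))
        except ZeroDivisionError:
            continue  # skip the rare resample with no spread at all
    estimates.sort()
    low = estimates[int(0.025 * len(estimates))]
    high = estimates[int(0.975 * len(estimates)) - 1]
    return low, high

# Hypothetical scores, made up for this example.
drawing = [14, 16, 15, 18, 17, 13, 16, 15]  # 'D' condition, treated as control
notes = [12, 11, 14, 10, 13, 12, 11, 13]    # 'N' condition, treated as test

d = cohens_d(drawing, notes)
low, high = bootstrap_ci(drawing, notes)
print(f"d = {d:.2f}, 95% CI from {low:.2f} to {high:.2f}")
```

A negative d here means the second group scored lower than the first, which is the same sign convention as in the result above.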

Estimationstats also gives a p-value, but it’s an unusual ‘non-parametric’ one. We won’t go into that here, but the estimationstats website provides a link if you want to read more about it.
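For the curious, one common non-parametric approach is a permutation test. The sketch below illustrates that general idea only; I am not claiming it reproduces the website’s calculation. The function name, the number of permutations and the scores are all assumptions made up for the example.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means.

    Repeatedly shuffles the pooled scores into two groups of the original
    sizes and counts how often the shuffled difference is at least as
    large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_b) - statistics.mean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical scores, made up for this example.
drawing = [14, 16, 15, 18, 17, 13, 16, 15]
notes = [12, 11, 14, 10, 13, 12, 11, 13]

p = permutation_p_value(drawing, notes)
print(f"p = {p:.4f}")
```

The logic mirrors random allocation: if the condition made no difference, any reshuffling of students between the groups would be as likely as the one you actually got.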

If you do your own experiment, you are unlikely to find anything as clear-cut as this, but if you do then you should probably let us all know.

Right. Over to you.


## 11 thoughts on “How to do punk research”

1. Jamie says:

Seems like a great tool and a super way of getting started with gathering evidence in context. Thanks for sharing.

I am genuinely curious about the following:
Why no pretest?
With a sample of this size or even twice the size, wouldn’t it be plausible that even with random distribution, one group could easily be unintentionally skewed in terms of ability level?
Wouldn’t you need to demonstrate a fair comparison first rather than assuming it? Or judging it by relative improvement rather than final assessment mark/grade?

Thanks for your time.

• You could do a pre-test and that would add information and validity, but it is not necessary because the statistical tests take account of the random allocation of students to groups. In other words, it is already assumed by these tests that one group will be superior to the other by chance. For ‘punk’ research, it is reasonable to keep things as simple as possible. There is also an additional reason – by adding more data, you become more prone to p-hacking. For instance, the EEF Philosophy for Children trial in its initial protocol intended to do something similar to what I have outlined above. However, when they found no effect, researchers went back to the Key Stage 1 data for the students (I think) which is similar to a pre-test and suggested that the students who had P4C made greater relative gains.

• Stephen Gorard says:

This comment on a P4C trial is an outright lie and should be removed.

• I certainly believe it to be true. Am I wrong? Was the final analysis used one that was specified in the original protocol?

• Stephen Gorard says:

If you had said you merely ‘believed’ it to be true, your belief would be incorrect, but there is no real way of controlling what anyone believes via faith. You stated it as a fact as though this was something that actually happened – that we waited until the results were in and changed the analysis to suit some unspecified purpose (you actually referred to it as p-hacking although it is clear that there is no p involved). None of this happened. You have simply made it up.

• What precisely are you objecting to? I assume you ran the original analyses specified in the protocol.

• Stephen Gorard says:

Please do not play dumb. You said “when they found no effect, researchers went back to the Key Stage 1 data for the students (I think) which is similar to a pre-test and suggested that the students who had P4C made greater relative gains”. In fact, the decision to use gain scores was taken at the outset in discussion with EEF, because the pre-intervention data was judged not balanced between groups. This was the first ever EEF effectiveness trial, and there were no published protocols at that time. I have explained this to you before. What you said happened just did not happen – even if you genuinely believe it for some personal reason. Your on-line statement is not just incorrect. It is defamatory for no reason. It should be taken down, if you care about evidence (not your belief). You could replace it with a statement saying that while a post-test only was preferred (I always prefer where possible) this was changed to gain scores, and you might even say you have some concern about this (but see my 2017 trials book where the reasons are fully explained).

• You have never explained this to me before. In fact, you have tended to be rather dismissive and not answer my questions. The protocol sets out two sets of tests. On these measures, there was no difference between the intervention and control groups. In my view, that should have been the end of it. If you are now claiming that the gain scores measure was introduced after the protocol was written but before the final results were in then this is news to me. I accept that. However, in the absence of this information I think the conclusion I drew was a reasonable one. I have not read your book.

• Stephen Gorard says:

As I said please do not play dumb. You did not say that you drew a ‘conclusion’ that is one, perhaps the most negative, of many possible explanations why you do not understand what happened. You told your readers, as though it were a fact, that we waited until we had the results and only then changed our analysis. You have no evidence that this happened. It did not happen. It is, as I said at the outset, an outright lie. Ask EEF if you still do not really accept that. I cannot be bothered to trawl back to tweets where I told you all this before. You seem to be conducting some kind of vendetta (against P4C?). Leave it there. We are content. Maybe read more.

• The whole point of pre-registering trials and specifying the measures that will be used at the end of the trial is to stop researchers looking at the data after it has been collected and then running the most advantageous analysis. In this case, given that you agreed on the gains measure from the outset, it seems deeply unfortunate that this was not in the published protocol that you wrote. In that context, my interpretation was entirely reasonable. However, I now accept your claims about this which, to the best of my knowledge, you have never made to me before.

As for P4C, I do not believe it has an effect on reading scores or whatever because it is a distal intervention with no proposed mechanism of action and with any inferred mechanism of action being implausible. Due to this piece of research, the EEF has now thrown millions at a scale-up study. Maybe that will show something. Maybe it will not.
