Credit: HJ Media Studios

Welcome To The Era of Big Replication

Psychologists have been sailing through some pretty troubled waters of late. They’ve faced several cases of fraud, high-profile failures to repeat the results of classic experiments, and debates about commonly used methods that are recipes for sexy but misleading results. The critics, many of whom are psychologists themselves, say that these lines of evidence point towards a “replicability crisis”, where an unknown proportion of the field’s results simply aren’t true.

To address these concerns, a team of scientists from 36 different labs joined together, like some sort of verification Voltron, to replicate 13 experiments from past psychological studies. They chose experiments that were simple and quick to do, and merged them into a single online package that volunteers could finish in just 15 minutes.

They then delivered their experimental smorgasbord to 6,344 people from 36 different groups and 12 different countries.

This is Big Replication—scientific self-correction on a massive scale.

I’ve written about this “Many Labs Replication Project” over at Nature News, so head over there for more details and viewpoints from the psychological community. The project was coordinated by Richard Klein and Kate Ratliff from the University of Florida, Michelangelo Vianello from the University of Padua, and Brian Nosek from the Center for Open Science.

First, 10 of the 13 effects replicated. That’s certainly encouraging after months of battering.

One of the 13 was on the fence—the “imagined contact” effect, where imagining contact with people from other ethnic groups reduces prejudice towards them. It’s hard to say whether this is real or not.

And two of the 13 effects failed to replicate outright. Both are recent studies involving social priming, the field in which subtle and subconscious cues supposedly influence our later behaviour. In one, exposure to a US flag increases conservatism among Americans; in the other, exposure to money increases endorsement of the current social system.

For Nosek, personally, the results are a mixed bag. Two of his own effects were in the mix and they both checked out. Many classics in the field are robust. This is all good. But a lot of Nosek’s own work involves social priming, and the fact that this sub-field regularly (but not always) stumbles in the replication gauntlet is troubling to him. “This [has] been difficult for me personally because it’s an area that’s important for my research,” he says. “But I choose the red pill. That’s what doing science is.”

But he and others I spoke to also urge caution. This is neither a “te absolvo” for the field, nor a final damnation of social priming. The team chose the 13 effects arbitrarily, to represent a range of different psychological studies from different eras. It doesn’t mean that 10 out of every 13 effects will replicate, nor that 2 out of every 2 social priming ones will flunk. It’s not systematic. (Nosek, incidentally, is also leading a systematic check of reproducibility in psychology, in which more than 150 scientists are repeating every study published in four journals in 2008. The man is front and centre in this debate.)

To focus too much on the results would miss the point. The critical thing about the Many Labs Project is its approach.

Replications are really important, and there aren’t enough of them in psychology. But single, one-off replications can add more heat than light. If you can’t replicate an earlier study, the knee-jerk reaction is to say that the original was flawed. Alternatively, you could be incompetent. Or you could have changed the original experiment in important ways. Or your new study may be too small. Or you might have studied a completely different group of people. The authors of the original study can always hit back with these objections, and they would not be wrong to.

So, some scientists run meta-analyses: big mega-studies where they look at the results of past experiments and tease out the overall picture. If one replication attempt is inconclusive, what do all of them say together? But meta-analyses also have flaws. If researchers don’t publish their failed replications (and until recently, it was really hard to), the meta-analysis will be badly skewed. And if everyone used slightly different methods, the results will still be inconclusive.
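To see how unpublished failures can skew a meta-analysis, here is a minimal sketch in Python. It pools effect sizes with a standard fixed-effect (inverse-variance) average, which is one common way of combining studies and not necessarily what any particular meta-analysis used. The six "studies" are hypothetical numbers made up for illustration, not data from the Many Labs project, and the publication filter (only clearly positive results get published) is an assumption chosen to keep the example simple.

```python
import math


def pooled_effect(effects, std_errors):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    estimate = sum(w * d for w, d in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


def summarize(studies):
    """Pool a list of (effect size, standard error) pairs."""
    effects, std_errors = zip(*studies)
    return pooled_effect(effects, std_errors)


# Hypothetical (effect size, standard error) pairs, for illustration only.
all_studies = [
    (0.45, 0.20),
    (0.02, 0.15),
    (-0.05, 0.18),
    (0.50, 0.22),
    (0.01, 0.16),
    (-0.10, 0.20),
]

# Assumed "file drawer" rule: only attempts whose z-score
# (effect / standard error) exceeds 1.96 make it into print.
published = [(d, se) for d, se in all_studies if d / se > 1.96]

full_estimate, full_se = summarize(all_studies)
published_estimate, published_se = summarize(published)

print(f"All {len(all_studies)} studies pooled: {full_estimate:.2f}")
print(f"Only the {len(published)} published:  {published_estimate:.2f}")
```

With these made-up numbers, the pooled estimate from the published studies alone comes out several times larger than the estimate from all six, even though nothing about any single study changed. Only the selection did.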

The Many Labs project has none of these problems because, as Daniel Simons told me, it is a planned meta-analysis. They did many checks all at once, and nothing was hidden away if it didn’t “work”. They consulted with the original authors where possible. They ran the exact same experiment on all of their different samples. They tested a far larger group of people than any of the original experiments (and replication attempts generally need to be bigger than the original studies that they’re checking). And they pre-registered their methods: everything was agreed before a single volunteer was recruited, leaving no room for sneaky data-massaging.

The result is a definitive assessment of the 13 effects. The priming ones didn’t check out. At the other extreme, Nobel laureate Daniel Kahneman comes out of this very well. His classic anchoring effect, in which the first piece of information we get can bias our later decisions, turns out to be much stronger than he estimated in his original experiments.

The Many Labs sample was also diverse, which tells us whether the effects being scrutinised are delicate flowers that only blossom in certain situations, or robust blooms that grow everywhere. This is important, because some psychologists like Joe Cesario from Michigan State University have argued that effects like social priming ought to vary in different contexts, or across different individuals.

I contacted Cesario, and he clarified: “At no point did I make the claim that all effects, or even all priming effects, will vary by laboratory, region, etc. The point was to appreciate the possibility that some priming effects might vary by underappreciated context variables… Absent cross-lab replication, priming researchers cannot make extreme claims about the widespread nature of priming effects.”

In the Many Labs Project, none of the 13 effects varied according to the nationality of the participants, or whether they did the experiments online or in a lab. Kahneman’s work checked out everywhere, and the priming studies failed everywhere. Cesario adds, “The ManyLabs project correctly tells us that [the two effects that did not replicate] aren’t really effects that we as a discipline should care about because they have no generalizability beyond that unique situation.”

It is very telling that everyone I spoke to praised the initiative, including the authors whose work did not replicate. There was none of the acrimony that has stained past debates. When something is done this well, it’s pretty churlish to not accept the results.

This is a harbinger of things to come.

Simons is coordinating a similar multi-lab replication attempt of Jonathan Schooler’s verbal overshadowing effect, in which verbally describing something like a face impairs our recognition of that thing. The effect has been famously tricky to repeat, and Simons says “Our goal is to measure the actual effect size as accurately as possible by conducting the same study in many laboratories.” The results will be published in the journal Perspectives on Psychological Science next spring. “This multi-lab paper provides a preview of what I hope will become a standard approach in psychology.”