A failed replication draws a scathing personal attack from a psychology professor

By Ed Yong

Published March 10, 2012

John Bargh, a psychologist at Yale University, has published a scathing attack on a paper that failed to replicate one of his most famous studies. His post, written on his own blog on Psychology Today, mixes critiques of the science within the paper with personal attacks against the researchers; against PLOS ONE, the journal that published it; and against me, who covered it. I’m going to take a closer look at Bargh’s many objections.

The background

First, a recap. The original study, published in 1996, is indeed a classic. According to Google Scholar, it has been cited almost 2,000 times. Here’s how I described it in my post:

John Bargh and his colleagues found that infusing people’s minds with the concept of age could slow their movements (PDF). The volunteers in the study had to create a sentence from scrambled words. When the words related to being old, the volunteers walked more slowly when they left the laboratory. They apparently didn’t notice anything untoward about the words, but their behaviour changed nonetheless.

Surprisingly, this prominent result has seldom been replicated. There have been two attempts but neither stuck closely to the original experiment. This prompted Stephane Doyen and colleagues to try and repeat Bargh’s study. They tried to match the original set-up, but they made some tweaks: they timed volunteers with infrared sensors rather than a stopwatch; they doubled the number of volunteers; and they recruited four experimenters who carried out the study, but didn’t know what the point of it was. As I wrote:

This time, the priming words had no impact on the volunteers’ walking speed. They left the test room neither more slowly nor more quickly than when they arrived. Doyen suspected that Bargh’s research team could have unwittingly told their volunteers how they were meant to behave… Perhaps they themselves moved more slowly if they expected the volunteer to do so. Maybe they spoke more languidly, or shook hands more leisurely… Maybe they were responsible for creating the very behaviour they expected to see.

To test that idea, Doyen repeated his experiment with 50 fresh volunteers and 10 fresh experimenters. The experimenters always stuck to the same script, but they knew whether each volunteer had been primed or not. Doyen told half of them that people would walk more slowly thanks to the power of priming, but he told the other half to expect faster walks.

…He found that the volunteers moved more slowly only when they were tested by experimenters who expected them to move slowly… Let that sink in: the only way Doyen could repeat Bargh’s results was to deliberately tell the experimenters to expect those results.

Was this possible? In Bargh’s study, an experimenter had packed envelopes with one of two different word tasks (either elderly-related or neutral words). When each volunteer arrived, the experimenter chose an envelope at random, led the volunteer into a test room, briefed them, and then left them to finish the task.

Doyen thinks that, during this time, the experimenter could have seen which set of tests the volunteer received, and tuned their behaviour accordingly. This was not a deliberate act of manipulation, but could easily have been an unconscious one. He wrote, “This possibility was in fact confirmed informally in our own study, as we found that it was very easy, even unintentionally, to discover the condition in which a particular participant takes part by giving a simple glimpse to the priming material.”

In his new post, Bargh dismisses Doyen’s experiments on two technical points, and other personal ones. Let’s consider each in turn.

Bargh’s objections – blinding

First, he says that “there is no possible way” that the experimenter in his study could have primed the volunteers with his own expectations. He says that the experimenter “was blind to the study hypotheses” (meaning that he didn’t know what the point of the experiment was). Bargh adds, “The person who had actual contact with the participants in the elderly priming study never saw the priming manipulation… and certainly did not know whether the participant was in the elderly priming or the control condition.”

Could the experimenter have known what the experiment was about, even though Bargh asserts that they were blind? In the comments section of Bargh’s post, another psychologist, Matt Craddock, notes that the experimenter was also responsible for pre-packaging the various tasks in their packets, and so had ample time to study the materials. (This is the first of several inconsistencies in Bargh’s interpretation of his own study; more on that later.)

Could the experimenter have primed the volunteers? It’s not clear. This hinges on what actually happened in the test room, and we only have Bargh’s word on this. There is precious little in the way of description in the actual paper (here it is as a PDF; let me know if I’ve missed something). As such, the paper does not seem to be at odds with Doyen’s vision of what happened, although it does not provide evidence for it either.

Bargh’s objections – differences between the two studies

Bargh’s second objection (in many parts) is that Doyen’s study had differences from his own, which would have eliminated the elderly-priming effect. However, in all of these cases, Craddock and other commenters have pointed out inaccuracies in his statements.

For example, he says that after the test, Doyen instructed his volunteers to “go straight down the hall when leaving” (his quotes), while he “let the participant leave in the most natural way”. This is important because drawing someone’s attention to an automatic process tends to eliminate that effect. But Doyen did nothing of the sort, and his paper never contains the words that Bargh quoted. Instead, Doyen wrote, “Participants were clearly directed to the end of the corridor”. It is not clear how this differs from Bargh’s own study where “the experimenter told the participant that the elevator was down the hall”.

Bargh also says that Doyen used too many age-related words in his word task. The volunteers might have noticed, cancelling out the effect of the priming. But this contradicts what Bargh says in his own methods paper, where he says that if there are too many primes, volunteers would be more likely to perform as expected. By that reasoning, Doyen’s volunteers should have shown an even stronger effect.

Bargh says that priming depends on there being something to prime. Volunteers would only walk more slowly if they associated old age with infirmity. He says, “Doyen et al. apparently did not check to make sure their participants possessed the same stereotype of the elderly as our participants did.” However, neither did Bargh. His original study says nothing about assessing stereotypes. [Update: actually, I note that Doyen et al chose their priming words by using the most common answers in an online student survey where people reported adjectives related to old age; that’s at least a tangential way of assessing stereotypes.]

“To adapt the items, we conducted an online survey (80 participants) in which participants had to report 10 adjectives related to the concept of old age. Only the most frequent responses were used as replacement words.” (i.e., as primes)

Bargh says that Doyen used the same experimenter who administered the test to time how slowly the volunteers walked down the hall. This is also false – they used infrared sensors.

What do Doyen’s team have to say about Bargh’s criticisms? They support Craddock’s analysis. And one of the authors, Axel Cleeremans, says:

“The fact is that we failed to replicate this experiment, despite having twice as many participants and using objective timing methods. Regardless of the arguments one may come up with that explain why his study worked and ours did not, this suggests that unconscious behavioural priming is not as strong as it is cast to be. If the effect were truly robust, it shouldn’t depend on minute differences. The fact that we did manage to replicate the original results when both experimenters and participants were appropriately primed suggests interesting avenues for further research and should be taken as an opportunity to better delineate the conditions under which the effect is observed.”

Bargh’s objections – er, the other stuff

As stated before, Bargh also directs personal attacks at the authors of the paper (“incompetent or ill-informed”), at PLoS (“does not receive the usual high scientific journal standards of peer-review scrutiny”), and at me (“superficial online science journalism”). The entire post is entitled “Nothing in their heads”.

Yes, well.

I’ve dealt with the scientific aspects of the critique; I think we’re all a bit too old to respond to playground tactics with further puerility. The authors certainly aren’t rising to it. In an email to me, Doyen wrote, “This entire discussion should be about the reasons that best explain the differences between his findings and ours, but has somehow turned into something else that unhelpfully confuses personal attacks with scientific disagreement as well as scientific integrity with publishing politics.” And PLoS publisher Peter Binfield has already corrected Bargh’s “several factual errors” about their journals.

For my part, I’m always happy to correct myself when I’ve screwed up in my reporting. Here, I believe I did my due diligence. Contrary to accusations at the time, I read both the Bargh and Doyen papers. I contacted other psychologists for their view, and none of them spotted egregious technical flaws. More importantly, I sent the paper to Bargh five days before the embargo lifted and asked for a comment. He said, “There are many reasons for a study not to work, and as I had no control over your [sic] attempt, there’s not much I can say.” The two-page piece he has now posted would seem to falsify that statement.

After some reflection, I largely stand by what I wrote. I can’t see much in the original study or in Bargh’s critique that would have caused me to decide not to cover it, or to radically change my approach. There is one thing, though. Someone (on Twitter; sorry, I can’t find the link) noted that a single failure to replicate doesn’t invalidate the original finding, and this is certainly true. That’s something I could have made more explicit in the original post, maybe somewhere in the fourth paragraph. Mea culpa.

Replicate, good times, come on! (It’s a replication… it’s a replicatio-o-on)

There is a wider issue here. A lack of replication is a large problem in psychology (and arguably in science, full stop). Without it, science has lost a limb. Results need to be checked, and they gain strength through repetition. On the other hand, if someone cannot repeat another person’s experiments, that raises some serious question marks.

Scientists get criticised for not carrying out enough replications – there is little glory, after all, in merely retreading old ground rather than breaking new ground. Science journals get criticised for not publishing these attempts. Science journalists get criticised for not covering them. This is partly why I covered Doyen’s study in the first place.

In light of this “file drawer problem”, you might have thought that replication attempts would be welcome. Instead, we get an aggressive and frequently ill-founded attack on everyone involved in such an attempt. Daniel Simons, another noted psychologist, says, “[Bargh’s] post is a case study of what NOT to do when someone fails to replicate one of your findings.”

Others have suggested that the Bargh study has many positive replications, but this is in question. In his post, Bargh speaks of “dozens if not hundreds of other conceptual replications”. He says that the “stereotype priming of behavior effect has been widely replicated”, and cites the well-established “stereotype threat” effect (which I have also written about). He implores responsible scientists and science journalists to not “rush to judgment and make claims that the entire phenomenon in question is illusory”.

I’m not sure which scientists or science journalists he is referring to. Neither Doyen nor I implied that the entire concept of priming was illusory. I specifically said the opposite, and quoted two other psychologists who did the same. The issue at stake is whether Bargh’s results from that one specific experiment could be replicated. They could not.

Notably, one site – PsychFileDrawer – is trying to rectify the file drawer problem by providing psychologists with a “quick and easy way” to post the results of replication attempts, whether positive or negative. Hal Pashler, who created the site, has also reportedly tried to replicate Bargh’s study and failed.

If there’s an element to this farrago that heartens me, it’s that the comments in Bargh’s piece allowed various parties to set the record straight. In concluding his piece, Bargh says, “I’m worried about your ability to trust supposedly reputable online media sources for accurate information on psychological science.” Well, dear professor, this is the era of post-publication peer review. I’m not that worried.