“Replication” is in the Eye of the Beholder


One of the best episodes of The Twilight Zone centers on a woman living as a patient in a shadowy hospital. She recovers from her eleventh plastic surgery procedure, her face completely covered by bandages. Doctors and nurses, their own faces obscured, whisper about her apparently ghastly appearance. The intensity of the episode builds until finally, the bandages are removed. Realizing the surgery failed yet again, the lead doctor yells out in horror, “No change!”

The twist: the audience is shown the face of a beautiful woman while the doctors and nurses are revealed to have grotesque, pig-like faces. The audience is left dumbstruck and in a state of confusion, yet the heavy-handed message is clear: perception is everything.

I was reminded of this episode as I observed the melodramatic reactions to my recently released “conceptual replication” of the famous Marshmallow Test. The study, which I co-authored with Greg Duncan and Haonan Quan, has been covered elsewhere, so I won’t rehash the findings in great detail.

In essence, the original Marshmallow Test research (conducted by psychologist Walter Mischel) reported that kids who were able to delay gratification at a young age were likely to be more successful later in life across a large set of measures (e.g., SAT scores, behavioral outcomes, and body mass index). In our “conceptual replication,” we attempted to re-examine these findings in a larger and more diverse sample of children than the sample used by Mischel. We found some evidence suggesting that success on the Marshmallow Test was predictive of later life success, but we also found that most of the relation between delay of gratification and later outcomes was driven by other factors in a child’s life, such as socioeconomic status, early cognitive ability, and parenting.

The response to our work has been much more heated than I expected, as many rushed to label our paper as another “failed” replication of a beloved study from Psychology. It probably did not help matters that findings from several other well-known Psychology studies were also called into question by recently published replications. Indeed, the famous 30-million word gap was scrutinized in a new Child Development piece, and a large-scale trial testing the concept of ego depletion recently found little evidence in support of the famous theory.

The popular conclusion was clear: these replications have thrown a grenade into the classic Psychology literature and laid waste to every Psychology 101 course in the process!

Yet, after you wade through the click-bait headlines and mic-drop tweets about these studies, you are left with a much less dramatic picture. In each case, the new study has simply moved us a little more toward complexity, and a little further away from a position of certainty. What’s wrong with acknowledging that? In fact, if you read each of these studies carefully, you start to ponder the definition of the word “replication.”

When articles are written about the “replication crisis” (which is a term all too perfect for an age where complex science debates are reduced to 140-character arguments on social media), we imagine a scientist taking the methods from a published study and replicating the exact steps again with a different sample. If the results change, then the original study must have been false. The crisis builds!

In many areas of social science, particularly in fields like Developmental Psychology and Education, replications rarely work that way. In our study, the methods we used differed from the original work. This was partly due to conscious design decisions that we adopted and partly due to limitations of the dataset we had to work with. We did not attempt to conduct a point-by-point replication of the original Marshmallow Test work, because we were not concerned with “disproving” the older studies. Rather, we were interested in examining the conclusions that had been drawn in the years since the foundational studies were published — and here’s where perception matters.

If you thought that the original work showed that the Marshmallow Test illuminated interesting variation in young children and that this variation correlated with later life outcomes, then our study found some evidence supporting this. In contrast, if you thought that the original work suggested that delay of gratification was a critical skill that was well-captured by the Marshmallow Test and could alone shape a child’s life course, then we found evidence against this.

When our new results are interpreted, they should be viewed alongside the older studies, and should be seen as adding shades of complexity to what we already know — not interpreted as definitive proof that the original work was false. In many cases, treating a single study with the word “replication” in the title as definitive proof of anything falls prey to the same error that led us to over-interpret the results from the original work. This leads to a back and forth that will leave us spinning our wheels rather than developing a knowledge base that furthers our understanding of human cognition and behavior. We should become comfortable with uncertainty and complexity in our research, and the language we use to discuss our findings should reflect this.

The next time we unwrap the findings from a new replication of a social science study, let’s resist the temptation to rush to judgment like the doomed characters in that classic Twilight Zone episode. Instead, we should remember that any judgment over whether a study “passed” or “failed” replication probably depends on what you perceived the results of the original study to mean in the first place.


Dr. Tyler Watts is a Research Assistant Professor and Postdoctoral Scholar in the Steinhardt School of Culture, Education and Human Development at New York University.