Reading “The Baby Factory” in context

[Photo: cherry orchard. Credit: Des Blenkinsopp.]

Yesterday I put up a post about David Peterson’s The Baby Factory, an ethnography of three baby labs based on his experience as a participant observer. My post was mostly excerpts, with a short introduction at the beginning and a little discussion at the end, mainly to encourage people to go read it. (It’s open-access!)

Today I’d like to say a little more.

How you approach the article probably depends a lot on what background and context you come to it with. It would be a mistake to look to an ethnography for a generalizable estimate of something about a population, in this case about how common various problematic practices are. That’s not what ethnography is for. But at this point in history, we are not lacking for information about the ways we need to improve psychological science. There have been surveys and theoretical analyses and statistical analyses and single-lab replications and coordinated many-lab replications and all the rest. It’s getting harder and harder to claim that the evidence is cherry-picked without seriously considering the possibility that you’re in the middle of a cherry orchard. As Simine put it so well:

even if you look at your own practices and those of everyone you know, and you don’t see much p-hacking going on, the evidence is becoming overwhelming that p-hacking is happening a lot. my guess is that the reason people can’t reconcile that with the practices they see happening in their labs and their friends’ labs is that we’re not very good at recognizing p-hacking when it’s happening, much less after the fact. we can’t rely on our intuitions about p-hacking. we have to face the facts. and, in my view, the facts are starting to look pretty damning.

You don’t even have to go as far as Simine or me. You just have to come to the ethnography with a realistic belief that problematic practices occur at least often enough to be worrisome. And then the ethnography does what ethnographies do, and does it well in my judgment: it illustrates what these things look like, out there in the world, when they are happening.

In particular, I think a valuable part of Peterson’s ethnography is that it shows how problematic practices don’t just have to happen furtively by one person with the door closed. Instead, they can work their way into the fabric of how members of a lab talk and interact. When Leslie John et al. introduced the term questionable research practices, they defined them as “exploitation of the gray area of acceptable practice.” The Baby Factory gives us a view into how that can be a social process. Gray zones are by definition ambiguous; should we be shocked to find out that people working closely together will come to a socially shared understanding of them?

Another thing Peterson’s ethnography does is talk about the larger context where all this is happening, and try to interpret his observations in that context. He writes about the pressures for creating a public narrative of science that looks sharp and clean, about the need to make the most of very limited resources and opportunities, and about the very real challenges of working with babies (the “difficult research objects” of the subtitle). A commenter yesterday thought he came to the project with an axe to grind. But his interpretive framing was very sympathetic to the challenges of doing infant cognition research. And his concluding paragraphs were quite optimistic, suggesting that the practices he observed may be part of a “local culture” that has figured out how they can promote positive scientific development. I wish he’d developed that argument more. I don’t think infant cognition research has lacked for important scientific discoveries — but I would say it is in spite of the compromises researchers have sometimes had to make, not because of them.

I do think it would be a mistake to come away thinking this is something limited to infant cognition research. Peterson grounds his discussion in the specific challenges of studying babies, who have a habit of getting distracted or falling asleep or putting your stimuli in their mouths. Those particular problems may be distinctive to having babies as subjects, and I can understand why that framing might make baby researchers feel especially uncomfortable. But anybody who is asking big questions about the human mind is working with a difficult research object, and we all face the same larger pressures and challenges. There are some great efforts under way to understand the particular challenges of research practice and replicability in infant research, but whatever we learn from that is going to be about how broader problems are manifesting in a specific area. I don’t really see how you can fairly conclude otherwise.

An eye-popping ethnography of three infant cognition labs

I don’t know how else to put it. David Peterson, a sociologist, recently published an ethnographic study of three infant cognition labs. Titled “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” it recounts his time spent as a participant observer in those labs, attending lab meetings and running subjects.

In his own words, Peterson “shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor.” The account of how the labs try to “bridge the distance” reveals one problematic practice after another, in a way that sometimes makes them seem like normal practice and no big deal to the people in the labs. Here are a few examples.

Protocol violations that break blinding and independence:

…As a routine part of the experiments, parents are asked to close their eyes to prevent any unconscious influence on their children. Although this was explicitly stated in the instructions given to parents, during the actual experiment, it was often overlooked; the parents’ eyes would remain open. Moreover, on several occasions, experimenters downplayed the importance of having one’s eyes closed. One psychologist told a mother, “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”

Optional stopping based on data peeking:

Rather than waiting for the results from a set number of infants, experimenters began “eyeballing” the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. During lab meetings and one-on-one discussions, experiments that were “in progress” and still collecting data were evaluated on the basis of these early results. When the preliminary data looked good, the test continued. When they showed ambiguous but significant results, the test usually continued. But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.
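To see why peeking like this is a problem, here is a quick simulation (my own sketch, not from the paper): both groups are drawn from the same distribution, so any “significant” difference is a false positive, and checking for significance every few subjects makes such false positives far more likely than a single planned analysis would.

```python
import random
import statistics

def t_stat(a, b):
    """Welch's t statistic (absolute value) for two samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return abs(ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

def run_study(max_n, peek_every, threshold=2.0, rng=random):
    """Collect subjects in batches; declare 'significance' at the first
    peek where |t| crosses the threshold (roughly p < .05). Both groups
    come from the same N(0, 1), so True means a false positive."""
    a, b = [], []
    for _ in range(max_n // peek_every):
        a += [rng.gauss(0, 1) for _ in range(peek_every)]
        b += [rng.gauss(0, 1) for _ in range(peek_every)]
        if t_stat(a, b) > threshold:
            return True  # stopped early and claimed an effect
    return False

random.seed(1)
trials = 2000
fixed = sum(run_study(40, 40) for _ in range(trials)) / trials   # one planned look
peeking = sum(run_study(40, 5) for _ in range(trials)) / trials  # a look every 5 subjects
print(f"false-positive rate, single look: {fixed:.2%}")
print(f"false-positive rate, peeking:     {peeking:.2%}")
```

With a single look the false-positive rate stays near the nominal 5%; peeking every five subjects roughly triples it, even though every individual analysis looks like an ordinary t-test.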

Invalid comparisons of significant to nonsignificant:

Because experiments on infant subjects are very costly in terms of both time and money, throwing away data is highly undesirable. Instead, when faced with a struggling experiment using a trusted experimental paradigm, experimenters would regularly run another study that had higher odds of success. This was accomplished by varying one aspect of the experiment, such as the age of the participants. For instance, when one experiment with 14-month-olds failed, the experimenter reran the same study with 18-month-olds, which then succeeded. Once a significant result was achieved, the failures were no longer valueless. They now represented a part of a larger story: “Eighteen-month-olds can achieve behavior X, but 14-month-olds cannot.” Thus, the failed experiment becomes a boundary for the phenomenon.
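To see why that kind of comparison is invalid, consider a toy example (my numbers, not Peterson’s): two experiments can yield essentially identical effect estimates while one lands just under p = .05 and the other just over it. The valid comparison is a test of the difference between the estimates, which in this example gives no evidence that the age groups differ at all.

```python
import math

def z_test(effect, se):
    """Two-sided p-value for an estimate, under a normal approximation."""
    z = effect / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical numbers: identical effect estimates, slightly different precision.
z18, p18 = z_test(0.25, 0.12)  # "18-month-olds": p ~ .037, significant
z14, p14 = z_test(0.25, 0.14)  # "14-month-olds": p ~ .074, not significant

# The valid comparison tests the *difference* between the two estimates.
se_diff = math.sqrt(0.12**2 + 0.14**2)
z_diff, p_diff = z_test(0.25 - 0.25, se_diff)  # identical effects -> p = 1.0

print(f"18-month-olds: p = {p18:.3f}")
print(f"14-month-olds: p = {p14:.3f}")
print(f"difference:    p = {p_diff:.3f}")
```

This is the “difference between significant and not significant is not itself significant” problem: a story like “18-month-olds can do X but 14-month-olds cannot” needs a direct test of the age difference, not two separate verdicts against p = .05.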

And HARKing:

When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. I attended a meeting between a graduate student and her mentor at which they were trying to decipher some results the student had just received. Their meaning was not at all clear, and the graduate student complained that she was having trouble remembering the motivation for the study in the first place. Her mentor responded, “You don’t have to reconstruct your logic. You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”

A blunt explanation of this strategy was given to me by an advanced graduate student: “You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.” Rather than stay with the original, motivating hypothesis, researchers in developmental science learn to adjust to statistical significance. They then “fill out” the rest of the paper around this necessary core of psychological research.

Peterson discusses all this in light of recent debates about replicability and scientific practices in psychology. He says that researchers have basically three choices: limit the scope of their questions to what available methods can do well, relax expectations of what a rigorous study looks like, or engage in QRPs. I think that is basically right. It is why I believe any attempt to reduce QRPs has to be accompanied by changes to incentive structures, which govern the first two options.

Peterson also suggests that QRPs are “becoming increasingly unacceptable.” That may be true in public discourse, but the inside view presented by his ethnography suggests that unless more open practices become standard, labs will continue to have lots of opportunity to engage in them and little incentive not to.

UPDATE: I discuss what all this means in a followup post: Reading “The Baby Factory” in context.