Bold changes at Psychological Science

Style manuals sound like they ought to be boring things, full of arcane points about commas and whatnot. But Wikipedia’s style manual has an interesting admonition: Be bold. The idea is that if you see something that could be improved, you should dive in and start making it better. Don’t wait until you are ready to be comprehensive, and don’t fret about getting every detail perfect. That’s the path to paralysis. Wikipedia is an ongoing work in progress; your changes won’t be the last word, but you can make things better.

In a new editorial at Psychological Science, interim editor Stephen Lindsay is clearly following the “be bold” philosophy. He lays out a clear and progressive set of principles for evaluating research. Beware the “troubling trio” of low power, surprising results, and just-barely-significant results. Look for signs of p-hacking. Care about power and precision. Don’t confuse nonsignificant with null.
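To see why the “troubling trio” is troubling, a back-of-the-envelope calculation helps (this is my illustration, not anything from the editorial): a surprising hypothesis has low prior odds of being true, and with low power a just-barely-significant result carries little evidential weight, so the share of such findings that are true positives can be uncomfortably small. A minimal sketch in Python, with made-up numbers:

```python
# Back-of-the-envelope positive predictive value (PPV) calculation.
# The prior, power, and alpha values are illustrative assumptions,
# not estimates taken from the editorial.

def ppv(prior, power, alpha=0.05):
    """Probability that a significant result reflects a true effect."""
    true_pos = prior * power          # real effects that reach significance
    false_pos = (1 - prior) * alpha   # null effects that reach significance anyway
    return true_pos / (true_pos + false_pos)

# Surprising hypothesis (low prior) tested with low power:
print(ppv(prior=0.10, power=0.35))   # ~0.44 -- roughly a coin flip
# Plausible hypothesis tested with high power:
print(ppv(prior=0.50, power=0.90))   # ~0.95
```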

To people who have been paying attention to the science reform discussion of the last few years (and its longstanding precursors), none of this is new. What is new is that an editor of a prominent journal has clearly been reading and absorbing the last few years’ wave of careful and thoughtful scholarship on research methods and meta-science. And he is boldly acting on it.

I mean, yes, there are some things I am not 100% in love with in that editorial. Personally, I’d like to see more value placed on good exploratory research.* I’d like to see him discuss whether Psychological Science will be less results-oriented, since that is a major contributor to publication bias.** And I’m sure other people have their objections too.***

But… Improving science will forever be a work in progress. Lindsay has laid out a set of principles. In the short term, they will be interpreted and implemented by humans with intelligence and judgment. In the longer term, someone will eventually look at what is and is not working and will make more changes.

Are Lindsay’s changes as good as they could possibly be? The answers are (1) “duh” because obviously no and (2) “duh” because it’s the wrong question. Instead let’s ask, are these changes better than things have been? I’m not going to give that one a “duh,” but I’ll stand behind a considered “yes.”

———-

* Part of this is because in psychology we don’t have nearly as good a foundation of implicit knowledge and accumulated wisdom for differentiating good from bad exploratory research as we do for hypothesis-testing. So exploratory research gets a bad name because somebody hacks around in a tiny dataset and calls it “exploratory research,” and nobody has the language or concepts to say why they’re doing it wrong. I hope we can fix that. For starters, we could steal more ideas from the machine learning and genomics people, though we will need to adapt them to the particular features of our scientific problems. But that’s a blog post for another day.

** There are some nice comments about this already on the ISCON Facebook page. Dan Simons brought up the exploratory issue; Victoria Savalei raised the point about results-focus. My reactions to these issues are in part bouncing off of theirs.

*** When I got to the part about using confidence intervals to support the null, I immediately had a vision of steam coming out of some of the Twitter Bayesians’ ears.

Psychological Science to publish direct replications (maybe)

Pretty big news. Psychological Science is seriously discussing 3 new reform initiatives. They are outlined in a letter being circulated by Eric Eich, editor of the journal, and they come from a working group that includes top people from APS and several other scientists who have been active in working for reforms.

After reading it through (which I encourage everybody to do), here are my initial takes on the 3 initiatives:

Initiative 1: Create tutorials on power, effect size, and confidence intervals. There’s plenty of stuff out there already, but if PSci creates a good new source and funnels authors to it, it could be a good thing.

Initiative 2: Disclosure statements about the research process (such as how sample size was determined, unreported measures, etc.). This could end up being a good thing, but it will be complicated. Simine Vazire, one of the working group members who is quoted in the proposal, puts it well:

We are essentially asking people to “incriminate” themselves — i.e., reveal information that, in the past, editors have treated as reasons not to publish a paper. If we want authors to be honest, I think they will want some explicit acknowledgement that some degree of messiness (e.g., a null result here and there) will be tolerated and perhaps even treated as evidence that the entire set of findings is even more plausible (a la [Gregory] Francis, [Uli] Schimmack, etc.).

I bet there would be low consensus about what kinds and amounts of messiness are okay, because no one is accustomed to seeing that kind of information on a large scale in other people’s studies. It is also the case that things that are problematic in one subfield may be more reasonable in another. And reviewers and editors who lack the time or local expertise to really judge messiness against merit may fall back on simplistic heuristics rather than thinking things through in a principled way. (Any psychologist who has ever tried to say anything about causation, however tentative and appropriately bounded, in data that was not from a randomized experiment probably knows what that feels like.)
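As an aside, the Francis/Schimmack logic that Simine’s quote alludes to (an unbroken run of significant results from modestly powered studies is itself improbable) boils down to a simple calculation. Here is a minimal sketch with invented power values, just to show the shape of the argument:

```python
# Excess-significance logic (in the spirit of Francis / Schimmack).
# If every study in a set is significant, the probability of observing that
# is at most the product of the individual studies' power.
# The power values below are invented for illustration.
import math

estimated_power = [0.45, 0.50, 0.40, 0.55, 0.50]   # hypothetical per-study power

p_all_significant = math.prod(estimated_power)
print(f"P(all {len(estimated_power)} studies significant) = {p_all_significant:.3f}")
# ~0.025 -- a clean sweep from studies like these should itself be a rare event
```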

Another basic issue is whether people will be uniformly honest in the disclosure statements. I’d like to believe so, but without a plan for real accountability I’m not sure. If some people can get away with fudging the truth, the honest ones will be at a disadvantage.

Initiative 3: A special submission track for direct replications, with 2 dedicated Associate Editors and a system of pre-registration and prior review of protocols to allow publication decisions to be decoupled from outcomes. A replication section at a journal? If you’ve read my blog before you might guess that I like that idea a lot.

The section would be dedicated to studies previously published in Psychological Science, so in that sense it is in the same spirit as the Pottery Barn Rule. The pre-registration component sounds interesting — by putting a substantial amount of review in place before data are collected, it helps avoid the problem of replications getting suppressed because people don’t like the outcomes.

I feel mixed about another aspect of the proposal, limiting replications to “qualified” scientists. There does need to be some vetting, but my hope is that they will set the bar reasonably low. “This paradigm requires special technical knowledge” can too easily be cover for “only people who share our biases are allowed to study this effect.” My preference would be for a pro-data, pro-transparency philosophy. Make it easy for lots of scientists to run and publish replication studies, and make sure the replication reports include information about the replicating researchers’ expertise and experience with the techniques, methods, etc. Then meta-analysts can code the replicating lab’s expertise as a moderator variable and actually test how much expertise matters.
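To be concrete about that last point, here is a minimal sketch of what coding expertise as a moderator could look like, as a simple inverse-variance-weighted meta-regression in Python. The effect sizes, standard errors, and 0/1 expertise coding are all invented; a real analysis would likely use a dedicated meta-analysis package and a random-effects model.

```python
# Minimal fixed-effect meta-regression sketch: does replicator expertise
# moderate replication effect sizes? All numbers are made up for illustration.
import numpy as np
import statsmodels.api as sm

effect = np.array([0.42, 0.10, 0.05, 0.38, 0.15, 0.31])  # replication effect sizes (d)
se     = np.array([0.15, 0.12, 0.10, 0.18, 0.11, 0.14])  # their standard errors
expert = np.array([1,    0,    0,    1,    0,    1])      # 1 = experienced with the paradigm

X = sm.add_constant(expert)                    # intercept + expertise moderator
fit = sm.WLS(effect, X, weights=1 / se**2).fit()
print(fit.params)   # the slope on `expert` estimates how much expertise matters
```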

My big-picture take. Retraction Watch just reported yesterday on a study showing that retractions, especially retractions due to misconduct, cause promising scientists to move to other fields and funding agencies to direct dollars elsewhere. Between alleged fraud cases like Stapel, Smeesters, and Sanna, and all the attention going to false-positive psychology and questionable research practices, psychology (and especially social psychology) is almost certainly at risk of a loss of talent and money.

Getting one of psychology’s top journals to make real reforms, with the institutional backing of APS, would go a long way to counteract those negative effects. A replication desk in particular would leapfrog psychology past what a lot of other scientific fields do. Huge credit goes to Eric Eich and everyone else at APS and the working group for trying to make real reforms happen. It stands a real chance of making our science better and improving our credibility.

Secular trends in publication bias

Abstract of “Negative results are disappearing from most disciplines and countries” (PDF) by Daniele Fanelli, Scientometrics, 2012 (thanks to Brent Roberts for forwarding it):

Concerns that the growing competition for funding and citations might distort science are frequently discussed, but have not been verified directly. Of the hypothesized problems, perhaps the most worrying is a worsening of positive-outcome bias. A system that disfavours negative results not only distorts the scientific literature directly, but might also discourage high-risk projects and pressure scientists to fabricate and falsify their data. This study analysed over 4,600 papers published in all disciplines between 1990 and 2007, measuring the frequency of papers that, having declared to have ‘‘tested’’ a hypothesis, reported a positive support for it. The overall frequency of positive supports has grown by over 22% between 1990 and 2007, with significant differences between disciplines and countries. The increase was stronger in the social and some biomedical disciplines. The United States had published, over the years, significantly fewer positive results than Asian countries (and particularly Japan) but more than European countries (and in particular the United Kingdom). Methodological artefacts cannot explain away these patterns, which support the hypotheses that research is becoming less pioneering and/or that the objectivity with which results are produced and published is decreasing.

My reactions…

Sarcastic: Together with the Flynn effect, this is clearly a sign that we’re getting smarter.

Not: There is no single solution to this problem, but my proposal is something you could call the Pottery Barn Rule for journals. Once a journal publishes a study, it should be obliged to publish any and all exact or near-exact replication attempts in an online supplement, and link to such attempts from the original article. That would provide a guaranteed outlet for people to run exact replication attempts, something we do not do nearly enough of. And it would create an incentive for authors, editors, and publishers to be rigorous since non-replications would be hung around the original article’s neck. (And if nobody bothers to try to replicate the study, that would probably be a sign of something too.)

An editorial board discusses fMRI analysis and “false-positive psychology”

Update 1/3/2012: I have seen a few incoming links describing the Psych Science email discussion as “leaked” or “made public.” For the record, the discussion was forwarded to me from someone who got it from a professional listserv, so it was already out in the open and circulating before I posted it here. Considering that it was carefully redacted and compiled for circulation by the incoming editor-in-chief, I don’t think “leaked” is a correct term at all (and “made public” happened before I got it).

***

I recently got my hands on an email discussion among the Psychological Science editorial board. The discussion is about whether or how to implement recommendations by Poldrack et al. (2008) and Simmons, Nelson, and Simonsohn (2011) for research methods and reporting. The discussion is well worth reading and appears to be in circulation already, so I am posting it here for a wider audience. (All names were redacted by Eich except those of the senior editor, John Jonides, and Eich himself, who compiled the discussion; commenters are instead numbered.)

The Poldrack paper proposes guidelines for reporting fMRI experiments. The Simmons paper is the much-discussed “false-positive psychology” paper that was itself published in Psych Science. The argument in the latter is that slippery research and reporting practices can produce “researcher degrees of freedom” that inflate Type I error. To reduce these problems, they make 6 recommendations for researchers and 4 for journals.

There are a lot of interesting things to come out of the discussion. Regarding the Poldrack paper, the discussion apparently got started when a student of Jonides analyzed the same fMRI dataset under several different defensible methods and assumptions and got totally different results. I can believe that — not because I have extensive experience with fMRI analysis (or any hands-on experience at all), but because that’s true with any statistical analysis where there is not strong and widespread consensus on how to do things. (See covariate adjustment versus difference scores.)
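For readers who have not run into the covariate-adjustment-versus-difference-scores issue, here is a small simulation (the data-generating process is entirely invented) showing how two defensible analyses of the same pre/post data can support different conclusions about a group effect:

```python
# Same data, two defensible analyses, different answers (Lord's paradox flavor).
# The data-generating process below is invented purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
group = np.repeat([0, 1], n)                       # two non-randomized groups
trait = rng.normal(loc=group * 1.0, scale=1.0)     # groups differ at baseline
pre  = trait + rng.normal(scale=1.0, size=2 * n)   # noisy pretest
post = trait + rng.normal(scale=1.0, size=2 * n)   # noisy posttest, no true change

# Analysis 1: difference scores -- the group effect comes out near zero
diff_fit = sm.OLS(post - pre, sm.add_constant(group)).fit()

# Analysis 2: covariate adjustment (ANCOVA) -- the group effect looks substantial
ancova_fit = sm.OLS(post, sm.add_constant(np.column_stack([group, pre]))).fit()

print("difference-score group effect:", round(diff_fit.params[1], 2))    # ~0.0
print("ANCOVA group effect:          ", round(ancova_fit.params[1], 2))  # ~0.5
```

Neither analysis is wrong in the abstract; they answer different questions, and which one a paper reports can flip the headline result.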

The other thing about the Poldrack discussion that caught my attention was commenter #8, who asked that more attention be given to selection and determination of ROIs. S/he wrote:

We, as psychologists, are not primarily interested in exploring the brain. Rather, we want to harness fMRI to reach a better understanding of psychological process. Thus, the choice of the various ROIs should be derived from psychological models (or at least from models that are closely related to psychological mechanisms). Such a justification might be an important editorial criterion for fMRI studies submitted to a psychological journal. Such a psychological model might also include ROIs where NO activity is expected, control regions, so to speak.

A.k.a. convergent and discriminant validity. (Once again, the psychometricians were there first.) A lot of research that is billed (in the press or in the scientific reports themselves) as reaching new conclusions about the human mind is really, when you look closely, using established psychological theories and methods as a framework to explore the brain. Which is a fine thing to do, and in fact is a necessary precursor to research that goes the other way, but shouldn’t be misrepresented.

Turning to the Simmons et al. piece, there was a lot of consensus that it had some good ideas but went too far, which is similar to what I thought when I first read the paper. Some of the Simmons recommendations were so obviously important that I wondered why they needed to be made at all, because doesn’t everybody know them already? (E.g., running analyses while you collect data and using p-values as a stopping rule for sample size — a definite no-no.) The fact that Simmons et al. thought this needed to be said makes me worried about the rigor of the average research paper. Others of their recommendations seemed rather rigid and targeted toward a pretty small subset of research designs. The n>20 rule and the “report all your measures” rule might make sense for small-and-fast randomized experiments of the type the authors probably mostly do themselves, but may not work for everything (case studies, intensive repeated-measures studies, large multivariate surveys and longitudinal studies, etc.).
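For anyone who doubts that peeking at the data and stopping when p < .05 is a real problem, a quick simulation under the null hypothesis (my sketch, not something from Simmons et al.) shows how far the false-positive rate drifts above the nominal 5%:

```python
# Optional stopping under the null: test after every 10 subjects per group
# and stop as soon as p < .05. A quick illustrative simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 2000
false_positives = 0

for _ in range(n_sims):
    a, b = rng.normal(size=100), rng.normal(size=100)   # no true effect at all
    for n in range(10, 101, 10):                        # peek every 10 per group
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1
            break

print(false_positives / n_sims)   # well above .05 -- close to .20 with ten looks
```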

Commenter #8 (again) had something interesting to say about a priori predictions:

It is always the educated reader who needs to be persuaded using convincing methodology. Therefore, I am not interested in the autobiography of the researcher. That is, I do not care whether s/he has actually held the tested hypothesis before learning about the outcomes…

Again, an interesting point. When there is not a strong enough theory that different experts in that theory would have drawn the same hypotheses independently, maybe a priori doesn’t mean much? Or put a little differently: a priori should be grounded in a publicly held and shared understanding of a theory, not in the contents of an individual mind.

Finally, a general point that many people made was that Psych Science (and for that matter, any journal nowadays) should make more use of supplemental online materials (SOM). Why shouldn’t stimuli, scripts, measures, etc. — which are necessary to conduct exact replications — be posted online for every paper? In current practice, if you want to replicate part or all of someone’s procedure, you need to email the author. Reviewers almost never have access to this material, which means they cannot evaluate it easily. I have had the experience of getting stimuli or measures for a published study and seeing stuff that made me worry about demand characteristics, content validity, etc. That has made me wonder why reviewers are not given the opportunity to closely review such crucial materials as a matter of course.

Journals can be groundbreaking or definitive, not both

I was recently invited to contribute to Personality and Social Psychology Connections, an online journal of commentary (read: fancy blog) run by SPSP. Don Forsyth is the editor, and the contributors include David Dunning, Harry Reis, Jennifer Crocker, Shige Oishi, Mark Leary, and Scott Allison. My inaugural post is titled “Groundbreaking or definitive? Journals need to pick one.” Excerpt:

Do our top journals need to rethink their missions of publishing research that is both groundbreaking and definitive? And as a part of that, do they — and we scientists — need to reconsider how we engage with the press and the public?…

In some key ways groundbreaking is the opposite of definitive. There is a lot of hard work to be done between scooping that first shovelful of dirt and completing a stable foundation. And the same goes for science (with the crucial difference that in science, you’re much more likely to discover along the way that you’ve started digging on a site that’s impossible to build on). “Definitive” means that there is a sufficient body of evidence to accept some conclusion with a high degree of confidence. And by the time that body of evidence builds up, the idea is no longer groundbreaking.

Read it here.


How should journals handle replication studies?

Recently Ben Goldacre wrote about a group of researchers (Stuart Ritchie, Chris French, and Richard Wiseman) whose null replication of 3 experiments from the infamous Bem ESP paper was rejected by JPSP – the same journal that published Bem’s paper.

JPSP is the flagship journal in my field, and I’ve published in it and I’ve reviewed for it, so I’m reasonably familiar with how it ordinarily works. It strives to publish work that is theory-advancing. I haven’t seen the manuscript, but my understanding is that the Ritchie et al. experiments were exact replications (not “replicate and extend” studies). In the usual course of things, I wouldn’t expect JPSP to accept a paper that only reported exact replication studies, even if their results conflicted with the original study.

However, the Bem paper was extraordinary in several ways. I had two slightly different lines of thinking about JPSP’s rejection.

My first thought was that given the extraordinary nature of the Bem paper, maybe JPSP has a special obligation to go outside of its usual policy. Many scientists think that Bem’s effects are impossible, which created the big controversy around the paper. So in this instance, a null replication has a special significance that it usually would not. That would be especially true if the results reported by Ritchie et al. fell outside of the Bem studies’ replication interval (i.e., if they statistically conflicted; I don’t know whether or not that is the case).
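For what it’s worth, “statistically conflicted” could be checked with something like a prediction interval for the replication effect: given the original estimate and both standard errors, the range where a replication should land if the two studies are estimating the same effect. A hedged sketch with invented numbers (I have not seen the actual estimates from either paper):

```python
# Does a replication estimate fall outside the range the original "predicts"?
# The effect sizes and standard errors below are invented placeholders,
# not the actual numbers from Bem or from Ritchie et al.

d_orig, se_orig = 0.25, 0.10   # hypothetical original effect size (d) and its SE
d_rep,  se_rep  = 0.02, 0.09   # hypothetical replication effect size and its SE

# 95% prediction interval for a single replication, assuming both studies
# estimate the same underlying effect (1.96 = normal critical value)
margin = 1.96 * (se_orig**2 + se_rep**2) ** 0.5
lo, hi = d_orig - margin, d_orig + margin

print(f"replication should land in roughly [{lo:.2f}, {hi:.2f}]")
print("statistically conflicting:", not (lo <= d_rep <= hi))
```

With these made-up numbers the answer is “no conflict,” which is a reminder that a nonsignificant replication does not automatically contradict the original estimate.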

My second line of thinking was slightly different. Some people have suggested that the Bem paper shines a light on shortcomings of our usual criteria for what constitutes good methodology. Tal Yarkoni made this argument very well. In short: the Bem paper was judged by the same standard that other papers are judged by. So the fact that an effect that most of us consider impossible was able to pass that standard should cause us to question the standard, rather than just attacking the paper.

So by that same line of thinking, maybe the rejection of the Ritchie et al. null replication should make us rethink the usual standards for how journals treat replications. Prior to electronic publication — in an age when journal pages were scarce and expensive — the JPSP policy made sense for a flagship journal that strived to be “theory advancing.” But a consequence of that kind of policy is that exact replication studies are undervalued. Since researchers know from the outset that the more prestigious journals won’t publish exact replications, there is little incentive to invest time and energy in running them. Replications still get run, but often only if a researcher can think of some novel extension, like a moderator variable or a new condition to compare the old ones to. And then the results might only get published if the extension yields a novel and statistically significant result.

But nowadays, in the era of electronic publication, why couldn’t a journal also publish an online supplement of replication studies? Call it “JPSP: Replication Reports.” It would be a home for all replication attempts of studies originally published in the journal. This would have benefits for individual investigators, for journals, and for the science as a whole.

For individual investigators, it would be an incentive to run and report exact replication studies simply to see if a published effect can be reproduced. The market – that is, hiring and tenure committees – would sort out how much credit to give people for publishing such papers, in relation to the more usual kind. Hopefully it would be greater than zero.

For journals, it would be additional content and added value to users of their online services. Imagine if every time you viewed the full text of a paper, there was a link to a catalog of all replication attempts. In addition to publishing and hosting replication reports, journals could link to replicate-and-extend studies published elsewhere (e.g., as a subset of a “cited by” index). That would be a terrific service to their customers.

For the science, it would be valuable to encourage and document replications better than we currently do. When you look up an article, you could immediately and easily see how well the effect has survived replication attempts. It would also help us organize information better for meta-analyses and the like. It would help us keep labs and journals honest by tracking phenomena like the notorious decline effect and publication bias. In the short term that might be bad for some journals (I’d guess that journals that focus on novel and groundbreaking research are going to show stronger decline curves). But in the long run, it would be another index (alongside impact factors and the like) of the quality of a journal — which the better journals should welcome if they really think they’re doing things right. It might even lead to improvement of some of the problems that Tal discussed. If researchers, editors, and publishers knew that failed replications would be tied around the neck of published papers, there would be an incentive to improve quality and close some methodological holes.

Are there downsides that I’m not thinking of? Probably. Would there be barriers to adopting this? Almost certainly. (At a minimum, nobody likes change.) Is this a good idea? A terrible idea? Tell me in the comments.

Postscript: After I drafted this entry and was getting ready to post it, I came across this article in New Scientist about the rejection. It looks like Richard Wiseman already had a similar idea:

“My feeling is that the whole system is out of date and comes from a time when journal space was limited.” He argues that journals could publish only abstracts of replication studies in print, and provide the full manuscript online.

MIT restricts academic freedom?

According to an article at Ars Technica, the faculty at MIT have voted to require that all academic publications be open-access. More specifically, the policy requires that when submitting an article to a journal publisher, authors must grant MIT a license to distribute the work for free, and authors have to provide the publication to the MIT provost. If you want to publish with a journal that refuses to allow open access, you have to submit a written request and get approval from the provost.

I’m all for open public access. But I am also all for academic freedom. When a university dictates where its faculty can publish, that seems to me to set a dangerous precedent. If a university can say that faculty cannot publish in Journal X because the university doesn’t like the journal’s copyright policy, who’s to say that the next step isn’t “Don’t publish in Journal Y because we don’t like their editorial position on [fill in controversial issue here]”?