Everything is fucked: The syllabus

PSY 607: Everything is Fucked
Prof. Sanjay Srivastava
Class meetings: Mondays 9:00 – 10:50 in 257 Straub
Office hours: Held on Twitter at your convenience (@hardsci)

In a much-discussed article at Slate, social psychologist Michael Inzlicht told a reporter, “Meta-analyses are fucked” (Engber, 2016). What does it mean, in science, for something to be fucked? Fucked needs to mean more than that something is complicated or must be undertaken with thought and care, as that would be trivially true of everything in science. In this class we will go a step further and say that something is fucked if it presents hard conceptual challenges to which implementable, real-world solutions for working scientists are either not available or routinely ignored in practice.

The format of this seminar is as follows: Each week we will read and discuss 1-2 papers that raise the question of whether something is fucked. Our focus will be on things that may be fucked in research methods, scientific practice, and philosophy of science. The potential fuckedness of specific theories, research topics, etc. will not be the focus of this class per se, but rather will be used to illustrate these important topics. To that end, each week a different student will be assigned to find a paper that illustrates the fuckedness (or lack thereof) of that week’s topic, and give a 15-minute presentation about whether it is indeed fucked.


20% Attendance and participation
30% In-class presentation
50% Final exam

Week 1: Psychology is fucked

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

Week 2: Significance testing is fucked

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E. J. (2016). Is there a free lunch in inference? Topics in Cognitive Science, 8, 520-547.

Week 3: Causal inference from experiments is fucked

Chapter 3 from: Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Week 4: Mediation is fucked

Bullock, J. G., Green, D. P., & Ha, S. E. (2010). Yes, but what’s the mechanism?(don’t expect an easy answer). Journal of Personality and Social Psychology, 98, 550-558.

Week 5: Covariates are fucked

Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166-178.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PloS one, 11, e0152719.

Week 6: Replicability is fucked

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531-536.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Week 7: Interlude: Everything is fine, calm the fuck down

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 251, 1037a.

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70, 487-498.

Week 8: Scientific publishing is fucked

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891-904.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2, e124.

Week 9: Meta-analysis is fucked

Inzlicht, M., Gervais, W., & Berkman, E. (2015). Bias-Correction Techniques Alone Cannot Determine Whether Ego Depletion is Different from Zero: Commentary on Carter, Kofler, Forster, & McCullough, 2015. Available at SSRN: http://ssrn.com/abstract=2659409 or http://dx.doi.org/10.2139/ssrn.2659409

Van Elk, M., Matzke, D., Gronau, Q. F., Guan, M., Vandekerckhove, J., & Wagenmakers, E. J. (2015). Meta-analyses are no substitute for registered replications: A skeptical perspective on religious priming. Frontiers in Psychology, 6.

Week 10: The scientific profession is fucked

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631.

Finals week

Wear black and bring a #2 pencil.

Don’t change your family-friendly tenure extension policy just yet

pixelated something

If you are an academic and on social media, then over the last weekend your feed was probably full of mentions of an article by economist Justin Wolfers in the New York Times titled “A Family-Friendly Policy That’s Friendliest to Male Professors.”

It describes a study by three economists of the effects of parental tenure extension policies, which give an extra year on the tenure clock when people become new parents. The conclusion is that tenure extension policies do make it easier for men to get tenure, but they unexpectedly make it harder for women. The finding has a counterintuitive flavor – a policy couched in gender-neutral terms and designed to help families actually widens a gender gap.

Except there are a bunch of odd things that start to stick out when you look more closely at the details, and especially at the original study.

Let’s start with the numbers in the NYT writeup:

The policies led to a 19 percentage-point rise in the probability that a male economist would earn tenure at his first job. In contrast, women’s chances of gaining tenure fell by 22 percentage points. Before the arrival of tenure extension, a little less than 30 percent of both women and men at these institutions gained tenure at their first jobs.

Two things caught my attention when I read this. First, that a 30% tenure rate sounded awfully low to me (this is at the top-50 PhD-granting economics departments). Second, that tenure extension policies took the field from parity (“30 percent of both men and women”) to a 6-to-1 lopsided rate favoring men (the effects are percentage points, so it goes to a 49% tenure rate for men vs. 8% for women). That would be a humongous effect size.

Regarding the 30% tenure rate, it turns out the key words are “at their first jobs.” This analysis compared people who got tenure at their first job to everybody else — which means that leaving for a better outside offer is treated the same in this analysis as being denied tenure. So the tenure-at-first-job variable is not a clear indicator of whether the policy is helping or hurting a career. What if you look at the effect of the policy on getting tenure anywhere? The authors did that, and they summarize the analysis succinctly: “We find no evidence that gender-neutral tenure clock stopping policies reduce the fraction of women who ultimately get tenure somewhere” (p. 4). That seems pretty important.

What about that swing from gender-neutral to a 6-to-1 disparity in the at-first-job analysis? Consider this: “There are relatively few women hired at each university during the sample period. On average, only four female assistant professors were hired at each university between 1985 and 2004, compared to 17 male assistant professors” (p. 17). That was a stop-right-there moment for me: if you are an economics department worried about gender equality, maybe instead of rethinking tenure extensions you should be looking at your damn hiring practices. But as far as the present study goes, there are n = 62 women at institutions that never adopted gender-neutral tenure extension policies, and n = 129 at institutions that did. (It’s even worse than that because only a fraction of them are relevant for estimating the policy effect; more on that below). With a small sample there is going to be a lot of uncertainty in the estimates under the best of conditions. And it’s not the best of conditions: Within the comparison group (the departments that never adopted a tenure extension policy), there are big, differential changes in men’s and women’s tenure rates over the study period (1985 to 2004): Over time, men’s tenure rate drops by about 25%, and women’s tenure rate doubles from 12% to 25%. Any observed effect of a department adopting a tenure-extension policy is going to have to be estimated in comparison to that noisy, moving target.

Critically, the statistical comparison of tenure-extension policy is averaged over every assistant professor in the sample, regardless of whether the individual professor used the policy. (The authors don’t have data on who took a tenure extension, or even on who had kids.) But causation is only defined for those individuals in whom we could observe a potential outcome at either level of the treatment. In plain English: “How does this policy affect people” only makes sense for people who could have been affected by the policy — meaning people who had kids as assistant professors, and therefore could have taken an extension if one were available. So if the policy did have an effect in this dataset, we should expect it to be a very small one because we are averaging it with a bunch of cases that by definition could not possibly show the effect. In light of that, a larger effect should make us more skeptical, not more persuaded.

There is also the odd finding that in departments that offered tenure extension policies, men took less time to get to tenure (about 1 year less on average). This is the opposite of what you’d expect if “men who took parental leave used the extra year to publish their research” as the NYT writeup claims. The original study authors offer a complicated, speculative story about why time-to-tenure would not be expected to change in the obvious way. If you accept the story, it requires invoking a bunch of mechanisms that are not measured in the paper and likely would add more noise and interpretive ambiguity to the estimates of interest.

There were still other analytic decisions that I had trouble understanding. For example, the authors excluded people who had 0 or 1 publication in their first 2 years. Isn’t this variance to go into the didn’t-get-tenure side of the analyses? And the analyses includes a whole bunch of covariates without a lot of discussion (and no pre-registration to limit researcher degrees of freedom). One of the covariates had a strange effect: holding a degree from a top-10 PhD-granting institution makes it less likely that you will get tenure in your first job. This does make sense if you think that top-10 graduates are likely to get killer outside offers – but then that just reinforces the lack of clarity about what the tenure-in-first-job variable is really an indicator of.

But when all is said and done, probably the most important part of the paper is two sentences right on the title page:

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character.

The NYT writeup does no such thing; in fact it goes the opposite direction, trying to draw broad generalizations and make policy recommendations. This is no slight against the study’s original authors – it is typical in economics to circulate working papers for discussion and critique. Maybe they’d have compelling responses to everything I said, who knows? But at this stage, I have a hard time seeing how this working paper is ready for a popular media writeup for general consumption.

The biggest worry I have is that university administrators might take these results and run with them. I do agree with the fundamental motivation for doing this study, which is that policies need to be evaluated by their effects. Sometimes superficially gender-neutral policies have disparate impacts when they run into the realities of biological and social roles (“primary caregiver” leave policies being a case in point). It’s fairly obvious that in many ways the academic workplace is not structured to support involved parenthood, especially motherhood. But are tenure extension policies making the problem worse, better, or are they indifferent? For all the reasons outlined above, I don’t think this research gives us an actionable answer. Policy should come from cumulative knowledge, not noisy and ambiguous preliminary findings.

In the meantime, what administrator would not love to be able to put on the appearance of We Are Doing Something by rolling back a benefit? It would be a lot cheaper and easier than fixing disparities in the hiring process, providing subsidized child care, or offering true paid leave. I hope this piece does not license them to do that.

Thanks to Ryan Light and Rich Lucas, who dug into the paper and first raised some of the issues I discuss here in response to my initial perplexed tweet.

Does psilocybin cause changes in personality? Maybe, but not so fast

This morning I came across a news article about a new study claiming that psilocybin (the active ingredient in hallucinogenic mushrooms) causes lasting changes in personality, specifically the Big Five factor of openness to experience.

It was hard to make out methodological details from the press report, so I looked up the journal article (gated). The study, by Katherine MacLean, Matthew Johnson, and Roland Griffiths, was published in the Journal of Psychopharmacology. When I read the abstract I got excited. Double blind! Experimentally manipulated! Damn, I thought, this looks a lot better than I thought it was going to be.

The results section was a little bit of a letdown.

Here’s the short version: Everybody came in for 2 to 5 sessions. In session 1 some people got psilocybin and some got a placebo (the placebo was methylphenidate, a.k.a., Ritalin; they also counted as “placebos” some people who got a very low dose of psilocybin in their first session). What the authors report is a significant increase in NEO Openness from pretest to after the last session. That analysis is based on the entire sample of N=52 (everybody got an active dose of psilocybin at least once before the study was over). In a separate analysis they report no significant change from pretest to after session 1 for the n=32 people who got the placebo first. So they are basing a causal inference on the difference between significant and not significant. D’oh!

To make it (even) worse, the “control” analysis had fewer subjects, hence less power, than the “treatment” analysis. So it’s possible that openness increased as much or even more in the placebo contrast as it did in the psilocybin contrast. (My hunch is that’s not what happened, but it’s not ruled out. They didn’t report the means.)

None of this means there is definitely no effect of psilocybin on Openness; it just means that the published paper doesn’t report an analysis that would answer that question. I hope the authors, or somebody else, come back with a better analysis. (A simple one would be a 2×2 ANOVA comparing pretest versus post-session-1 for the placebo-first versus psilocybin-first subjects. A slightly more involved analysis might involve a multilevel model that could take advantage of the fact that some subjects had multiple post-psilocybin measurements.)

Aside from the statistics, I had a few observations.

One thing you’d worry about with this kind of study – where the main DV is self-reported – is demand or expectancy effects on the part of subjects. I know it was double-blind, but they might have a good idea about whether they got psilocybin. My guess is that they have some pretty strong expectations about how shrooms are supposed to affect them. And these are people who volunteered to get dosed with psilocybin, so they probably had pretty positive expectations. I wouldn’t call the self-report issue a dealbreaker, but in a followup I’d love to see some corroborating data (like peer reports, ecological momentary assessments, or a structured behavioral observation of some kind).

On the other hand, they didn’t find changes in other personality traits. If the subjects had a broad expectation that psilocybin would make them better people, you would expect to see changes across the board. If their expectations were focused around Openness-related traits, that’s less relevant.

If you accept the validity of the measures, it’s also noteworthy that they didn’t get higher in neuroticism — which is not consistent with what the government tells you will happen if you take shrooms.

One of the most striking numbers in the paper is the baseline sample mean on NEO Openness — about 64. That is a T-score (normed [such as it is] to have a mean = 50, SD = 10). So that means that in comparison to the NEO norming sample, the average person in this sample was about 1.4 SDs above the mean — which is above the 90th percentile — in Openness. I find that to be a fascinating peek into who volunteers for a psilocybin study. (It does raise questions about generalizability though.)

Finally, because psilocybin was manipulated within subjects, the long-term (one year-ish) followup analysis did not have a control group. Everybody had been dosed. They predicted Openness at one year out based on the kinds of trip people reported (people who had a “complete mystical experience” also had the sustained increase in openness). For a much stronger inference, of course, you’d want to manipulate psilocybin between subjects.

Do not use what I am about to teach you

I am gearing up to teach Structural Equation Modeling this fall term. (We are on quarters, so we start late — our first day of classes is next Monday.)

Here’s the syllabus. (pdf)

I’ve taught this course a bunch of times now, and each time I teach it I add more and more material on causal inference. In part it’s a reaction to my own ongoing education and evolving thinking about causation, and in part it’s from seeing a lot of empirical work that makes what I think are poorly supported causal inferences. (Not just articles that use SEM either.)

Last time I taught SEM, I wondered if I was heaping on so many warnings and caveats that the message started to veer into, “Don’t use SEM.” I hope that is not the case. SEM is a powerful tool when used well. I actually want the discussion of causal inference to help my students think critically about all kinds of designs and analyses. Even people who only run randomized experiments could benefit from a little more depth than the sophomore-year slogan that seems to be all some researchers (AHEM, Reviewer B) have been taught about causation.

Modeling the Jedi Theory of Emotions

Today I gave my structural equation modeling class the following homework:

In Star Wars I: The Phantom Menace, Yoda presented the Jedi Theory of Emotions:  “Fear is the path to the dark side. Fear leads to anger. Anger leads to hate. Hate leads to suffering.”

1. Specify the Jedi Theory of Emotions as a path model with 4 variables (FEAR, ANGER, HATE, and SUFFERING). Draw a complete path diagram, using lowercase Roman letters (a, b, c, etc.) for the causal parameters.

2. Were there any holes or ambiguities in the Jedi Theory (as stated by Yoda) that required you to make theoretical assumptions or guesses? What were they?

3. Using the tracing rule, fill in the model-implied correlation matrix (assuming that all variables are standardized):


4. Generate a plausible equivalent model. (An equivalent model is a model that specifies a different causal structure but implies the same correlation matrix.)

5. Suppose you run a study and collect data on these four variables. Your data gives you the following correlation matrix.

ANGER .5 1
HATE .3 .6 1
SUFFERING .4 .3 .5 1

Is the Jedi Theory a good fit to the data? In what way(s), if any, would you revise the model?

Some comments…

For #1, everybody always comes up with a recursive, full mediation model — e.g., fear only causes hate via anger as an intervening cause, and there are no loops or third-variable associations between fear and hate, etc. It’s an opportunity to bring up the ambiguity of theories expressed in natural language: just because Yoda didn’t say “and anger can also cause fear sometimes too,” does that mean he’s ruling that out?

Relatedly, observational data will only give you unbiased causal estimates — of the effect of fear on anger, for example — if you assume that Yoda gave a complete and correct specification of the true causal structure (or if you fill in the gaps yourself and include enough constraints to identify the model). How much do you trust Yoda’s model? Questions 4 and 5 are supposed to help students to think about ways in which the model could and could not be falsified.

In a comment on an earlier post, I repeated an observation I once heard someone make, that psychologists tend to model all relationships as zero unless given reason to think otherwise, whereas econometricians tend to model all relationships as free parameters unless given reason to think otherwise. I’m not sure why that is the case (maybe a legacy of NHST in experimental psychology, where you’re supposed to start by hypothesizing a zero relationship and then look for reasons to reject that hypothesis). At any rate, if you think like an econometrician and come from the no true zeroes school of thought, you’ll need something more than just observational data on 4 variables in order to test this model. That makes the Jedi Theory a tough nut to crack. Experimental manipulation gets ethically more dubious as you proceed down the proposed causal chain. And I’m not sure how easy it would be to come up with good instruments for all of these variables.

I also briefly worried that I might be sucking the enjoyment out of the movie. But then I remembered that the quote is from The Phantom Menace, so that’s already been done.

Prepping for SEM

I’m teaching the first section of a structural equation modeling class tomorrow morning. This is the 3rd time I’m teaching the course, and I find that the more times I teach it, the less traditional SEM I actually cover. I’m dedicating quite a bit of the first week to discussing principles of causal inference, spending the second week re-introducing regression as a modeling framework (rather than a toolbox statistical test), and returning to causal inference later when we talk about path analysis and mediation (including assigning a formidable critique by John Bullock et al. coming out soon in JPSP).

The reason I’m moving in that direction is that I’ve found that a lot of students want to rush into questionable uses of SEM without understanding what they’re getting into. I’m probably guilty of having done that, and I’ll probably do it again someday, but I’d like to think I’m learning to be more cautious about the kinds of inferences I’m willing to make. To people who don’t know better, SEM often seems like magical fairy dust that you can sprinkle on cross-sectional observational data to turn it into something causally conclusive. I’ve probably been pretty far on the permissive end of the spectrum that Andrew Gelman talks about, in part because I think experimental social psychology sometimes overemphasizes internal validity to the exclusion of external validity (and I’m not talking about the special situations that Mook gets over-cited for). But I want to instill an appropriate level of caution.

BTW, I just came across this quote from Donald Campbell and William Shadish: “When it comes to causal inference from quasi-experiments, design rules, not statistics.” I’d considered writing “IT’S THE DESIGN, STUPID” on the board tomorrow morning, but they probably said it nicer.

Causality, genes, and the law

Ewen Callaway in New Scientist reports:

In 2007, Abdelmalek Bayout admitted to stabbing and killing a man and received a sentenced of 9 years and 2 months. Last week, Nature reported that Pier Valerio Reinotti, an appeal court judge in Trieste, Italy, cut Bayout’s sentence by a year after finding out he has gene variants linked to aggression. Leaving aside the question of whether this link is well enough understood to justify Reinotti’s decision, should genes ever be considered a legitimate defence?

Short answer: probably not.

Long answer: This reminds me of an issue I have with the Rubin Causal Model. In Holland’s 1986 paper on the RCM, he has a section titled “What can be a cause?” He introduces the notion of potential exposability – basically the idea that something can only be a cause if you could, in principle, manipulate it. He contrasts causes with attributes – features of individuals that are part of the definition of the individual. He uses as an example the statement, “She did well on the exam because she is a woman.” Gender can be statistically associated (correlated) with an outcome, but it cannot be a cause (according to Holland and I believe Rubin as well), because the person who did well on the exam would not be the same person if “she” weren’t a woman.

From a scientific/philosophical level, I’ve never liked the way they make the cause/attribute distinction. The RCM is so elegant and logical and principled, and then they tack on this very pragmatic and mushy issue of what can and cannot be manipulated. If technology changes to where something becomes manipulable, or if someone else thinks of a manipulation that escapes the researcher’s imagination (sex reassignment surgery?), things can shift back and forth from being classed as causes versus as attributes. Philosophically speaking: Blech. Plus, it leads to places I don’t really like. What about: “Jane didn’t get the job because she is a woman.” Is Holland saying that we cannot say that an applicant’s gender affected the employer’s hiring decision?

I think we just need to be better about defining the units and the nature of the counterfactuals. If we are trying to draw inferences about Jane, as she existed on a specific date and time and location, and therefore as a principled matter of defining the question (not as a pragmatic concern) we take as an a priori fact that Jane for the purposes of this problem has to be a woman, then okay, we’ve defined our problem space in a particular way that excludes “is a man” as a potential state of Jane. But if we are trying to draw inferences in which the units are exam-takers or job applicants, and Jane is one of many potential members of that population of units, then we’re dealing with a totally different question. In that case, we could have had either a man or a woman take the exam or apply for the job. Put another way: what is the counterfactual to Jane taking the exam or Jane applying for the job? If Jane could have been John for purposes of the problem that we are trying to solve, then it makes perfectly good sense to say that “Jane did well on the exam because she is a woman” is a coherent causal inference. It goes back to a principled matter of how we have defined the problem. Not a practical question of manipulability.

So back to the criminal… Holland (and Rubin) would make the question, “Is the MAOA-L variant a cause or an attribute?” And then they’d get into questions of whether you could manipulate that gene. And right now we cannot, so it’s an attribute; but maybe someday we’ll be able to, and then it’ll be a cause.

But I’d instead approach it by asking: what are the units, and what’s the counterfactual? To a scientist, it makes perfect sense to formulate a causal-inference problem in which the universe of units consists of all possible persons. Then we compare two persons whose genomes are entirely identical except for their MAOA variant, and we ask what the potential outcomes would be if one vs. the other was put in some situation that allows you to measure aggressive behavior. So the scientist gets to ask questions about MAOA causing aggression, because the scientist is drawing inferences about how persons behave, and MAOA is a variable across those units (generic persons).

But a court is supposed to ask different kinds of causal questions. The court judges the actual individual before it. And the units are potential or actual actions of that specific person as he existed on the day of the alleged crime. The units are not members of the generic category of persons. Thus, the court should not be considering what would happen if the real Abdelmalek Bayout had been replaced by a hypothetical almost-Bayout with a minutely different genome. A scientist can go there, but a court cannot. Rather, the court’s counterfactual is a different behavior from the very same real-world Abdelmalek Bayout, i.e., a Bayout who didn’t stab anybody on that day in 2007. And if Bayout had not stabbed anybody, there’d be no murder. But since he did, he caused a murder.

Addendum: it’s a totally different question of whether we want to hold all persons to the same standards. For example, we have the insanity defense. But there, it’s not a question of causality. In fact, defendants who plead insanity have to stipulate to the causal question (e.g. in a murder trial, they have to acknowledge that the defendant’s actions caused the death of another). The question before the court basically becomes a descriptive question — is this person sane or insane? — not a causal one.