Norms for the Big Five Inventory and other personality measures

Every once in a while I get emails asking me about norms for the Big Five Inventory. I got one the other day, and I figured that if more than one person has asked about it, it’s probably worth a blog post.

There’s a way of thinking about norms — which I suspect is the most common way of thinking about norms — that treats them as some sort of absolute interpretive framework. The idea is that you could tell somebody, hey, if you got this score on the Agreeableness scale, it means you have this amount of agreeableness.

But I generally think that’s not the right way of thinking about it. Lew Goldberg put it this way:

One should be very wary of using canned “norms” because it isn’t obvious that one could ever find a population of which one’s present sample is a representative subset. Most “norms” are misleading, and therefore they should not be used.

That is because “norms” are always calculated in reference to some particular sample, drawn from some particular population (which BTW is pretty much never “the population of all human beings”). Norms are most emphatically NOT an absolute interpretation — they are unavoidably comparative.

The problem arises because the usual way people talk about norms tends to bury that fact. People say, oh, you scored at the 70th percentile. They don’t go on to say the 70th percentile of what. For published scales that give normed scores, it often turns out to mean the 70th percentile of the distribution of people who somehow made it into the scale author’s convenience sample 20 years ago.

So what should you do to help people interpret their scores? Lew’s advice is to use the sample you have at hand to construct local norms. For example, if you’re giving feedback to students in a class, tell them their percentile relative to the class.
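A local percentile is nothing more than a person's rank within the sample at hand. Here is a minimal sketch in Python; the function name and the class scores are my own invention, purely for illustration:

```python
def local_percentile(score, sample_scores):
    """Percentile of `score` relative to the sample at hand (a local norm).

    Defined here as the percentage of sample scores at or below the
    given score. Other rank-based definitions exist; this is one
    common, simple choice.
    """
    at_or_below = sum(s <= score for s in sample_scores)
    return 100.0 * at_or_below / len(sample_scores)

# Hypothetical Agreeableness raw scores for a class of ten students
class_scores = [3.1, 3.4, 3.6, 3.8, 3.9, 4.0, 4.1, 4.2, 4.4, 4.7]

print(local_percentile(3.9, class_scores))  # standing relative to *this class*
```

The feedback sentence then writes itself: "you scored higher than X% of the people in this class," with the comparison group stated explicitly.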

Another approach is to use distributional information from an existing dataset and just be explicit about what comparison you are making and where the data come from. For the BFI, I sometimes refer people to a large dataset of adult American Internet users that I used for a paper. Sample descriptives are in the paper, and we’ve put up a table of means and SDs broken down by age and gender for people who want to make those finer distinctions. You can then use those means and SDs to convert your raw scores into z-scores, and then calculate or look up the normal-distribution percentile. You would then say something like, “This is where you stand relative to a bunch of Internet users who took this questionnaire online.” (You don’t have to use that dataset, of course. Think about what would be an appropriate comparison group and then poke around Google Scholar looking for a paper that reports descriptive statistics for the kind of sample you want.)
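The raw-score-to-percentile conversion is two lines of arithmetic. A sketch using only the Python standard library; the reference mean and SD below are placeholders, not values from any actual dataset, so substitute the descriptives reported for whatever comparison sample you settle on:

```python
from statistics import NormalDist

def percentile_vs_reference(raw_score, ref_mean, ref_sd):
    """Convert a raw score to a normal-theory percentile relative to a
    chosen reference sample (assumes scores are roughly normal)."""
    z = (raw_score - ref_mean) / ref_sd   # standardize against the reference
    return 100.0 * NormalDist().cdf(z)    # percentile under the normal curve

# Hypothetical reference descriptives -- replace with the published
# mean/SD for your chosen comparison group.
ref_mean, ref_sd = 3.6, 0.7
print(round(percentile_vs_reference(4.3, ref_mean, ref_sd)))  # z = 1.0
```

Note the normality assumption: for markedly skewed scales, an empirical percentile from the raw distribution (if the source reports one) is the safer choice.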

Either the “local norms” approach or the “comparison sample” approach can work for many situations, though local norms may be difficult for very small samples. If the sample as a whole is unusual in some way, the local norms will remove the average “unusualness” whereas the comparison-sample approach will keep it in there, and you can decide which is the more useful comparison. (For example, an astronaut who scores in the 50th percentile of conscientiousness relative to other astronauts would be around the 93rd percentile relative to college undergrads.) But the most important thing is to avoid anything that sounds absolute. Be consistent and clear about the fact that you are making comparisons and about who you are comparing somebody to.

Does psilocybin cause changes in personality? Maybe, but not so fast

This morning I came across a news article about a new study claiming that psilocybin (the active ingredient in hallucinogenic mushrooms) causes lasting changes in personality, specifically the Big Five factor of openness to experience.

It was hard to make out methodological details from the press report, so I looked up the journal article (gated). The study, by Katherine MacLean, Matthew Johnson, and Roland Griffiths, was published in the Journal of Psychopharmacology. When I read the abstract I got excited. Double blind! Experimentally manipulated! Damn, I thought, this looks a lot better than I thought it was going to be.

The results section was a little bit of a letdown.

Here’s the short version: Everybody came in for 2 to 5 sessions. In session 1 some people got psilocybin and some got a placebo (the placebo was methylphenidate, a.k.a., Ritalin; they also counted as “placebos” some people who got a very low dose of psilocybin in their first session). What the authors report is a significant increase in NEO Openness from pretest to after the last session. That analysis is based on the entire sample of N=52 (everybody got an active dose of psilocybin at least once before the study was over). In a separate analysis they report no significant change from pretest to after session 1 for the n=32 people who got the placebo first. So they are basing a causal inference on the difference between significant and not significant. D’oh!

To make it (even) worse, the “control” analysis had fewer subjects, hence less power, than the “treatment” analysis. So it’s possible that openness increased as much in the placebo contrast as it did in the psilocybin contrast, or even more. (My hunch is that’s not what happened, but it’s not ruled out. They didn’t report the means.)

None of this means there is definitely no effect of psilocybin on Openness; it just means that the published paper doesn’t report an analysis that would answer that question. I hope the authors, or somebody else, come back with a better analysis. (A simple one would be a 2×2 ANOVA comparing pretest versus post-session-1 for the placebo-first versus psilocybin-first subjects. A slightly more involved analysis might involve a multilevel model that could take advantage of the fact that some subjects had multiple post-psilocybin measurements.)
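The key test in that 2×2 design is the group-by-time interaction, which (with two groups and two time points) is equivalent to comparing pre-to-post change scores between the placebo-first and psilocybin-first groups. A sketch of that logic, with entirely made-up scores (the paper does not report these means) and a hand-computed Welch t statistic so the example stays stdlib-only:

```python
from statistics import mean, variance

# Hypothetical pretest / post-session-1 Openness T-scores, invented
# purely to illustrate the analysis -- not data from the study.
placebo_pre,  placebo_post = [58, 62, 60, 64, 61], [59, 63, 60, 65, 62]
psilo_pre,    psilo_post   = [60, 63, 59, 65, 62], [64, 66, 62, 68, 65]

# The 2 (group) x 2 (time) interaction = between-group difference in
# pre-to-post change scores.
placebo_change = [post - pre for pre, post in zip(placebo_pre, placebo_post)]
psilo_change   = [post - pre for pre, post in zip(psilo_pre, psilo_post)]

diff = mean(psilo_change) - mean(placebo_change)

# Welch t statistic for the difference in mean change; the p-value would
# come from a t distribution, omitted here for brevity.
se = (variance(placebo_change) / len(placebo_change)
      + variance(psilo_change) / len(psilo_change)) ** 0.5
t = diff / se
print(round(diff, 2), round(t, 2))
```

In practice you would run this as a mixed ANOVA or a multilevel model (which could also use the extra post-psilocybin measurements), but the change-score comparison captures the inference the published analysis skipped: testing the difference directly rather than comparing a significant result to a non-significant one.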

Aside from the statistics, I had a few observations.

One thing you’d worry about with this kind of study – where the main DV is self-reported – is demand or expectancy effects on the part of subjects. I know it was double-blind, but they might have a good idea about whether they got psilocybin. My guess is that they have some pretty strong expectations about how shrooms are supposed to affect them. And these are people who volunteered to get dosed with psilocybin, so they probably had pretty positive expectations. I wouldn’t call the self-report issue a dealbreaker, but in a followup I’d love to see some corroborating data (like peer reports, ecological momentary assessments, or a structured behavioral observation of some kind).

On the other hand, they didn’t find changes in other personality traits. If the subjects had a broad expectation that psilocybin would make them better people, you would expect to see changes across the board. But if their expectations were focused specifically on Openness-related traits, that reassurance carries less weight.

If you accept the validity of the measures, it’s also noteworthy that they didn’t get higher in neuroticism — which is not consistent with what the government tells you will happen if you take shrooms.

One of the most striking numbers in the paper is the baseline sample mean on NEO Openness — about 64. That is a T-score (normed [such as it is] to have a mean = 50, SD = 10). So that means that in comparison to the NEO norming sample, the average person in this sample was about 1.4 SDs above the mean — which is above the 90th percentile — in Openness. I find that to be a fascinating peek into who volunteers for a psilocybin study. (It does raise questions about generalizability though.)
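The arithmetic behind that claim is straightforward, since T-scores are defined against the norming sample's mean of 50 and SD of 10:

```python
from statistics import NormalDist

t_score = 64                 # the sample's baseline mean NEO Openness T-score
z = (t_score - 50) / 10      # T-score convention: mean = 50, SD = 10
pct = 100 * NormalDist().cdf(z)
print(z, round(pct, 1))      # 1.4 SDs above the norming mean, ~92nd percentile
```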

Finally, because psilocybin was manipulated within subjects, the long-term (one year-ish) followup analysis did not have a control group. Everybody had been dosed. They predicted Openness at one year out based on the kinds of trip people reported (people who had a “complete mystical experience” also had the sustained increase in openness). For a much stronger inference, of course, you’d want to manipulate psilocybin between subjects.

Is there anything special about the Five-Factor Model?

I recently put up a clip-job list of all the ideas I’ve been too busy or lazy to flesh out into real posts in the last month. One of the items was about a recent Psych Inquiry commentary I wrote in response to a piece by Jack Block. Tal actually read the commentary (thanks, Tal!) and commented:

…What I couldn’t really get a sense of from your paper is whether you actually believe there’s anything special about the FFM as distinct from any number of other models, or if you view it as just a matter of convenience that the FFM happens to be the most widely adopted model. I suspect Block would have said that even if you think the FFM is all in the eyes of the beholder, there’s still no good reason to think that it’s the right structure, and that with only slightly different assumptions and a slightly different historical trajectory, we could all have been working with a six or seven-factor model. So I guess my question would be: should one read the title of your paper as saying that the FFM is the model that describes the structure of social perceptions, or are you making a more general point about all psychometric models based on semantically-mediated observations?

That’s a great question.

As I think I make clear in the paper, I think it’s highly unlikely that the FFM is isomorphic with some underlying, extra-perceptual reality of bodies or behavior. In other words, I don’t expect we’ll find five brain systems whose functioning maps one-to-one onto the five factors. I could be wrong, but I have seen exactly zero evidence that makes me think that’s the case.

But since I argue in the paper that the FFM is a model of the social concerns of ordinary social perceivers, I think it’s fair to ask whether it’s isomorphic with something else. Like maybe there are five basic, universal social concerns that all humans share, or something like that. And my answer is… no, I don’t think so.

For one thing, I don’t think the cross-cultural evidence is strong enough to support that conclusion. (Being in the same department as Gerard Saucier has helped me see that.) McCrae and Costa have done a very good job of showing that the FFM can be exported to other cultures — if we give people the FFM as a meaning system, they’ll use it in roughly the way we expect. But emic studies have been a lot more varied.

I also am not convinced that factor analysis — a method that derives independent factors from between-person covariance structures — is the “true” way to model person perception and social meaning. Useful? As a way of deriving a descriptive/taxonomic model, absolutely. Orthogonal factor analysis has some very useful properties, like mapping a multidimensional space very efficiently. And there’s a consistent something behind that useful model, in the sense that something is causing that five-factor structure to replicate (conditional on the item selection procedures, samples from certain cultures, statistical assumptions, etc.).

But there’s no reason to think that that means the five-factor structure has a simple, one-to-one relationship to whatever reality it’s grounded in — whether the reality of target persons’ behavior or of perceivers’ concerns. Why would social concerns be orthogonal (and by implication, causally unrelated to one another)? Why, if these are major themes in human social concerns, don’t we have good words for them at the five-factor level of abstraction? (“Agreeableness”? Blech. Worst factor label ever.) Why do they emerge in the between-person covariance structure but not in experimental methods that probe social representation at the individual level (à la Dabady, Bell, & Kihlstrom, 1999)?

As to Tal’s last question (“are you making a more general point about all psychometric models based on semantically-mediated observations?”): I think I say this in the paper, but I don’t think there is, or ever will be, any structural model of personality that isn’t pivotally dependent on human perception and judgment. (Ouch, double negative. Put more straightforwardly: all models of personality depend on human interpretations of personality.) I have a footnote where I comment that the Q sort can be seen as a model of what Jack Block wants to know about persons. I’ll even extend that to models that use biological constructs as their units rather than linguistic ones, but maybe I’ll save that argument for another day…

The Five-Factor Model in the DSM-5

Via Neuroskeptic, I just found out that the Big Five have been proposed to appear (sorta) in the DSM-5.

The current Axis II disorders will be replaced by a mixture of continuously-rated personality disorder types (carrying forward psychopathy, avoidant, borderline, obsessive-compulsive, and schizotypal) and six personality traits. According to the rationale, four of the six traits are pathological versions of four of the Big Five (Openness/Intellect apparently doesn’t have a pathological extreme).

I need to read more about it, but it’s not clear to me how redundant the types and traits will be, and whether that’s by design. For example, the typology includes a schizotypal type, and the trait space includes a schizotypy dimension (the latter based on David Watson’s work suggesting that trait terms referring to oddness/eccentricity should not have been excluded from the lexical sampling that produced the Big Five). Both are continuously rated — will they provide complementary information, or will they just say the same thing?

One good thing, though, is the shift toward using continuous ratings rather than yes/no categories. This will potentially create practical problems for the healthcare system (if something is continuous, at what point do you decide that insurance will reimburse treatment?), but scientifically it is better aligned with what we know about the underlying nature of personality and personality disorders.