An interesting study of why unstructured interviews are so alluring

A while back I wrote about whether grad school admissions interviews are effective. Following up on that, Sam Gosling recently passed along an article by Dana, Dawes, and Peterson from the latest issue of Judgment and Decision Making:

Belief in the unstructured interview: The persistence of an illusion

Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy. In three studies, we investigated the propensity for “sensemaking” – the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution” – the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview. People form confident impressions even interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.

It’s an interesting study. In my experience people’s beliefs in unstructured interviews are pretty powerful — hard to shake even when you show them empirical evidence.

I did have some comments on the design and analyses:

1. In Studies 1 and 2, each subject made a prediction about absolute GPA for one interviewee. So estimates of how good people are at predicting GPA from interviews are based entirely on between-subjects comparisons. It is very likely that a substantial chunk of the variance in predictions will be due to perceiver variance, that is, differences between subjects in their implicit assumptions about how GPA is distributed. (E.g., Subject 1 might assume most GPAs range from 3 to 4, whereas Subject 2 assumes most GPAs range from 2.3 to 3.3. So even if they have the same subjective impression of the same target, “this person’s going to do great this term,” their numerical predictions might differ by a lot.) That perceiver variance would go into the denominator as noise variance in this study, lowering the interviewers’ predictive validity correlations.

Whether that’s a good thing or a bad thing depends on what situation you’re trying to generalize to. Perceiver variance would contribute to errors in judgment when each judge makes an absolute decision about a single target. On the other hand, in some cases perceivers make relative judgments about several targets, such as when an employer interviews several candidates and picks the best one. In that setting, perceiver variance would not matter, and a study with this design could underestimate accuracy.

2. Study 1 had 76 interviewers spread across 3 conditions (n = 25 or 26 per condition), and only 7 interviewees (each of whom was rated by multiple interviewers). Based on the 73 degrees of freedom reported for the test of the “dilution” effect (consistent with 76 interviewers minus 3 conditions), it looks like they treated interviewer as the unit of analysis but did not account for the dependency among interviewers who rated the same interviewee. Study 2 looked to have similar issues (though in Study 2 the dilution effect was not significant).

3. I also had concerns about power and precision of the estimates. Any inferences about who makes better or worse predictions will depend a lot on variance among the 7 interviewees whose GPAs were being predicted (8 interviewees in Study 2). I haven’t done a formal power analysis, but my intuition is that that’s pretty small. You can see a possible sign of this in one key difference between the studies. In Study 1, the correlation between the interviewees’ prior GPA and upcoming GPA was r = .65, but in Study 2 it was r = .37. That’s a pretty big difference between estimates of a quantity that should not be changing between studies.
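
To put rough numbers on points 1 and 3, here is a small Python sketch. It is not from the paper. The first function assumes a simple additive model in which each numerical prediction is a subjective impression plus a perceiver-specific offset that is unrelated to the target; the standard deviations passed to it are made-up illustration values. The second uses the Fisher z approximation for a confidence interval around a correlation. It assumes the two prior-GPA correlations were computed across the 7 and 8 interviewees, which is a guess about the analysis, and the approximation itself is crude at samples this small.

```python
import math


def attenuated_r(r_impression, sd_impression, sd_perceiver):
    """Correlation between numerical predictions and outcomes when each
    prediction = impression + an independent perceiver offset.

    r_impression is the correlation the impressions alone would have.
    All three inputs are hypothetical illustration values.
    """
    total_sd = math.sqrt(sd_impression ** 2 + sd_perceiver ** 2)
    return r_impression * sd_impression / total_sd


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r based on
    n pairs, using the Fisher z transformation (rough at small n)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Point 1: impressions that would correlate .50 with GPA, diluted by
# perceiver differences in how GPAs are assumed to be distributed.
print(round(attenuated_r(0.50, sd_impression=0.4, sd_perceiver=0.3), 2))

# Point 3: the two prior-GPA correlations, assuming n = 7 and n = 8.
for label, r, n in [("Study 1", 0.65, 7), ("Study 2", 0.37, 8)]:
    low, high = fisher_ci(r, n)
    print(f"{label}: r = {r:.2f}, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions both intervals run from below zero to above .8, so a gap like .65 versus .37 is well within what sampling noise alone could produce with so few interviewees.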

So it’s an interesting study but not one that can give answers I’d call definitive. If that’s well understood by readers of the study, I’m okay with that. Maybe someone will use the interesting ideas in this paper as a springboard for a larger followup. Given the ubiquity of unstructured interviews, it’s something we need to know more about.

The precisely fuzzy science of gaydar

Recently Dan Savage stirred up controversy by suggesting on his podcast that Marcus Bachmann, husband of Michele Bachmann and a therapist who tries to turn gays and lesbians into heterosexuals, is secretly gay. Since reparative therapy does not work and causes harm, Savage has ample grounds to criticize him for his therapeutic practice. However, Savage went beyond calling him a bad therapist and suggested that his voice, appearance, and mannerisms mark him as gay. In support, Savage cited research on “gaydar.” Here’s what he said:

People used to talk about gaydar and debate whether it was real or existed or not. Now there’s been all sorts of tests that actually people have really good gaydar, and they can look at a picture or listen to a clip of someone’s voice… and with eerie accuracy nail their actual sexual orientation.

I was going to write up a post on gaydar research (known in the biz as interpersonal perception of sexual orientation), but William Saletan over at Slate did a nice job of summarizing the key findings. Let me instead add some commentary.

The statistical tests in the studies show that the accuracy rates are better than a random guess. That’s a real finding: it tells us that there’s some information about sexual orientation available in thin slices of appearance and/or behavior. But what about Savage’s claims that people are “really good” and have “eerie accuracy”? That’s an effect size question. Saletan bungles this somewhat: he reports correlation coefficients of around .30 (erroneously capitalizing r) but, not knowing what to do with them, squares them to get variance explained. That makes the effect sound smallish (absent context, most people will think that 9% of something sounds small), but it doesn’t help much: a reader who doesn’t know what a correlation coefficient is won’t know what variance explained is either.

It turns out some of the articles do report accuracy rates as percentages. And even if you didn’t have those numbers, you could rough out the accuracy rates with a binomial effect size display (BESD). In a typical study, half of the targets are gay/lesbian and half are straight, so a purely random guesser (i.e., someone with no gaydar) would be right about 50% of the time. The reported accuracy rates in the articles, as well as the BESD conversion, say that people guess correctly about 65% of the time. Better than chance, but nowhere near perfect.
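
The BESD conversion is simple enough to sketch in a few lines of Python. This assumes the setup the display is built for (a 50/50 split of targets and a 50/50 split of guesses), which matches the typical study described above but is a simplification rather than anything the articles report.

```python
def besd(r):
    """Binomial effect size display: with half the targets in each group
    and half the guesses in each direction, a correlation r corresponds
    to hit rates of 0.50 + r/2 versus 0.50 - r/2."""
    return 0.50 + r / 2, 0.50 - r / 2


r = 0.30  # roughly the correlation Saletan reports
correct, incorrect = besd(r)
print(f"variance explained: {r ** 2:.0%}")
print(f"BESD accuracy: {correct:.0%} correct vs. {incorrect:.0%} incorrect")
```

The same r of .30 reads as 9% of variance on one scale and as 65% correct (against 50% by chance) on the other, which is why the percentage framing is easier for most readers to evaluate.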

In fact, you can go a step further and get Bayesian on the problem. Let’s assume that the 65% accuracy rate is symmetric, meaning that guessers are just as good at correctly identifying gays/lesbians as they are at identifying straight people. Let’s also assume that 5% of people are actually gay/lesbian. From those numbers, a quick calculation tells us that for a randomly-selected member of the population, if your gaydar says “GAY” there is a 9% chance that you are right. Eerily accurate? Not so much. If you rely too much on your gaydar, you are going to make a lot of dumb mistakes.
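
Here is that quick calculation as a short Python sketch. The function name is made up for illustration, and the inputs are just the two assumptions stated above (symmetric 65% accuracy, 5% base rate).

```python
def prob_gay_given_guess(accuracy, base_rate):
    """P(target is gay/lesbian | guesser says "GAY"), by Bayes' rule,
    assuming the same accuracy for gay/lesbian and straight targets."""
    true_pos = accuracy * base_rate               # gay/lesbian, guessed "GAY"
    false_pos = (1 - accuracy) * (1 - base_rate)  # straight, guessed "GAY"
    return true_pos / (true_pos + false_pos)


print(f"{prob_gay_given_guess(accuracy=0.65, base_rate=0.05):.1%}")
```

With these inputs the result comes out just under 9%. The answer is sensitive to the base rate: plugging in a hypothetical 10% instead of 5% roughly doubles it, to about 17%, which is still a long way from eerie.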

That calculation isn’t meant to be taken too seriously though, because it makes some other assumptions. For example, I’m assuming that the 65% accuracy rate in these controlled lab studies would apply to real-world guessing situations. But Saletan is spot-on in pointing out that all of the targets in the studies were out. (Depending on the study, stimuli were created from personal ads, from college undergraduates’ Facebook profiles that listed sexual orientation, or from grad students who were members of LGBT or public-service organizations.) An out, sexually active twentysomething probably wants others to know their orientation in a way that a middle-aged closeted person would not. How controllable are the signals that people send about their orientation? Researchers have been working on that question, by trying to isolate the various channels of information (voice, gesture, hairstyle, facial expressions and features, etc.). Looking at the studies I’d say the jury’s still out, but that at least some of the signals are indeed controllable.

Let me conclude by saying that these are common kinds of errors we make when jumping too quickly from controlled lab studies to real-world applications. As I said earlier, the finding that accuracy was significantly better than chance is a meaningful one. It tells us that there is information there. But the interpretation has to be a narrow and careful one, because the finding raises many more questions than it answers — questions about what the signals are, who is and isn’t sending them, under what circumstances they will and won’t be present, how and why they are being sent and received, and much more. It’s the start of an inquiry, not the end of one. And all too easy to overinterpret.

Want to make people cry? Try sad kids, sad animals, or sad animation

Among the difficulties of doing experimental research on emotions is getting people to have them in the lab, where you can study them up close. There are quite a few ways researchers try to elicit emotions — in fact, half of a recent book is dedicated to the topic.

One of the most common approaches is to show subjects film clips. In principle, film clips ought to have a lot of advantages for an experimenter. Unlike asking people to recall personal memories, film clips are standardized – everybody gets the same treatment, so there are no differences in the content of the emotion-eliciting stimulus. And film clips can be a lot more engrossing and evocative than other standardizable stimuli like pictures or music.

That’s the ideal. In practice, though, it can be very hard to find film clips that will elicit a similar reaction from lots of different people. One person’s tearjerker is another person’s boring chick-flick. In fact, when I was part of a team a few years back that was developing a set of new film clips to elicit sadness in the lab, the two female grad students who were trying to find the clips kept getting pilot data showing that the men were unmoved by anything. It turned out that the grads were picking clips that they personally found sad, which was all Beaches-style stuff about women’s relationships with women. We eventually had to ban anything with Susan Sarandon. The stuff that worked the best with everybody, men and women alike, turned out to be clips of sad kids and sad animals. (Futurama fans will know what I’m talking about. Two words: Jurassic Bark.)

Perhaps that shouldn’t have been too much of a surprise. At the time, the state of the art in sadness elicitation was a clip from The Champ where a seven-year-old Ricky Schroder watches his father die in front of him. That one still works well, and the other clips that ended up working had similar themes.

Now, according to a recent article in Time, it seems like we can add animated films to the list of guaranteed tear-elicitors. Apparently there was an epidemic of adults weeping at screenings of Toy Story 3. I haven’t seen that one, but I did see Up, and you’d have to be a psychopath not to at least well up a little bit during the flashback sequence. A filmmaker has an interesting theory on why that may be:

Lee Unkrich, who, having directed Toy Story 3, co-directed and edited Toy Story 2 and edited the original, is something of an expert; he has a few theories on why the latest film set people off. The most interesting is that animated movies can be more affecting than movies with real people in them. “Live action movies are someone else’s story,” he says. “With animation, audiences can’t think that. Their guards are down.” Because the characters are clearly not alive, he suggests counterintuitively, people identify with them more readily.

It’s an interesting explanation, and it becomes especially interesting when you try to extend it to animation of adult human characters (like Up) or live-action movies about kids. Why is a live Susan Sarandon perceived as “somebody else” by a substantial part of the audience (especially people of a different age group and gender), while audiences have no problem immersing themselves in an animated Carl and Ellie Fredricksen or a live, seven-year-old Ricky Schroder? What triggers that barrier with some people and drops it with others? The answers would reach far past methodological questions about how to elicit emotions in the lab, and get at basic questions of empathy and identity.

Is there anything special about the Five-Factor Model?

I recently put up a clip-job list of all the ideas I’ve been too busy or lazy to flesh out into real posts in the last month. One of the items was about a recent Psych Inquiry commentary I wrote in response to a piece by Jack Block. Tal actually read the commentary (thanks, Tal!) and commented:

…What I couldn’t really get a sense of from your paper is whether you actually believe there’s anything special about the FFM as distinct from any number of other models, or if you view it as just a matter of convenience that the FFM happens to be the most widely adopted model. I suspect Block would have said that even if you think the FFM is all in the eyes of the beholder, there’s still no good reason to think that it’s the right structure, and that with only slightly different assumptions and a slightly different historical trajectory, we could all have been working with a six or seven-factor model. So I guess my question would be: should one read the title of your paper as saying that the FFM is the model that describes the structure of social perceptions, or are you making a more general point about all psychometric models based on semantically-mediated observations?

That’s a great question.

As I think I make clear in the paper, I think it’s highly unlikely that the FFM is isomorphic with some underlying, extra-perceptual reality of bodies or behavior. In other words, I don’t expect we’ll find five brain systems whose functioning maps one-to-one onto the five factors. I could be wrong, but I have seen exactly zero evidence that makes me think that’s the case.

But since I argue in the paper that the FFM is a model of the social concerns of ordinary social perceivers, I think it’s fair to ask whether it’s isomorphic with something else. Like maybe there are five basic, universal social concerns that all humans share, or something like that. And my answer is… no, I don’t think so.

For one thing, I don’t think the cross-cultural evidence is strong enough to support that conclusion. (Being in the same department as Gerard Saucier has helped me see that.) McCrae and Costa have done a very good job of showing that the FFM can be exported to other cultures — if we give people the FFM as a meaning system, they’ll use it in roughly the way we expect. But emic studies have been a lot more varied.

I also am not convinced that factor analysis — a method that derives independent factors from between-person covariance structures — is the “true” way to model person perception and social meaning. Useful? As a way of deriving a descriptive/taxonomic model, absolutely. Orthogonal factor analysis has some very useful properties, like mapping a multidimensional space very efficiently. And there’s a consistent something behind that useful model, in the sense that something is causing that five-factor structure to replicate (conditional on the item selection procedures, samples from certain cultures, statistical assumptions, etc.).

But there’s no reason to think that that means the five-factor structure has a simple, one-to-one relationship to whatever reality it’s grounded in, whether the reality of target persons’ behavior or of perceivers’ concerns. Why would social concerns be orthogonal (and by implication, causally unrelated to one another)? Why, if these are major themes in human social concerns, don’t we have good words for them at the five-factor level of abstraction? (“Agreeableness”? Blech. Worst factor label ever.) Why do they emerge in the between-person covariance structure but not in experimental methods that probe social representation at the individual level (à la Dabady, Bell, & Kihlstrom, 1999)?

As to Tal’s last question (“are you making a more general point about all psychometric models based on semantically-mediated observations?”): I think I say this in the paper, but I don’t think there is, or ever will be, any structural model of personality that isn’t pivotally dependent on human perception and judgment. (Ouch, double negative. Put more straightforwardly: all models of personality depend on human interpretations of personality.) I have a footnote where I comment that the Q sort can be seen as a model of what Jack Block wants to know about persons. I’ll even extend that to models that use biological constructs as their units rather than linguistic ones, but maybe I’ll save that argument for another day…

New resource for interpersonal perception researchers

Via Dave Kenny, I just found out about a new set of resources for researchers interested in personality and social relationships — and especially for users of the Social Relations Model.

Persoc is a research network founded by a group of mostly German researchers, although they seem to be interested in bringing people together from all over. From their website:

In September 2007 a group of young researchers who repeatedly met at conferences realized that they were all fascinated by the complex interplay of personality and social relationships. While we studied the effects of personality on very different social processes (e.g., zero acquaintance judgments, group formation, friendship development, mate choice, relationship maintenance), we shared a strong focus on observing real-life phenomena and implementing advanced methods to analyze our data. Since the official start of Persoc in late 2008, several meetings and workshops have deepened both, our interconnectedness as well as our understanding and interest in personality and social relationships. Persoc is funded by the German Research Foundation (DFG).

Among other things, they have created an R package called TripleR for analyzing round-robin data using the SRM componential approach. TripleR is intended as an alternative to the venerable SOREMO software created by Kenny. The Persoc website also includes a page discussing theoretical concepts in interpersonal perception, an overview of a number of useful research designs, and other information.
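
For readers new to the componential approach, here is a rough Python sketch of the core idea: splitting round-robin ratings into perceiver effects and target effects. It is not TripleR's or SOREMO's code. It uses the standard round-robin estimators as I understand them (row and column means corrected for the missing self-rating diagonal), and the function name and the tiny ratings matrix are invented for illustration.

```python
def srm_effects(ratings):
    """Estimate SRM perceiver and target effects from a round-robin matrix.

    ratings[i][j] is perceiver i's rating of target j; the diagonal
    (self-ratings) is ignored. Requires at least 3 people.
    """
    n = len(ratings)
    if n < 3:
        raise ValueError("need at least 3 people in the round robin")
    # Means over the off-diagonal cells only.
    row = [sum(ratings[i][j] for j in range(n) if j != i) / (n - 1)
           for i in range(n)]
    col = [sum(ratings[j][i] for j in range(n) if j != i) / (n - 1)
           for i in range(n)]
    grand = sum(row) / n
    # Weights that correct for each person's missing self-rating.
    w_own = (n - 1) ** 2 / (n * (n - 2))
    w_other = (n - 1) / (n * (n - 2))
    w_grand = (n - 1) / (n - 2)
    perceiver = [w_own * row[i] + w_other * col[i] - w_grand * grand
                 for i in range(n)]
    target = [w_own * col[i] + w_other * row[i] - w_grand * grand
              for i in range(n)]
    return perceiver, target


# Hypothetical ratings from a 4-person group (None = no self-rating).
ratings = [
    [None, 5, 4, 6],
    [3, None, 2, 4],
    [4, 5, None, 5],
    [6, 7, 5, None],
]
perceiver, target = srm_effects(ratings)
print([round(p, 2) for p in perceiver])
print([round(t, 2) for t in target])
```

On data that are purely additive (a grand mean plus a perceiver effect plus a target effect), these formulas recover the effects exactly. With real ratings, what is left over is relationship-specific variance plus error, and a package like TripleR handles the variance partitioning and significance tests that this sketch leaves out.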

Do people know how much power and status they have?

Do you know how much power and status you have in the important social situations in your life? Cameron Anderson and I have a chapter coming out in a few months looking at that question. The chapter is titled “Accurate When It Counts: Perceiving Power and Status in Social Groups.” (It draws in part on an earlier empirical paper we did together.) The part before the colon probably gives away a little bit of the answer. We present a case that most people, much of the time, are pretty good at perceiving their own and others’ power and status. (Better than they are at perceiving likability or personality traits.)

You can read the chapter if you want to see where the main point is coming from. I just want to briefly comment on a preliminary issue we had to develop along the way…

One of the fun things about writing this paper was working out what it means to be accurate in perceiving power and status. Accuracy has a long and challenging history in social perception research. How do you quantify how well somebody knows somebody else’s (or their own) likability, extraversion, morality, or — in our case — power or status?

We started by creating working definitions of power and status. What became clear along the way is that the accuracy question gets answered differently for power than for status because of the different definitions. For power, we adopted Susan Fiske’s definition that power is asymmetric outcome control (in a nutshell, Person A has power over Person B if A has control over B’s valued outcomes). For status, we defined it as respect and influence in the eyes of others.

Drawing on those definitions, here’s what we say about how to define accuracy in perceiving power:

The outcome-control framework is useful for studying perceptions. Outcome control is a structural property of relationships that does not depend on any person’s construal of a situation. Thus, one person may have power over another person even if one or both people do not realize it at a given time. (For example, a late-night TV host and the female intern he dates might both think about their relationship in purely romantic terms, but the fact that the host makes decisions about the intern’s salary and career advancement means that he has power over her). Because the outcome-control framework separates psychological processes such as the perception of power from power per se, it is conceptually coherent to ask questions about the accuracy of perceptions.

And here’s how accuracy is different for status:

Like power, status is a feature of a relationship (Fiske & Berdahl, 2007). Like power, status may vary from one situation to another. And like with power, it is possible for a single individual to misperceive her own status or the status of another person. However, because status is about respect and prestige in the eyes of others, at its core it involves collective perceptions – that is, status is a component of reputation. Thus status is socially constructed in a different and perhaps more fundamental way than power. Whereas it might make sense to say that an individual has power but nobody knows it, it would not make sense to say the same about status. This gives status a complicated but necessary relation to interpersonal perceptions, which will become important when we consider what it means to be accurate in perceiving status.

On a side note: egads, am I becoming a social constructivist?


Srivastava, S. & Anderson, C. (in press). Accurate when it counts: Perceiving power and status in social groups. In J. L. Smith, W. Ickes, J. Hall, S. D. Hodges, & W. Gardner (Eds.), Managing interpersonal sensitivity: Knowing when—and when not—to understand others.

Perceiver effects in interpersonal perception

Hot off the presses is a paper I wrote with Steve Guglielmo and Jenni Beer on perceiver effects in the Social Relations Model. Here’s the abstract:

In interpersonal perception, “perceiver effects” are tendencies of perceivers to see other people in a particular way. Two studies of naturalistic interactions examined perceiver effects for personality traits: seeing a typical other as sympathetic or quarrelsome, responsible or careless, and so forth. Several basic questions were addressed. First, are perceiver effects organized as a global evaluative halo, or do perceptions of different traits vary in distinct ways? Second, does assumed similarity (as evidenced by self-perceiver correlations) reflect broad evaluative consistency or trait-specific content? Third, are perceiver effects a manifestation of stable beliefs about the generalized other, or do they form in specific contexts as group-specific stereotypes? Findings indicated that perceiver effects were better described by a differentiated, multidimensional structure with both trait-specific content and a higher order global evaluation factor. Assumed similarity was at least partially attributable to trait-specific content, not just to broad evaluative similarity between self and others. Perceiver effects were correlated with gender and attachment style, but in newly formed groups, they became more stable over time, suggesting that they grew dynamically as group stereotypes. Implications for the interpretation of perceiver effects and for research on personality assessment and psychopathology are discussed.

A couple of quick comments to add:

  • This is an example of using the Big Five / Five-Factor Model not as a model of personality per se, but as a model of social perception. I very briefly mention this potential use of the Big Five in my guide to measuring the Big Five, and I’m currently working on a manuscript expanding on this idea. (BTW, I’m certainly not the first person to think of the Big Five in this way. I’m trying to carry this idea forward a bit, but it’s one of those cases where I oscillate between thinking what I’m saying about it is radically new and thinking ho-hum-we-already-thought-of-that.)
  • While we were working on this manuscript, I became aware that a group led by Dustin Wood was looking at very similar issues (but with some interesting differences in approach and areas of non-overlap). They’ve got a paper in press at JPSP.

If you want to read more you can download the PDF:

Srivastava, S., Guglielmo, S., & Beer, J. S. (2010). Perceiving others’ personalities: Examining the dimensionality, assumed similarity to the self, and stability of perceiver effects. Journal of Personality and Social Psychology, 98, 520-534.