The easiest way to talk to someone else is face-to-face. If you can see the movements of a person’s lips and facial muscles, you can more easily work out what they’re saying, a fact made obvious if you’re trying to have a conversation in a noisy environment. These visual cues clue our brains in on how best to interpret the signals coming from our ears.
But what happens when that’s not possible, like when you’re chatting on the phone or listening to a recorded message? New research suggests that if you’ve spoken to someone before, your brain uses memories of their face to help decode what they’re saying when they’re not in front of you. Based on previous experience, it runs a simulation of the speaker’s face to fill in any information missing from the sound stream alone.
These results contradict a classical theory about hearing – the “auditory-only model” – which suggests that the brain deciphers the spoken word using only the signals it receives from the ears. The model has been challenged before, by earlier studies which found that people are better at identifying a speaker by voice if they have briefly seen that person speaking before. Katherina von Kriegstein from University College London extended these discoveries by showing that previous experience also helps us to work out what’s being said, as well as who said it.
She trained 34 volunteers to identify six male speakers by voice and name. The volunteers saw videos of three of the speakers as they talked, but the other three remained faceless, represented only by a drawing of their occupation. As a further twist, half of the volunteers had a condition called prosopagnosia, or face blindness, which prevents them from recognising faces but has no effect on their ability to recognise objects in general.
After the training, Kriegstein tested the volunteers while they lay inside a functional magnetic resonance imaging (fMRI) scanner. They listened to short recordings of one of the six speakers and had to work out either who was speaking (“speaker recognition”) or what they were saying (“speech recognition”).
Kriegstein found that both prosopagnosics and controls were slightly better at recognising speech when they had seen the speaker’s face before. The improvement was small – between 1 and 2% – but that is still significant given that the typical success rate for this task is greater than 90%. But of the two groups, only the controls were better at recognising speakers after seeing videos of them beforehand. They were 5% more accurate, while the prosopagnosics didn’t benefit at all.
Using the fMRI scanner, Kriegstein found that these ‘face benefits’ were reflected by the strength of neural activity in two parts of the brain. The first, the superior temporal sulcus (STS), detects facial movements (among other biological motion) of the kind that we use to help us make out the words of a person speaking in front of us. The stronger the activity in a volunteer’s STS, the more benefit they gained in the speech recognition task from having seen videos of the speakers.
The second area, the fusiform face area (FFA), specialises in recognising faces and is often damaged in prosopagnosics. Unlike the STS, it played more of a role in the speaker recognition task, but only the controls were more accurate at identifying speakers if they had strong activity in the FFA. So two separate networks that are involved in facial processing are active even when there are no faces to process.
Kriegstein concluded that people pick up key visual elements of a stranger’s speech after less than two minutes of watching them talk, and use these to store ‘facial signatures’ of new speakers. The brain effectively uses these signatures to run ‘talking face’ simulations, to better decipher any voice it hears. It’s one of the reasons why phone conversations are easier if you’ve previously met, in the flesh, the person at the other end of the line.
Image: by Xenia