Someday, people who have lost their ability to speak may get their voice back. A new study demonstrates that electrical activity in the brain can be decoded and used to synthesize speech.
The study, published on Wednesday in Nature, reported data from five patients whose brains were already being monitored for epileptic seizures, with stamp-size arrays of electrodes placed directly on the surfaces of their brains.
As the participants read off hundreds of sentences—some from classic children's stories such as Sleeping Beauty and Alice in Wonderland—the electrodes monitored slight fluctuations in the brain's voltage, which computer models learned to correlate with their speech. This translation was accomplished through an intermediate step, which connected brain activity with a complex simulation of a vocal tract—a setup that builds on recent studies that found the brain's speech centers encode the movements of lips, tongue, and jaw.
“It's a very, very elegant approach,” says Christian Herff, a postdoctoral researcher at Maastricht University who studies similar brain-activity-to-speech methods.
The work marks the latest step in a rapidly developing effort to map the brain and engineer methods of decoding its activity. Just weeks ago, a separate team including Herff published a model in the Journal of Neural Engineering that also synthesized speech from brain activity using a slightly different approach, without the simulated vocal tract.
“Speech decoding is an exciting new frontier for brain-machine interfaces,” says the University of Michigan's Cynthia Chestek, who was not involved in either study. “And there is a subset of the population that has a really big use for this.”
Both teams, as well as other researchers around the world, hope to help people who have been robbed of their ability to speak by conditions such as amyotrophic lateral sclerosis (ALS)—the neurodegenerative disorder known as Lou Gehrig's disease—and strokes. Though their brains' speech centers remain intact, patients are left unable to communicate, locked away from the world around them.
Past efforts focused on harnessing brain activity to allow patients to spell out words one letter at a time. But these devices' typing speeds top out at around eight words per minute—nowhere near natural speech, which rushes by at around 150 words per minute.
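The gap between those two rates can be made concrete with a little arithmetic. The sketch below uses the article's figures of roughly eight and 150 words per minute; the 300-word message length and the function name are assumptions chosen only for illustration.

```python
# Rates quoted in the article (approximate).
SPELLER_WPM = 8    # letter-by-letter brain-controlled typing
SPEECH_WPM = 150   # natural speech


def minutes_to_convey(word_count: int, words_per_minute: float) -> float:
    """Return how many minutes it takes to convey a message at a given rate."""
    return word_count / words_per_minute


# Hypothetical message length, picked only to make the comparison tangible.
MESSAGE_WORDS = 300

spelling_minutes = minutes_to_convey(MESSAGE_WORDS, SPELLER_WPM)
speaking_minutes = minutes_to_convey(MESSAGE_WORDS, SPEECH_WPM)
speed_ratio = SPEECH_WPM / SPELLER_WPM

print(f"Spelling: {spelling_minutes} min, speaking: {speaking_minutes} min")
print(f"Natural speech is about {speed_ratio:.1f} times faster")
```

Under these assumptions, a message that takes two minutes to say aloud would take more than half an hour to spell out.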
“The brain is the most efficient machine that has evolved over millennia, and speech is one of the hallmarks of behavior of humans that sets us apart from even all the non-human primates,” says Nature study coauthor Gopala Anumanchipalli of the University of California, San Francisco. “And we take it for granted—we don’t even realize how complex this motor behavior is.”
While the studies' results are encouraging, it will take years of further work before the technology is made available for patients' use and adapted to languages other than English. And these efforts are unlikely to help people who have suffered damage to the speech centers of the brain, such as from some traumatic brain injuries or lesions. Researchers also stress that these systems do not equate to mind-reading: The studies monitored only the brain regions that orchestrate the vocal tract's movements during conscious speech.
“If I'm just thinking, 'Wow, this is a really tough day,' I'm not controlling my facial muscles,” says Herff. “Meaning is not what we are decoding here.”
Eavesdropping on the brain
To translate brain activity into sentences, Anumanchipalli and his colleagues used electrodes placed directly on the brain's surface. Though invasive, this direct monitoring is key to success. “Because the skull is really hard and it actually acts like a filter, it doesn’t let all the rich activity that's happening underneath come out,” Anumanchipalli says.
Once they had collected high-resolution data, the researchers piped the recorded signals through two artificial neural networks, computer models that roughly mimic brain processes to find patterns in complex data. The first network inferred how the brain was signaling the lips, tongue, and jaw to move. The second, trained on recordings of the participants' speech, converted these motions into synthetic speech.
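The two-stage design described above can be sketched in a few lines of code. This is a toy illustration, not the study's method: each neural network is replaced by a single fixed linear map, and the channel counts, weights, and function names are all hypothetical. It shows only the shape of the pipeline, in which brain signals first become vocal tract movements and those movements then become sound features.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, inputs: Vector) -> Vector:
    """Apply one linear map: each output is a weighted sum of the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Hypothetical weights; a real system would learn these from recordings.
# Stage 1: 4 electrode channels -> 3 articulator values (lips, tongue, jaw).
BRAIN_TO_ARTICULATORS: Matrix = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]

# Stage 2: 3 articulator values -> 2 acoustic features.
ARTICULATORS_TO_SOUND: Matrix = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
]


def decode_frame(electrode_voltages: Vector) -> Vector:
    """Decode one time step: brain signal -> articulator movement -> sound."""
    articulators = linear(BRAIN_TO_ARTICULATORS, electrode_voltages)
    return linear(ARTICULATORS_TO_SOUND, articulators)


def decode_recording(frames: List[Vector]) -> List[Vector]:
    """Decode a whole recording one frame at a time."""
    return [decode_frame(frame) for frame in frames]


# Two made-up frames of electrode readings.
recording = [
    [2.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
acoustic_features = decode_recording(recording)
print(acoustic_features)
```

Keeping the articulator step explicit means the intermediate values can be inspected on their own, which mirrors the article's point that the brain's speech centers encode movements rather than sounds.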
Next came the true test: Could other humans understand the synthetic speech? For answers, researchers recruited a group of 1,755 English speakers using Amazon's Mechanical Turk platform. Subgroups of these listeners were assigned to 16 different tasks to judge the intelligibility of both words and sentences.
Participants listened to 101 sentences of synthesized speech and then tried to transcribe what they heard, choosing from a group of 25 or 50 words. They were correct 43 percent of the time when choosing from 25 words and 21 percent of the time when choosing from 50.
Not every clip was equally intelligible. Some simple sentences, such as “Is this seesaw safe?,” got perfect transcriptions every time. But more complicated sentences, such as “At twilight on the twelfth day, we'll have Chablis,” came out perfectly less than 30 percent of the time.
Some sounds were also more easily decoded than others. Sustained signals, such as the sh in “ship,” came through the analysis cleanly, while sharp bursts of noise, such as the b in “bat,” were smoothed over and muddled.
While the output isn't perfect, Chestek points out that the data set used to train the system is still fairly small. “Arguably they’re still kind of operating with one hand behind their back because they’re limited to epilepsy surgeries and epilepsy patients,” she says, adding that potential future systems implanted solely for brain-to-speech translation could be slightly more optimized. “I’m cautiously very excited about this.”
The Nature study's authors used a two-step process to make their synthesized speech that much clearer. But in principle, it's feasible to go straight from brain activity to speech without using the simulated vocal tract as an in-between, as shown in the Journal of Neural Engineering study.
In that work, researchers recorded the brain activity and speech of six people undergoing surgery to remove brain tumors, using an on-brain electrode grid similar to the one in the Nature study. The team then trained a neural network to find the associations between each participant's spoken words and brain activity, designing the system so that it could work with just eight to 13 minutes of input audio—all the data they could collect mid-surgery.
“You just have to imagine how stressful the situation is: The surgeon opens up the skull and then places this electrode grid directly, and they do it to map where the cancer stops and where the important cortex [brain matter] starts,” says Herff. “Once they finish that, they have to calculate what to cut out—and during that interval, our data is being recorded.”
Researchers next fed the neural network's output into a program that converted it into speech. Unlike the Nature study, which attempted to synthesize full sentences, Herff and his colleagues focused on synthesizing individual words.
It's tough to directly compare how the two methods performed, emphasizes Northwestern University's Marc Slutzky, a coauthor of the Journal of Neural Engineering study. But they do show some similarities. “From the few metrics we used in common,” he says, “they seem to be somewhat similar in performance, at least for some of the subjects.”
Considerable hurdles remain before this technology ends up in the hands—or brains—of patients. For one, both studies' models are based on people who can still speak, and they haven't yet been tested in people who once spoke but no longer can.
“There's a very fundamental question ... whether or not the same algorithms will work,” says Nature study coauthor Edward Chang, a neurological surgery professor at the University of California, San Francisco. “But we're getting there; we're getting close to it.”
Anumanchipalli and his team tried to address this in some trials by training on participants who did not vocalize, but instead just silently mouthed sentences. While this successfully generated synthetic speech, the clips were less accurate than ones based on audibly spoken inputs. What's more, miming still requires patients to be able to move their face and tongue, which isn't a given for people with neurological conditions that limit their speech.
“For the patients that you’re most interested on [using] this in, it’s not really going to help,” Slutzky says of the miming trials. While he sees the work as a strong demonstration of current possibilities, the field as a whole still struggles to make the leap to people who no longer can speak.
The hope is that future brain-speech interfaces can adapt to their users as the users themselves adapt to the devices, and that users can retain control over the interfaces along with a semblance of the privacy that able-bodied people routinely enjoy in their speech. For instance, how do users maintain control over their data, such as the personalized vocabulary their systems build up over time?
“You can turn off that [smartphone] feature, but what if you don't have that physical control?” asks Melanie Fried-Oken, a speech-language pathologist at Oregon Health & Science University and expert on assistive speech technologies. “At what level do you want privacy and identity impinged upon for the function of communication? We don't know the answers.”
In decades to come, people with disorders like cerebral palsy, who often lack control of their speech muscles from early on, may well grow up with the devices from childhood—helping organize their brains for speech from the beginning.
“Wouldn’t it be great to be able to give this to a three-year-old who can now interact with the environment, who hasn't been able to do it yet?” Fried-Oken says. “Just like we're giving cochlear implants to [deaf] infants—the same! There's such potential here, but there's so many neuroethical issues.”