Artificial Intelligence Generates Humans’ Faces Based on Their Voices
In trials, the algorithm successfully pinpointed speakers’ gender, race and age
A new neural network developed by researchers from the Massachusetts Institute of Technology is capable of constructing a rough approximation of an individual’s face based solely on a snippet of their speech, reports a paper published on the pre-print server arXiv.
The team trained the artificial intelligence tool—a machine learning algorithm programmed to “think” much like the human brain—with the help of millions of online clips capturing more than 100,000 different speakers. Dubbed Speech2Face, the neural network used this dataset to determine links between vocal cues and specific facial features; as the scientists write in the study, age, gender, the shape of one’s mouth, lip size, bone structure, language, accent, speed and pronunciation all factor into the mechanics of speech.
According to Gizmodo’s Melanie Ehrenkranz, Speech2Face draws on associations between appearance and speech to generate photorealistic renderings of front-facing individuals with neutral expressions. Although these images are too generic to identify as a specific person, the majority of them accurately pinpoint speakers’ gender, race and age.
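The pipeline the researchers describe can be pictured as two pieces: an encoder that compresses a speech clip into a numerical "face embedding," and a decoder that renders that embedding as a frontal, neutral face. The short PyTorch sketch below illustrates that structure only; the module names, layer sizes, and 512-dimensional embedding are illustrative assumptions, not the MIT team's actual architecture.

```python
# A minimal, illustrative sketch of the voice-to-face idea described above,
# not the authors' code. All module names, layer sizes, and the 512-dim
# embedding are assumptions made for illustration.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a speech spectrogram to a compact 'face embedding' vector."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse the time/frequency axes
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, spectrogram):  # shape: (batch, 1, freq, time)
        features = self.conv(spectrogram).flatten(1)
        return self.fc(features)     # shape: (batch, embed_dim)

class FaceDecoder(nn.Module):
    """Renders the embedding as a small frontal, neutral-expression face."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, embedding):
        x = self.fc(embedding).view(-1, 128, 8, 8)
        return self.deconv(x)        # shape: (batch, 3, 32, 32) RGB image

# During training, the voice embedding would be pushed toward the features a
# face-recognition network extracts from the speaker's own video frame, so the
# decoder learns to render only traits the voice predicts (age, gender, broad
# facial structure) rather than the speaker's exact identity.
encoder, decoder = VoiceEncoder(), FaceDecoder()
fake_spectrogram = torch.randn(1, 1, 257, 601)  # assumed shape for a few seconds of audio
face = decoder(encoder(fake_spectrogram))
print(face.shape)  # torch.Size([1, 3, 32, 32])
```

The key design point, per the study, is that the model is never asked to recover a specific person's portrait, only the demographic and craniofacial traits that correlate with how someone sounds.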
Interestingly, as Jackie Snow explains for Fast Company, the new research not only builds on previous work predicting age and gender from speech, but also spotlights links between voice and “craniofacial features” such as nose structure.
The authors add, “This is achieved with no prior information or the existence of accurate classifiers for these types of fine geometric features.”
Still, the algorithm has its flaws. As Live Science’s Mindy Weisberger notes, the model has trouble analyzing language variations. When played an audio clip of an Asian man speaking Chinese, for example, Speech2Face produced a face of the correct ethnicity, but when the same individual was recorded speaking English, the AI generated an image of a white man.
In other cases, high-pitched male voices, including those of children, were erroneously identified as female, revealing the model’s gender bias in associating low-pitched voices with men and high-pitched ones with women. And because the training data was largely derived from educational videos posted on YouTube, the researchers further point out that the algorithm fails to “represent equally the entire world population.”
According to Slate’s Jane C. Hu, the legality of using YouTube videos for scientific research is fairly clear-cut. Such clips are considered publicly available information; even if a user copyrights their videos, scientists can include the materials in their experiments under a “fair use” clause.
But the ethics of this practice are less straightforward. Speaking with Hu, Nick Sullivan, head of cryptography at Cloudflare, said he was surprised to see a photo of himself featured in the MIT team’s study, as he had never signed a waiver or heard directly from the researchers. Although Sullivan tells Hu it would have been “nice” to be notified of his inclusion in the database, he acknowledges that given the sheer size of the data pool, it would be difficult for the scientists to reach out to everyone depicted.
At the same time, Sullivan concludes, “Since my image and voice were singled out as an example in the Speech2Face paper, rather than just used as a data point in a statistical study, it would have been polite to reach out to inform me or ask for my permission.”
One potential real-world application for Speech2Face is using the model to “attach a representative face” to phone calls on the basis of a speaker’s voice. Snow adds that voice recognition technology is already used across a number of fields—often without individuals’ express knowledge or consent. Last year, Chase launched a “Voice ID” program that learns to recognize credit card customers calling the bank, while correctional institutions across the country are building databases of incarcerated individuals’ “voiceprints.”