A.I. Learns Words From a Human Baby’s Perspective, Using Headcam Footage
With only limited training, the model could correctly identify certain objects, suggesting that some elements of language learning may not depend on innate abilities
Researchers have long wondered whether human babies have an innate ability to learn language that helps them grasp words’ meanings and ultimately comprehend sentences. Most language-processing artificial intelligence models, such as ChatGPT, are trained on millions, or even trillions, of words before they can function. Children, however, learn the basics of language after hearing far fewer words.
But a recent study suggests A.I. might also be able to acquire some language with a smaller set of clues. Trained on visuals and words captured by a headcam on a young child, an artificial intelligence model learned to correctly match some images of objects with their names, researchers report this month in the journal Science.
“Today’s models don’t need as much input as they’re getting in order to make meaningful generalizations,” Brenden Lake, a co-author of the study and computational cognitive scientist at New York University (NYU), tells Scientific American’s Lauren Leffer. “We showed, for the first time, that you can train an A.I. model to learn words through the eyes and ears of a single child.”
The A.I.’s success with this limited input suggests an innate understanding may not be needed for some aspects of language acquisition.
“That, for me, really does shake up my worldview,” Jess Sullivan, a developmental psychologist at Skidmore College who helped collect the data but did not participate in the new study, tells MIT Technology Review’s Cassandra Willyard.
The study took advantage of an existing library of camera footage that captured one baby’s experience. A child named Sam wore a headcam for 61 hours between the ages of six months and 25 months, representing about 1 percent of his waking hours. The camera recorded video and audio while the child engaged in activities including playing, eating and reading.
The recordings captured roughly 250,000 words. The researchers fed their model individual video frames paired with transcripts of the speech the child heard at those moments.
Importantly, the model was just a generic, simple neural network. “There’s not anything inbuilt into the network giving the model clues about language or how language ought to be structured,” Wai Keen Vong, a co-author of the study and research scientist at NYU, tells the Washington Post’s Carolyn Y. Johnson.
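Conceptually, that kind of associative learning can be implemented with a contrastive objective: embed each frame and each co-occurring utterance, then pull matching pairs together and push mismatched pairs apart. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors’ actual code; the encoders, sizes and names are placeholders.

```python
# Minimal, hypothetical sketch of associative frame-utterance learning with a
# contrastive objective; encoder choices, sizes and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameUtteranceMatcher(nn.Module):
    def __init__(self, vocab_size, embed_dim=512):
        super().__init__()
        # Vision side: a tiny CNN that maps a video frame to an embedding.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Language side: average the embeddings of the words in an utterance.
        self.word_embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")

    def forward(self, frames, word_ids, offsets):
        img = F.normalize(self.vision(frames), dim=-1)
        txt = F.normalize(self.word_embed(word_ids, offsets), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    # A frame should be most similar to the utterance heard at the same moment;
    # every mismatched frame-utterance pair in the batch serves as a negative.
    logits = img @ txt.t() / temperature
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Nothing in such a setup encodes grammar or word meanings in advance; whatever associations the model forms come from the statistics of what the child saw and heard.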
Children start learning their first words between the ages of six months and nine months—and making these connections is a complex problem. For instance, babies have to learn that the word “cup” refers to a container for holding a drink, rather than, more generally, any object with a hole in it or any object that’s the same color as their cup, Vong explains in a video.
“There’s an infinite number of possible meanings for any word,” Lake tells MIT Technology Review.
Despite the challenge of its task, the A.I. had some success. To test its comprehension, the researchers presented the model with four images from the headcam footage and asked it which image corresponded to a specific word. It identified the correct image 62 percent of the time, according to Nature News’ Elizabeth Gibney.
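Chance performance on that four-image test would be 25 percent. Reusing the hypothetical model sketched above, the evaluation amounts to embedding the probe word and each candidate frame, then picking the frame closest to the word:

```python
# Hedged sketch of the four-image test: embed the probe word and each of the
# four candidate frames, then choose the frame most similar to the word.
def four_way_choice(model, word_id, candidate_frames):
    with torch.no_grad():
        txt = F.normalize(
            model.word_embed(torch.tensor([word_id]), torch.tensor([0])), dim=-1)
        img = F.normalize(model.vision(candidate_frames), dim=-1)
        scores = (img @ txt.t()).squeeze(-1)  # similarity of each frame to the word
    return scores.argmax().item()             # index of the model's chosen image
```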
The model was also able to match words it had learned to representative images that it had never seen before, such as a generic image of an apple, about 35 percent of the time.
The model frequently recognized words including “car” and “crib,” but it struggled with words whose referents vary more in appearance, such as “room” and “toy.”
The findings demonstrate how some aspects of word meaning are learnable by association, the study authors write.
“I was one of the people who thought that the problem of learning language is infinitely complex and that it wouldn’t be possible to learn a word’s meaning without having some specific machinery built into your mind,” Sullivan tells Scientific American. “Now I see that, in at least one case, it is possible.”
Still, some scientists point to a few drawbacks of the research. “There are still some things it’s harder to conclude from the paper exactly—what this tells us about how children actually learn words is less clear,” Joshua Tenenbaum, a computational cognitive scientist at MIT who was not involved in the research, tells the Washington Post.
The study also focused solely on the names of objects; it didn’t show that the A.I. could learn about verbs, sentence structure and other aspects of language from the headcam footage, Eva Portelance, a computational linguistics researcher at Mila-Quebec Artificial Intelligence Institute in Canada who did not contribute to the findings, tells Scientific American.
But future studies of this kind of learning could help pull back the curtain on the mysterious process of language acquisition.
“The potential for further refinements to make the model more aligned with the complexities of human learning is vast, offering exciting avenues for advancements in cognitive sciences,” Anirudh Goyal, a machine learning scientist at the University of Montreal in Canada who was not involved in the research, tells Nature News.