A study led by Johns Hopkins University researchers highlights the significant gap between human and artificial intelligence abilities in understanding social interactions from moving scenes. This discovery underscores the challenges of creating AI that can effectively interact with humans in dynamic environments.
Artificial intelligence has come a long way in recent years, excelling at tasks like image recognition and language processing. However, when it comes to understanding social interactions in dynamic environments, humans still have the upper hand. According to a new study led by scientists at Johns Hopkins University, current AI models struggle to interpret the social dynamics and contexts necessary for effective interaction with people.
“AI for a self-driving car, for example, would need to recognize the intentions, goals and actions of human drivers and pedestrians. You would want it to know which way a pedestrian is about to start walking, or whether two people are in conversation versus about to cross the street,” lead author Leyla Isik, an assistant professor of cognitive science at Johns Hopkins University, said in a news release. “Any time you want an AI to interact with humans, you want it to be able to recognize what people are doing. I think this sheds light on the fact that these systems can’t right now.”
The research, which was presented at the International Conference on Learning Representations on April 24, involved asking human participants to watch three-second video clips and rate features crucial for understanding social interactions. These clips showed individuals either interacting with each other, engaging in side-by-side activities, or performing independent actions.
The researchers then had more than 350 AI models — including language, video and image models — predict how humans would rate the videos and how their brains would respond to watching these scenes.
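The article does not detail how the models' predictions were scored against the human data. As a rough illustration of the kind of comparison such a benchmark involves, the short Python sketch below correlates a hypothetical model's per-clip predictions with averaged human ratings; the correlation-based metric, function names, and data shapes are assumptions for illustration, not the authors' actual analysis.

```python
# Illustrative sketch only: the study's analysis code is not reproduced in the
# article, so the names, shapes, and correlation metric below are hypothetical.
import numpy as np
from scipy.stats import pearsonr

def score_model_against_humans(model_ratings, human_ratings):
    """Correlate one model's per-clip predictions with mean human ratings.

    model_ratings : array of shape (n_clips,), the model's predicted rating per clip
    human_ratings : array of shape (n_raters, n_clips), human ratings of the same clips
    """
    mean_human = human_ratings.mean(axis=0)      # average human rating per clip
    r, _ = pearsonr(model_ratings, mean_human)   # agreement with the human consensus
    return r

# Hypothetical usage: 3 human raters, 5 three-second clips, one model
rng = np.random.default_rng(0)
humans = rng.integers(1, 6, size=(3, 5)).astype(float)  # ratings on a 1-5 scale
model = rng.uniform(1, 5, size=5)
print(f"Model-human correlation: {score_model_against_humans(model, humans):.2f}")
```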
The results revealed a clear gap: while human participants largely agreed with one another in their assessments, the AI models' predictions did not match those human judgments.
Notably, video models were particularly ineffective at describing what people were doing, and even image models analyzing still frames could not reliably predict whether individuals were communicating.
Interestingly, language models fared better at predicting human behavior, though video models were more adept at predicting neural activity in the brain.
“It’s not enough to just see an image and recognize objects and faces. That was the first step, which took us a long way in AI. But real life isn’t static. We need AI to understand the story that is unfolding in a scene. Understanding the relationships, context and dynamics of social interactions is the next step, and this research suggests there might be a blind spot in AI model development,” added Kathy Garcia, a doctoral student in Isik’s lab and co-first author of the study.
The implications of this study are profound for technologies relying on AI to navigate the real world, such as self-driving cars and assistive robots. The ability to understand social interactions is crucial for these systems to function safely and efficiently.
The researchers attribute this shortcoming to the design of AI neural networks, which are inspired by the part of the human brain that processes static images rather than dynamic social scenes.
“There’s a lot of nuances, but the big takeaway is none of the AI models can match human brain and behavior responses to scenes across the board, like they do for static scenes,” Isik added. “I think there’s something fundamental about the way humans are processing scenes that these models are missing.”
Source: Johns Hopkins University