A pioneering study by The Hong Kong Polytechnic University reveals that incorporating sensory inputs into large language models significantly enhances their ability to understand complex human concepts, drawing closer parallels to human cognition.
Researchers led by The Hong Kong Polytechnic University (PolyU) have uncovered how large language models (LLMs) form complex conceptual knowledge in a more human-like way when enriched with sensory and motor inputs.
The study, led by Li Ping, the Sin Wai Kin Foundation Professor in Humanities and Technology and dean of the PolyU Faculty of Humanities, explored the similarities between LLMs and human conceptual representation.
The findings, published in the journal Nature Human Behaviour, suggest that while LLMs trained solely on language exhibit limitations, those integrated with sensory inputs show a more nuanced understanding akin to human cognition.
By comparing word ratings generated by state-of-the-art LLMs, including OpenAI's ChatGPT (GPT-3.5, GPT-4) and Google's PaLM and Gemini, with human-generated word ratings, the researchers found that the models align closely with human judgments on non-sensorimotor dimensions but diverge on sensory and motor-related concepts. This gap points to the need for sensory grounding in refining AI models.
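The article does not spell out the study's analysis pipeline, but the core comparison, correlating model-generated word ratings with human norms dimension by dimension, can be illustrated with a short sketch. The file names, dimension labels and choice of Spearman correlation below are illustrative assumptions, not the authors' actual code or data.

```python
# Minimal sketch (not the study's pipeline): compare LLM word ratings
# against human norms, one rated dimension at a time.
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical CSVs: one row per word, one column per rated dimension
# (e.g. "valence" as a non-sensorimotor dimension, "haptic" as a sensorimotor one).
human = pd.read_csv("human_norms.csv", index_col="word")
model = pd.read_csv("llm_ratings.csv", index_col="word")

shared_words = human.index.intersection(model.index)
for dim in ["valence", "concreteness", "visual", "haptic", "hand_arm"]:
    rho, p = spearmanr(human.loc[shared_words, dim], model.loc[shared_words, dim])
    print(f"{dim:>14}: Spearman rho = {rho:.2f} (p = {p:.3g})")
```

In a pattern like the one the study reports, correlations for non-sensorimotor dimensions would come out higher than those for sensory and motor dimensions when the model has been trained on language alone.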
“The availability of both LLMs trained on language alone and those trained on language and visual input, such as images and videos, provides a unique setting for research on how sensory input affects human conceptualisation,” Li said in a news release. “Our study exemplifies the potential benefits of multimodal learning, a human ability to simultaneously integrate information from multiple dimensions in the learning and formation of concepts and knowledge in general.”
The significance of this research lies in the potential applications for future AI development. With improved multimodal learning, AI systems could perform more human-like tasks, from interpreting data to executing physical actions.
The researchers propose that future LLMs equipped with integrated sensory inputs through humanoid robotics could revolutionize fields such as autonomous robotics, natural language processing and cognitive computing.
“The smooth, continuous structure of embedding space in LLMs may underlie our observation that knowledge derived from one modality could transfer to other related modalities. This could explain why congenitally blind and normally sighted people can have similar representations in some areas. Current limits in LLMs are clear in this respect,” added Li Ping.
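Li's point about a smooth, shared embedding space can be illustrated very roughly with a text-only encoder: words tied to different senses can still land near semantically related neighbours. The sketch below uses the sentence-transformers library with a small stand-in model; the model choice and word lists are illustrative assumptions, not the study's materials.

```python
# Rough illustration only: probe whether a text-only embedding space places
# words from different sensory modalities near related neighbours.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small text-only stand-in model

words = ["scarlet", "crimson", "trumpet", "blare", "velvet", "smooth"]
emb = encoder.encode(words)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows

# Cosine similarity matrix; for each word, report its nearest neighbour
# (skipping the word itself, which always has similarity 1).
sim = emb @ emb.T
for i, w in enumerate(words):
    nearest = words[int(np.argsort(-sim[i])[1])]
    print(f"{w:>8} -> nearest neighbour: {nearest}")
```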
The study provides a clear path forward for advancing AI technologies, emphasizing the role of multimodal input in achieving more sophisticated and human-like artificial intelligence.
“These advances may enable LLMs to fully capture embodied representations that mirror the complexity and richness of human cognition, and a rose in LLM’s representation will then be indistinguishable from that of humans,” concluded Li.
Co-authors of the study include experts from Ohio State University, Princeton University and City University of New York.

