Even with vast training data, artificial intelligence models struggle to grasp the multisensory richness of human concepts, highlighting a fundamental gap between human and machine understanding.
Even with all its training data and computing power, an artificial intelligence (AI) tool like ChatGPT can't represent the concept of a flower the way a human does, according to a new study.
That's because the large language models (LLMs) that power AI assistants are typically trained on language alone, and sometimes on images as well.
"A large language model can't smell a rose, touch the petals of a daisy or walk through a field of wildflowers," said Qihui Xu, lead author of the study and postdoctoral researcher in psychology at The Ohio State University.
"Without those sensory and motor experiences, it can't truly represent what a flower is in all its richness. The same is true of some other human concepts."
The study was published this week in the journal Nature Human Behaviour.
Xu said the findings have implications for the relationship between AI and humans.
"If AI construes the world in fundamentally different way from humans, it could affect how it interacts with us," she said.
Xu and her colleagues compared humans and LLMs in their knowledge representation of 4,442 words – everything from "flower" and "hoof" to "humorous" and "swing."
They compared the similarity of representations between humans and two state-of-the-art LLM families from OpenAI (GPT-3.5 and GPT-4) and Google (PaLM and Gemini).
Humans and LLMs were tested on two measures. One, called the Glasgow Norms, asks for ratings of words on nine dimensions, such as arousal, concreteness, and imageability. For example, the measure asks for ratings of how emotionally arousing a flower is, and how much one can mentally visualize a flower (or how imageable it is).
The other measure, called the Lancaster Norms, asks how word concepts relate to sensory information (such as touch, hearing, smell, and vision) and to motor information involved in actions, including those performed with the mouth, hand, arm, and torso.
For example, the measure asks for ratings on how much one experiences flowers by smelling and how much one experiences flowers through actions involving the torso.
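To make this concrete, here is a minimal sketch in Python of how a norms-style rating question could be posed to a language model. This is not the study's actual prompt; the question wording, the 0-5 scale, and the function name are placeholders for illustration.

```python
# A minimal sketch (not the study's actual prompt) of a norms-style
# rating question for one word on one dimension. The wording, the 0-5
# scale, and the function name are illustrative placeholders.
def rating_prompt(word: str, question: str, scale: str) -> str:
    """Build a single rating question for one word and one dimension."""
    return (
        f"Consider the word '{word}'.\n"
        f"{question}\n"
        f"Answer with a single number on the scale {scale}."
    )

print(rating_prompt(
    "flower",
    "To what extent do you experience this concept by smelling?",
    "0 (not at all) to 5 (greatly)",
))
```

The same question can be given to human raters and to an LLM, which is what makes the two sets of ratings directly comparable.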
The goal was to determine how LLMs and humans aligned in their ratings of the words. In one analysis, the researchers examined the correlation between humans and AI on certain concepts. For example, do the LLMs and humans agree that some concepts have higher emotional arousal than others?
In a second analysis, the researchers examined how humans and LLMs compare in the way different dimensions jointly contribute to a word's overall conceptual representation, and in how different words are related to one another.
For example, the concepts of pasta and roses might both receive high ratings for their strong sensory appeal, particularly in terms of smell. Yet humans judge pasta to be more similar to noodles than to roses, not just because of its smell but also because of its visual appearance and taste.
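A rough sense of how these two comparisons can be computed is sketched below in Python. This is not the authors' code: the words, dimensions, and rating values are invented for illustration, and the real analyses cover thousands of words and all of the norm dimensions. The first part correlates human and LLM ratings across words for each dimension; the second asks whether words that are similar to each other for humans are also similar for the model.

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import pdist, squareform

# Hypothetical ratings: rows are words, columns are norm dimensions
# (e.g., arousal, imageability, smell, torso action). Values are invented.
words = ["flower", "pasta", "noodle", "hoof"]
dims = ["arousal", "imageability", "smell", "torso"]
human = np.array([[3.1, 6.2, 4.8, 1.0],
                  [2.5, 6.0, 4.5, 1.2],
                  [2.4, 5.9, 4.0, 1.1],
                  [1.8, 5.1, 1.5, 2.0]])
llm   = np.array([[3.0, 5.8, 3.2, 1.4],
                  [2.7, 5.5, 3.0, 1.6],
                  [2.6, 5.6, 2.9, 1.5],
                  [2.0, 4.9, 1.8, 2.2]])

# Analysis 1: per-dimension agreement -- correlate human and LLM
# ratings across words, one dimension at a time.
for j, dim in enumerate(dims):
    r, _ = pearsonr(human[:, j], llm[:, j])
    print(f"{dim:>13}: r = {r:.2f}")

# Analysis 2: similarity structure -- do words that humans treat as
# similar to each other also cluster together for the LLM?
human_sim = 1 - squareform(pdist(human, metric="cosine"))
llm_sim   = 1 - squareform(pdist(llm, metric="cosine"))
iu = np.triu_indices(len(words), k=1)          # unique word pairs
r, _ = pearsonr(human_sim[iu], llm_sim[iu])
print(f"similarity-structure agreement: r = {r:.2f}")
```

The second analysis is what distinguishes, say, pasta being close to noodles from pasta being close to roses: it compares whole patterns of relations between words rather than ratings on any single dimension.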
Overall, the LLMs performed very well compared to humans in representing words that had no connection to the senses or motor actions. However, when it came to words that have connections to things we see, taste, or interact with using our bodies, that's where AI failed to capture human concepts.
"From the intense aroma of a flower, the vivid silky touch when we caress petals, to the profound joy evoked, human representation of 'flower' binds these diverse experiences and interactions into a coherent category," the researchers say in the paper.
The issue is that most LLMs are dependent on language, and "language by itself can't fully recover conceptual representation in all its richness," Xu said.
Even though LLMs can approximate some human concepts, particularly when they don't involve senses or motor actions, this kind of learning is not efficient.
"They obtain what they know by consuming vast amounts of text – orders of magnitude larger than what a human is exposed to in their entire lifetimes – and still can't quite capture some concepts the way humans do," Xu said.
"The human experience is far richer than words alone can hold."
However, Xu noted that LLMs are continually improving, and it's likely they will become better at capturing human concepts. The study found that LLMs trained on images, as well as text, performed better than text-only models in representing concepts related to vision.
And when future LLMs are augmented with sensor data and robotics, they may be able to make inferences about, and act upon, the physical world, she said.
Co-authors on the study were Yingying Peng, Ping Li, and Minghua Wu of the Hong Kong Polytechnic University; Samuel Nastase of Princeton University; and Martin Chodorow of the City University of New York.
Journal reference:
- Xu, Q., Peng, Y., Nastase, S. A., Chodorow, M., Wu, M., & Li, P. (2025). Large language models without grounding recover non-sensorimotor but not sensorimotor features of human concepts. Nature Human Behaviour, 1-16. DOI: 10.1038/s41562-025-02203-8, https://www.nature.com/articles/s41562-025-02203-8