LLMs and Theory of Mind: Comparative Analysis

In a paper published in the journal Nature Human Behaviour, researchers compared the performance of humans and large language models (LLMs) on theory of mind tasks. Through extensive testing, they found that while generative pre-trained transformer 4 (GPT-4) models often excelled at identifying indirect requests, false beliefs, and misdirection, they struggled to detect faux pas.

Study: LLMs and Theory of Mind: Comparative Analysis. Image Credit: Phalexaviles/Shutterstock

Conversely, Large Language Model Meta AI 2 (LLaMA2) exhibited superior performance in faux pas detection, though subsequent analyses revealed this advantage to be illusory. These findings demonstrated that LLMs can approximate human-like behavior in mentalistic inference and highlighted the importance of systematic testing for comprehensive comparisons between human and artificial intelligence (AI).


Previous work has highlighted the significance of theory of mind (the ability to understand others' mental states) in human social interactions. This capacity underpins communication, empathy, and decision-making. LLMs like GPT have shown promise in mimicking aspects of theory of mind. However, concerns persist about their robustness and interpretability, and there is increasing demand for a systematic experimental approach, akin to machine psychology, to investigate LLM capabilities.

Research Overview and Methodology

The research was conducted according to approved ethical standards and the guidelines of the Declaration of Helsinki, overseen by a local ethics committee. It involved testing versions 3.5 and 4 of OpenAI's GPT models alongside LLaMA2-Chat models. The team ran the LLaMA2-Chat models with preset parameters and used LangChain's conversation chain to establish memory context within chat sessions.
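The idea of memory context within a chat session can be illustrated with a minimal sketch. This is not LangChain and not the study's code: the `ChatSession` class, the `Human:`/`AI:` prompt layout, and the `stub_model` function are all hypothetical stand-ins, chosen only to show how earlier turns can be replayed to a model so that a follow-up question is answered in context.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ChatSession:
    """Keeps the running transcript so each new question sees earlier turns."""
    model: Callable[[str], str]  # hypothetical: maps a prompt string to a reply
    history: List[Tuple[str, str]] = field(default_factory=list)

    def build_prompt(self, user_message: str) -> str:
        # Prepend every earlier turn so the model has the session's context.
        lines = []
        for human, ai in self.history:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        lines.append(f"Human: {user_message}")
        lines.append("AI:")
        return "\n".join(lines)

    def ask(self, user_message: str) -> str:
        reply = self.model(self.build_prompt(user_message))
        self.history.append((user_message, reply))
        return reply


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call: reports how many human turns it can see.
    return f"I can see {prompt.count('Human:')} human turn(s)."


session = ChatSession(model=stub_model)
first = session.ask("Sally puts her marble in the basket and leaves.")
second = session.ask("Where will Sally look for her marble?")
print(first)
print(second)
```

Starting a fresh `ChatSession` for each run would keep sessions independent of one another, while the story and its follow-up questions share context within a session.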

Participants were recruited online. They were native English speakers aged between 18 and 70 with no history of psychiatric conditions or dyslexia, and they received compensation for their involvement. The study used a battery of theory of mind tests to assess social cognition, covering false belief, irony, faux pas, hinting, and strange stories. Rigorous coding procedures were employed for response evaluation, ensuring consistency among experimenters.

Statistical analyses, including Wilcoxon and Bayesian tests, were used to compare LLMs' performance against human benchmarks across the theory of mind tests. The researchers also introduced novel test items to evaluate LLMs' comprehension beyond familiar scenarios. In addition, a belief likelihood test manipulated how likely it was that the speaker in a faux pas scenario knew the relevant information, and the resulting response distributions were analyzed using chi-square tests and Bayesian approaches.
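As a rough illustration of how a chi-square test can compare response distributions, the sketch below runs a Pearson chi-square test of independence on a 2x3 table of counts. The counts, the two condition labels, and the three response categories are invented for illustration and are not data from the study. The p-value uses the fact that, for 2 degrees of freedom, the chi-square survival function is exactly exp(-x/2).

```python
import math
from typing import List, Tuple


def chi_square_2x3(table: List[List[int]]) -> Tuple[float, float]:
    """Pearson chi-square test of independence for a 2x3 count table.

    Returns (statistic, p_value). A 2x3 table has (2-1)*(3-1) = 2 degrees
    of freedom, for which the survival function is exactly exp(-x / 2).
    """
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2x3 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs at least one response")
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.exp(-statistic / 2)


# Hypothetical counts of responses per condition (not from the study).
# Columns: "speaker knew", "unsure", "speaker did not know".
responses = [
    [10, 3, 2],  # scenario variant implying the speaker probably knew
    [2, 3, 10],  # scenario variant implying the speaker probably did not know
]
statistic, p_value = chi_square_2x3(responses)
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

A small p-value here would indicate that the response distribution shifts with the manipulated likelihood, which is the kind of pattern such a test is designed to detect.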

Theory of Mind Evaluation

The study evaluated LLMs' theory of mind abilities through tests such as hinting, false belief, faux pas, irony, and strange stories. LLMs, including GPT-4, GPT-3.5, and LLaMA2-70B, were tested alongside human participants, with each model taking part in 15 chat sessions. Performance was assessed based on the understanding shown of the characters' intentions, beliefs, and emotions in the provided scenarios. Both original and novel test items were used to ensure a fair evaluation, with responses scored against human benchmarks.
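Comparing a model's per-session scores with human scores can be sketched with a rank-based statistic. The example below assumes an unpaired Wilcoxon rank-sum comparison (equivalent to the Mann-Whitney U test), which the article does not specify, and the scores are hypothetical. It computes only the U statistic and the share of model-human pairs in which the model scores higher, not a p-value.

```python
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def rank_sum_u(model_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Mann-Whitney U for the model group, computed from pooled midranks."""
    combined = list(model_scores) + list(human_scores)
    ranks = midranks(combined)
    n_model = len(model_scores)
    model_rank_sum = sum(ranks[:n_model])
    return model_rank_sum - n_model * (n_model + 1) / 2


# Hypothetical test scores (not from the study).
model_scores = [9, 8, 10]
human_scores = [7, 8, 6, 9]
u = rank_sum_u(model_scores, human_scores)
# Share of model-human pairs where the model scores higher (ties count half).
superiority = u / (len(model_scores) * len(human_scores))
print(f"U = {u}, probability of superiority = {superiority:.3f}")
```

A superiority value near 0.5 would suggest the model and human score distributions overlap heavily, while values near 0 or 1 would suggest one group consistently outranks the other.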

In the false belief test, which assesses the understanding that others' beliefs can differ from reality, both humans and LLMs performed exceptionally well. In the irony test, however, GPT-4 outperformed humans, while GPT-3.5 and LLaMA2-70B struggled to recognize ironic statements accurately. The faux pas test, which probes sensitivity to social norms and unintended remarks, saw varied performance among models, with GPT-4 lagging behind humans and LLaMA2-70B surprisingly outperforming humans.

Further analyses explored why LLMs, particularly GPT-4, struggled with certain tests. In the faux pas scenarios, LLMs could identify that a social misstep had occurred, but they often hesitated to attribute intent or knowledge to the characters involved. This hesitation was attributed to an overly cautious approach rather than a lack of understanding. Subsequent tests framed the questions in terms of likelihood, revealing that GPT-4 could infer intentions accurately but tended to avoid committing to a specific interpretation, reflecting a nuanced but cautious understanding.

To validate these findings, additional variants of the faux pas test were introduced that manipulated the likelihood that characters were aware of their actions. The results mirrored those of the original tests, supporting the conclusion that LLMs exhibit a nuanced understanding of social scenarios but tend towards conservative responses when asked to make explicit judgments. Overall, the study sheds light on the interplay between language comprehension and social cognition in AI models, highlighting both their capabilities and their limitations in understanding human-like behavior.


To sum up, the study provided valuable insights into the theory of mind abilities of LLMs, including GPT-4, GPT-3.5, and LLaMA2-70B. While these models demonstrated impressive capabilities across various social scenarios, their performance varied between tests, indicating a nuanced but sometimes overly cautious grasp of human-like behavior.

By evaluating performance on theory of mind tests, the research clarified LLMs' strengths and limitations in interpreting and responding to human behavior. Continued investigation and refinement of these models are needed to improve their handling of complex social dynamics and nuanced human interactions.


Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2024, May 29). LLMs and Theory of Mind: Comparative Analysis. AZoAi. Retrieved on July 17, 2024 from https://www.azoai.com/news/20240529/LLMs-and-Theory-of-Mind-Comparative-Analysis.aspx.

  • MLA

    Chandrasekar, Silpaja. "LLMs and Theory of Mind: Comparative Analysis". AZoAi. 17 July 2024. <https://www.azoai.com/news/20240529/LLMs-and-Theory-of-Mind-Comparative-Analysis.aspx>.

  • Chicago

    Chandrasekar, Silpaja. "LLMs and Theory of Mind: Comparative Analysis". AZoAi. https://www.azoai.com/news/20240529/LLMs-and-Theory-of-Mind-Comparative-Analysis.aspx. (accessed July 17, 2024).

  • Harvard

    Chandrasekar, Silpaja. 2024. LLMs and Theory of Mind: Comparative Analysis. AZoAi, viewed 17 July 2024, https://www.azoai.com/news/20240529/LLMs-and-Theory-of-Mind-Comparative-Analysis.aspx.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoAi.