AI Falters in Language Comprehension as Humans Maintain the Lead

Despite their fluency, AI language models stumble on basic comprehension tasks, revealing critical gaps compared to human understanding.

Research: Testing AI on language comprehension tasks reveals insensitivity to underlying meaning. Image Credit: Rob Hyrons / Shutterstock

In a paper published in the journal Scientific Reports, researchers assessed seven state-of-the-art large language models (LLMs) on a new benchmark, discovering that these models performed at chance accuracy and produced inconsistent answers. Humans outperformed the models both quantitatively and qualitatively.

The study indicated that current LLMs still lack human-like language understanding, a shortfall attributed to the absence of a "compositional operator" for effectively handling grammatical and semantic information. This operator is essential for mapping linguistic structures to meanings in a way that generalizes across contexts.

Background

Past work highlighted that LLMs excel in tasks ranging from translation to answering domain-specific queries, such as in law or medicine. Despite their fluency, however, they often stumble on simpler linguistic tasks, revealing inconsistencies in language understanding compared to humans.

Researchers questioned whether models truly grasp meaning or merely predict tokens based on data patterns, pointing out that errors in comprehension can have significant real-world consequences, such as misleading chatbot interactions that could impact industries like customer service or healthcare.

LLM Comprehension Evaluation

The study evaluated the language comprehension abilities of seven LLMs using a set of 40 comprehension questions designed to minimize grammatical complexity. These prompts included only affirmative sentences, avoided negations, and used common verbs to reduce ambiguity. Each prompt was tested multiple times to assess answer stability, with models responding in open-length and one-word settings. Human performance was compared using the same questions administered to 400 English-speaking participants, equally split by gender, recruited from the Prolific platform.

Each LLM was tested in December 2023 through OpenAI, Google, and HuggingFace interfaces. The set included models that leverage reinforcement learning from human feedback (RLHF), such as ChatGPT-3.5, ChatGPT-4, Bard, and Gemini, alongside Llama 2, Mixtral, and Falcon. Prompts were randomized and presented in both settings to ensure robust comparisons.
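To make the protocol concrete, here is a minimal Python sketch of how such a repeated-prompting design could be scripted. It is an illustration under stated assumptions, not the authors' actual test harness: query_model stands in for the provider-specific API calls, and the one-word instruction wording is invented.

```python
import random

# Seven models x 40 questions x 3 repetitions x 2 settings = 1,680 replies,
# matching the totals reported in the article.
MODELS = ["ChatGPT-3.5", "ChatGPT-4", "Bard", "Gemini", "Llama 2", "Mixtral", "Falcon"]
SETTINGS = {
    "open-length": "",                          # no constraint on reply length
    "one-word": " Answer with one word only.",  # assumed instruction wording
}

def query_model(model: str, prompt: str) -> str:
    """Placeholder for the real OpenAI / Google / HuggingFace API call."""
    return "yes"  # stub so the sketch runs end to end

def run_benchmark(questions: list[str], repetitions: int = 3) -> list[dict]:
    replies = []
    for model in MODELS:
        for setting, suffix in SETTINGS.items():
            # Present the questions in randomized order, repeating each one.
            for question in random.sample(questions, k=len(questions)):
                for rep in range(repetitions):
                    replies.append({
                        "model": model,
                        "setting": setting,
                        "question": question,
                        "repetition": rep,
                        "answer": query_model(model, question + suffix),
                    })
    return replies
```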

Accuracy was coded leniently, favoring the models where possible to gauge their best performance. For instance, ambiguous answers in the open-length condition were marked correct if they contained no clear errors, even if they lacked precision. A total of 1,680 LLM replies were analyzed, with models prompted three times per question to mirror human testing.
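The coding rule itself can be pictured with a short sketch. The following function is a hedged reconstruction, assuming yes/no ground truths for illustration; the study's actual per-question coding scheme is not reproduced here.

```python
import re

def code_reply(reply: str, correct: str, setting: str) -> bool:
    """Leniently code a reply as accurate, favoring the model where possible."""
    tokens = re.findall(r"[a-z]+", reply.lower())
    if setting == "one-word":
        return tokens == [correct.lower()]
    # Open-length condition: accept an imprecise answer as long as it
    # contains the correct response and no clearly contradicting one.
    wrong = {"yes": "no", "no": "yes"}[correct.lower()]
    return correct.lower() in tokens and wrong not in tokens

# Example: a verbose but error-free answer still counts as correct.
assert code_reply("Yes, I believe that is the case.", "yes", "open-length")
assert not code_reply("Yes and no.", "yes", "open-length")
```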

The human study, approved by the ethics committee at Humboldt-Universität zu Berlin, involved 400 participants tested under similar conditions. Each participant answered 20 prompts, each repeated thrice, resulting in 24,000 replies.

Participants were divided into open-length and one-word groups. Each question was administered in random order alongside two attention checks.

Responses were coded for accuracy and stability, and those who failed attention checks were excluded. Human replies were collected using the jsPsych toolkit, with a median experiment completion time of 13.4 minutes.
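The exclusion step can be sketched in a few lines of pandas; the column names below are assumptions, as the study's actual variable names are not given in this summary.

```python
import pandas as pd

def exclude_failed_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every participant who failed either attention check."""
    checks = df[df["trial_type"] == "attention_check"]
    passed = checks.groupby("participant_id")["correct"].all()
    keep = passed[passed].index
    # Return only the real comprehension trials of the retained participants.
    return df[df["participant_id"].isin(keep) & (df["trial_type"] == "question")]
```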

The researchers also highlighted how systematic testing allowed comparisons between the stability of human and LLM responses under identical conditions, revealing key performance gaps.

Human-LLM Performance Comparison

The study compared the language comprehension of seven LLMs and human participants, focusing on accuracy and stability. Accuracy analyses used generalized linear mixed-effects models (GLMMs) to evaluate performance. Statistical testing confirmed that LLMs, as a group, performed at chance accuracy, with significant variability between models. ChatGPT-4 emerged as the most accurate LLM, significantly outperforming the others.
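The article does not specify the authors' statistical software or exact model specification, but the flavor of such a binomial GLMM can be sketched in Python with statsmodels: trial-level accuracy (0/1) is predicted from model identity and response setting, with a variance component grouping repeated observations of the same question. The file and column names here are assumptions.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("llm_replies.csv")  # assumed columns: accurate, model, setting, question

# Fixed effects for model and setting; random intercepts by question.
glmm = BinomialBayesMixedGLM.from_formula(
    "accurate ~ C(model) + C(setting)",
    {"question": "0 + C(question)"},
    df,
)
result = glmm.fit_vb()  # variational Bayes fit
print(result.summary())
```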

The study revealed higher accuracy in one-word responses than in open-length settings. Falcon and ChatGPT-4 were noted for consistently providing accurate responses, whereas Llama 2 and Mixtral showed chance-level performance and Bard's accuracy dropped below chance. In contrast, humans performed above chance regardless of response type, indicating robust and contextually grounded comprehension abilities.

Stability assessments measured the consistency of answers across repeated prompts, coded as stable if identical or unstable if varied. Stability varied significantly between models, with Falcon proving to be the most stable. Bard and Mixtral demonstrated lower consistency, while Gemini displayed stability despite providing inaccurate answers.
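The stability criterion is simple enough to state directly. A minimal sketch follows; the normalization step is an assumption of this illustration, not a detail reported in the article.

```python
def is_stable(replies: list[str]) -> bool:
    """A question counts as stable when all repetitions give the same answer."""
    normalized = {reply.strip().lower().rstrip(".") for reply in replies}
    return len(normalized) == 1

# Example: the first triple is stable, the second is not.
assert is_stable(["Yes.", "yes", "YES"])
assert not is_stable(["Yes.", "No.", "Yes."])
```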

A setting effect was observed, with responses being more stable in the one-word condition. Comparatively, LLMs were less stable than humans, especially in open-length settings. This highlights the inability of LLMs to replicate the inherent consistency of human language comprehension.

Comparative analyses between humans and LLMs highlighted notable differences. Humans outperformed LLMs in both accuracy and stability, even when the best-performing model, ChatGPT-4, was compared with the best-performing humans, who reached ceiling performance. Statistical models revealed that the performance gap widened in open-length settings, while the one-word setting narrowed it.

Despite ChatGPT-4's high performance, it did not match the best human participants. The data suggested that humans maintained superior comprehension and stability, even when LLMs benefited from favorable coding rules. For instance, humans provided concise, error-free answers aligned with task instructions, while LLMs frequently added redundant or irrelevant content.

Conclusion

In summary, the study revealed that while LLMs demonstrate utility in many tasks, they performed at chance accuracy on a language comprehension benchmark. Their responses were inconsistent and included errors unlike those humans make, pointing to critical limitations in grasping linguistic meaning beyond surface-level patterns.

The results indicated that current AI models lack the compositional operator needed to effectively handle grammatical and semantic information. These findings call for a reevaluation of claims that LLMs have achieved human-like linguistic capabilities, particularly in real-world contexts where misinterpretation can have serious consequences.

Journal reference:

Dentella, V., Günther, F., Murphy, E., Marcus, G., & Leivada, E. (2024). Testing AI on language comprehension tasks reveals insensitivity to underlying meaning. Scientific Reports.

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

