AI Moderation Systems Show Striking Inconsistencies In Detecting Hate Speech

A new research paper uncovers how leading AI systems disagree on what counts as hate speech, raising urgent concerns about fairness, trust, and protection from online harm.

Research: Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems. Image Credit: 3d Stock Hub / Shutterstock


With the proliferation of online hate speech, which research shows can increase political polarization and damage mental health, leading artificial intelligence companies have released large language models that promise automatic content filtering. "Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard," says Yphtach Lelkes, associate professor in the Annenberg School for Communication.

New Research Compares AI Moderation Systems

He and Annenberg doctoral student Neil Fasching have produced the first large-scale comparative analysis of AI content moderation systems—which social media platforms employ—and tackled the question of how consistent they are in evaluating hate speech. Their study is published in Findings of the Association for Computational Linguistics.

Seven Major AI Models Under Review

Lelkes and Fasching analyzed seven models, including two designed specifically for content classification and others with more general applications: two from OpenAI, two from Mistral, along with Claude 3.5 Sonnet, DeepSeek V3, and the Google Perspective API. Their analysis covers 1.3 million synthetic sentences that make statements about 125 groups, described with both neutral terms and slurs and spanning categories from religion to disability to age. Each sentence combines "all" or "some," a group term, and a hate speech phrase.
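The quantifier-group-phrase template described above lends itself to straightforward automated generation. The Python sketch below is illustrative only: the quantifiers match the paper's description, but the group terms and phrases are placeholders, not the study's actual 125-group lexicon or its hate speech phrase set.

```python
from itertools import product

# Illustrative placeholders only; the study's real lexicon of 125 group terms
# (neutral terms and slurs) and its phrase set are not reproduced here.
quantifiers = ["All", "Some"]
group_terms = ["<group_term_1>", "<group_term_2>"]      # stand-ins for the 125 groups
phrases = ["<hate_speech_phrase>", "<neutral_phrase>"]  # stand-ins for the phrase set

def build_sentences(quantifiers, group_terms, phrases):
    """Combine quantifier + group term + phrase into test sentences."""
    return [f"{q} {g} {p}" for q, g, p in product(quantifiers, group_terms, phrases)]

corpus = build_sentences(quantifiers, group_terms, phrases)
print(len(corpus), "sentences generated")
```

Scaled up to the full lexicon, this kind of exhaustive combination is how a corpus on the order of 1.3 million sentences can be produced from a compact set of building blocks.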

The Models Make Different Decisions About the Same Content

"The research shows that content moderation systems have dramatic inconsistencies when evaluating identical hate speech content, with some systems flagging content as harmful while others deem it acceptable," Fasching says. This is a critical issue for the public, Lelkes says, because inconsistent moderation can erode trust and create perceptions of bias.

Fasching and Lelkes also found variation in the internal consistency of models: one demonstrated high predictability in classifying similar content, another produced different results for similar content, and others showed a more measured approach, neither over-flagging nor under-detecting content as hate speech. "These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation," the researchers write.
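One way to make "inconsistency" concrete is to compare how often each system flags identical sentences and how often pairs of systems disagree. The minimal Python sketch below uses hypothetical binary labels for three hypothetical systems; in the study, the labels would come from querying the seven moderation systems on the same sentences.

```python
from itertools import combinations

# Hypothetical output: for each sentence, whether each system flagged it
# (1 = flagged as hate speech, 0 = not flagged).
flags = {
    "system_A": [1, 1, 0, 1],
    "system_B": [1, 0, 0, 0],
    "system_C": [1, 1, 1, 1],
}

def flag_rate(labels):
    """Share of sentences a system flags as hate speech."""
    return sum(labels) / len(labels)

def pairwise_disagreement(a, b):
    """Share of identical sentences on which two systems disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

for name, labels in flags.items():
    print(name, "flag rate:", flag_rate(labels))

for (name_1, labels_1), (name_2, labels_2) in combinations(flags.items(), 2):
    print(name_1, "vs", name_2, "disagreement:",
          pairwise_disagreement(labels_1, labels_2))
```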

The Variations Are Especially Pronounced for Certain Groups

"These inconsistencies are especially pronounced for specific demographic groups, leaving some communities more vulnerable to online harm than others," Fasching says.

He and Lelkes found that hate speech evaluations across the seven systems were more similar for statements about groups based on sexual orientation, race, and gender, while inconsistencies intensified for groups based on education level, personal interest, and economic class. This suggests "that systems generally recognize hate speech targeting traditional protected classes more readily than content targeting other groups," the authors write.

Models Handle Neutral and Positive Sentences Differently

A minority of the 1.3 million synthetic sentences were neutral or positive, included to assess false identification of hate speech and to test how models handle pejorative terms in non-hateful contexts, such as "All [slur] are great people."

The researchers found that Claude 3.5 Sonnet and Mistral's specialized content classification system treat slurs as harmful across the board, whereas the other systems prioritize context and intent. The authors were surprised to find that each model consistently fell into one of two camps, with little middle ground.

About the Researchers

Yphtach Lelkes is an associate professor of communication in the Annenberg School for Communication, co-director of the Polarization Research Lab, and co-director of the Center for Information Networks and Democracy.

Neil Fasching is a doctoral candidate in the Annenberg School for Communication and a member of the Democracy and Information Group.

The Annenberg School for Communication supported this research.

Journal reference:
  • Neil Fasching and Yphtach Lelkes. 2025. Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22271–22285, Vienna, Austria. Association for Computational Linguistics, https://aclanthology.org/2025.findings-acl.1144/
