A study from the Annenberg School for Communication reveals substantial differences in how AI models, including those from OpenAI, DeepSeek and Google, detect hate speech, differences that could have serious implications for content moderation and online community safety.
Artificial intelligence has emerged as a key player in moderating online content, especially hate speech, as platforms seek to curb political polarization and safeguard mental health. However, a recent study from the Annenberg School for Communication at the University of Pennsylvania highlights a critical issue: the evaluation of hate speech by leading AI models is far from consistent.
“Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard,” Yphtach Lelkes, an associate professor in the Annenberg School for Communication, said in a news release.
Lelkes and Annenberg doctoral student Neil Fasching conducted the first large-scale comparative analysis of AI content moderation systems, examining their consistency in evaluating hate speech.
Their study, published in the Findings of the Association for Computational Linguistics, analyzed seven prominent models: two from OpenAI, two from Mistral, Claude 3.5 Sonnet, DeepSeek V3 and Google's Perspective API.
The researchers analyzed a staggering 1.3 million synthetic sentences covering 125 groups, pairing terms ranging from neutral descriptors to slurs with categories including religion, disability, age and more.
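The paper's evaluation pipeline is not reproduced here, but the basic setup it describes, expanding templated sentences across many groups and asking several classifiers to score the identical text, can be sketched roughly as follows. The templates, group labels and stub classifiers below are illustrative placeholders of my own, not the study's actual materials or models.

```python
# Minimal sketch: generate templated sentences and compare how multiple
# moderation classifiers score the same text. The classifiers here are
# stand-in stubs; a real comparison would call the actual model APIs.

from itertools import product

# Hypothetical templates and group labels, for illustration only.
TEMPLATES = ["I really hate {group}.", "{group} are wonderful people."]
GROUPS = ["group A", "group B", "group C"]

def build_sentences(templates, groups):
    """Expand every template with every group label."""
    return [t.format(group=g) for t, g in product(templates, groups)]

# Stand-in classifiers: each returns True if it would flag the text.
def classifier_keyword(text):
    return "hate" in text.lower()

def classifier_strict(text):
    return any(word in text.lower() for word in ("hate", "despise"))

CLASSIFIERS = {"keyword": classifier_keyword, "strict": classifier_strict}

def agreement_report(sentences, classifiers):
    """Print each sentence with every classifier's verdict and flag disagreements."""
    for sentence in sentences:
        verdicts = {name: clf(sentence) for name, clf in classifiers.items()}
        label = "AGREE" if len(set(verdicts.values())) == 1 else "SPLIT"
        print(f"{label} {verdicts} :: {sentence}")

if __name__ == "__main__":
    agreement_report(build_sentences(TEMPLATES, GROUPS), CLASSIFIERS)
```

Scaled up to 125 groups and many more templates, the same pattern of expanding every group-term combination is what produces a corpus on the order of a million sentences for cross-model comparison.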
Key Takeaways From the Study
1. Inconsistent Decisions Across Models
“The research shows that content moderation systems have dramatic inconsistencies when evaluating identical hate speech content, with some systems flagging content as harmful while others deem it acceptable,” Fasching, who is a member of the Democracy and Information Group, said in the news release.
Lelkes, who is also a co-director of the Polarization Research Lab and the Center for Information Networks and Democracy, adds that this inconsistency can erode public trust and create perceptions of bias. The study also found that the models varied in their internal consistency, highlighting the challenge of balancing detection accuracy with avoiding over-moderation.
2. Pronounced Inconsistencies for Certain Groups
“These inconsistencies are especially pronounced for specific demographic groups, leaving some communities more vulnerable to online harm than others,” Fasching added.
The research indicates more consistent hate speech evaluations for groups based on sexual orientation, race and gender, while variability increased for groups defined by education level, personal interests and economic class.
3. Different Handling of Neutral and Positive Sentences
Notably, a minority of the sentences were neutral or positive, included to test whether the systems falsely flag benign content as hate speech. Systems such as Claude 3.5 Sonnet and Mistral's specialized content classification model treated all slurs as harmful regardless of context, whereas others weighed context and intent.
The authors were surprised by the clear division in how models classified these cases, with little middle ground.
Source: Annenberg School for Communication, University of Pennsylvania

