New Technique Helps AI Chatbots Stay Safe Without Losing Smarts

As AI chatbots move into classrooms, workplaces and homes, NC State researchers say safety training does not have to come at the expense of performance. Their new framework and technique aim to keep fine-tuned models from slipping into unsafe behavior.

As artificial intelligence chatbots become everyday tools for work, school and personal advice, a team at North Carolina State University has unveiled a way to make them safer without dulling their capabilities.

The researchers focused on large language models, or LLMs, the technology behind systems like ChatGPT. These models can draft emails, explain homework, write code and even offer step-by-step instructions for complex tasks. That power also creates risk when users ask for help with self-harm, crime or other dangerous activities.

“We don’t want LLMs to tell people to harm themselves or to give them information they can use to harm other people,” corresponding author Jung-Eun Kim, an assistant professor of computer science at NC State, said in a news release.

The work zeroes in on what AI researchers call safety alignment: the training that nudges a model’s responses to match human values and social norms. In practice, that means refusing to answer some questions, redirecting others, and giving more cautious or supportive responses in sensitive situations.

Kim noted the team is tackling “two challenges” that have made safety alignment difficult to get right.

“The first challenge is the so-called alignment tax, which refers to the fact that incorporating safety alignment has an adverse effect on the accuracy of a model’s outputs,” Kim added.

In other words, making a model safer can sometimes make it less helpful or less precise on legitimate questions.

“The second challenge is that existing LLMs generally incorporate safety alignment at a superficial level, which makes it possible for users to circumvent safety features,” first author Jianwei Li, a doctoral student at NC State, said in the news release.

Li illustrated the loophole: “For example, if a user asks for instructions to steal money, a model will likely refuse. But if a user asks for instructions to steal money in order to help people, the model would be more likely to provide that information.”

That kind of loophole becomes even more concerning when people fine-tune models. Fine-tuning is the process of taking a general-purpose LLM and training it further on specialized data for a particular company, industry or task. Previous research has shown that fine-tuning can unintentionally weaken a model’s built-in safety protections.

“Our goal with this work was to provide a better understanding of existing safety alignment issues and outline a new direction for how to implement a non-superficial safety alignment for LLMs,” Li added.

To organize the problem, the team introduced what they call the Superficial Safety Alignment Hypothesis. This framework describes how most current LLMs treat safety as a simple, one-time decision: a user request is labeled either safe or unsafe at the very start of the response process. If it is judged safe, the model goes ahead and generates an answer. If it is judged unsafe, the model declines.

That binary approach, the researchers argue, leaves models vulnerable to clever rephrasing and context tricks. It also means the model does not reconsider safety as it builds a long, multi-step answer.
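The one-shot pattern the hypothesis describes can be sketched in a few lines. This is a toy illustration, not the authors' code: the keyword check stands in for a learned safety classifier, and the point is that the safety judgment happens exactly once, before any text is generated.

```python
# Toy sketch of a one-shot safety gate: the request is classified
# once, up front, and generation never revisits that decision.

def looks_unsafe(prompt: str) -> bool:
    # Hypothetical keyword check standing in for a learned classifier.
    banned = ("steal", "harm")
    return any(word in prompt.lower() for word in banned)

def respond(prompt: str) -> str:
    if looks_unsafe(prompt):           # single, up-front judgment
        return "I can't help with that."
    return f"Answer to: {prompt}"      # generation proceeds unchecked

print(respond("how to steal money"))                    # refused
print(respond("how to acquire money that isn't mine"))  # rephrasing slips past
```

The second call shows why a single surface-level check is brittle: a rephrased request that avoids the trigger pattern sails through, which mirrors the "steal money in order to help people" example Li describes.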

Using this hypothesis as a guide, the team dug into the inner workings of LLMs and identified specific safety-critical components, which they refer to as “neurons” within the model’s neural network. These are internal units that strongly influence whether the system chooses to fulfill or refuse a request.

“We found that ‘freezing’ these specific neurons during the fine-tuning process allows the model to retain the safety characteristics of the original model while adapting to new tasks in a specific domain,” added Li.

In practical terms, freezing means those safety-related parts of the model are locked in place while the rest of the system learns new patterns from specialized data. That lets organizations customize a model for, say, legal writing or customer support, without accidentally teaching it to ignore safety rules.
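A minimal sketch of what freezing means mechanically, under the assumption (for illustration only, this is not the paper's implementation) that the safety-critical units have already been identified by name: frozen parameters are simply excluded from the gradient update, so fine-tuning adjusts everything else while they stay fixed.

```python
# Hypothetical two-parameter "model": one safety-critical unit, one
# task unit. Freezing = skipping the frozen names in the update step.

weights = {"safety_neuron": 1.0, "task_neuron": 1.0}
frozen = {"safety_neuron"}            # identified as safety-critical

def sgd_step(grads: dict, lr: float = 0.1) -> None:
    for name, grad in grads.items():
        if name in frozen:            # freeze: leave this weight untouched
            continue
        weights[name] -= lr * grad    # ordinary gradient descent elsewhere

sgd_step({"safety_neuron": 0.5, "task_neuron": 0.5})
print(weights)  # safety weight unchanged; task weight has adapted
```

In a real deep-learning framework the same idea is typically expressed by marking the relevant parameters as non-trainable (for instance, disabling their gradients) before fine-tuning begins.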

“And we demonstrated that we can minimize the alignment tax while preserving safety alignment during the fine-tuning process,” Kim added.

The team’s results suggest that developers do not have to choose between safety and performance as starkly as before. By isolating and protecting the parts of a model that govern refusal behavior, they were able to keep safety intact while still improving the model’s skills in a new domain.

Kim emphasized the work does two things at once: it offers a conceptual lens for understanding why current safety methods fall short, and it delivers a concrete technique that can be used today when fine-tuning models.

“The big picture here is that we have developed a hypothesis that serves as a conceptual framework for understanding the challenges associated with safety alignment in LLMs, used that framework to identify a technique that helps us address one of those challenges, and then demonstrated that the technique works,” Kim said.

Still, the researchers see their contribution as a starting point rather than a final fix. Their hypothesis highlights a deeper limitation: current models tend to make a single safety judgment up front and then stick with it, even if the response veers into more dangerous territory as it unfolds.

“Moving forward, our work here highlights the need to develop techniques that will allow models to continuously re-evaluate and re-select their reasoning direction – safe or unsafe – throughout the response generation process,” Li added.
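The direction Li describes can be contrasted with the one-shot gate in a short sketch. This is a hedged toy, not a proposed implementation: the per-step checker and the word-by-word "generation" loop are stand-ins for a real model, but the structure shows safety being re-evaluated as the answer unfolds rather than only at the start.

```python
# Toy sketch of continuous safety re-evaluation: after every generation
# step, the partial output is re-checked, and generation halts the moment
# it drifts into unsafe territory.

def step_safe(partial: str) -> bool:
    return "steal" not in partial      # hypothetical per-step check

def generate(tokens: list[str]) -> str:
    out = []
    for tok in tokens:
        out.append(tok)
        if not step_safe(" ".join(out)):   # re-check after each step
            out[-1] = "[stopped: unsafe continuation]"
            break
    return " ".join(out)

print(generate(["first", "you", "steal", "the", "money"]))  # halted mid-response
print(generate(["write", "a", "polite", "email"]))          # completes normally
```

Unlike the up-front gate, this loop can catch a response that starts out innocuous and only turns dangerous partway through, which is exactly the failure mode the researchers say current models leave open.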

The study, titled “Superficial Safety Alignment Hypothesis,” will be presented at the Fourteenth International Conference on Learning Representations (ICLR 2026) in Rio de Janeiro, Brazil. The team has also released code and additional details online so other researchers and developers can explore and build on their approach.

As LLMs continue to spread into classrooms, clinics, workplaces and homes, methods like these could help ensure that more powerful AI systems remain not just useful, but reliably safe.

Source: North Carolina State University