A new study shows that researchers can directly steer concepts inside large language models, making them more accurate and efficient — but also easier to jailbreak. The work opens a path to safer, more transparent AI while underscoring how fragile current guardrails can be.
Researchers have found a way to reach inside large language models and turn specific ideas up or down like a volume knob — a breakthrough that could make artificial intelligence both safer and more powerful, while also exposing new vulnerabilities.
The method, developed by a team led by Mikhail Belkin at the University of California San Diego and Adit Radhakrishnan at the Massachusetts Institute of Technology, allows scientists to locate and mathematically manipulate concepts encoded deep within popular AI systems. The work is published in the journal Science.
Large language models, or LLMs, power tools such as chatbots and code assistants. They are famously opaque: they generate fluent text, but how they represent ideas like fear, refusal, or conspiracy remains largely hidden inside billions of numerical connections. That black-box quality has made it hard to understand why models sometimes excel, sometimes fail, and sometimes go dangerously off the rails.
Belkin’s group set out to change that by building on earlier work they published in 2024, which introduced predictive algorithms called Recursive Feature Machines. Those algorithms can detect patterns in the mathematical operations inside an LLM that correspond to particular concepts.
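The article does not spell out how Recursive Feature Machines find these patterns, but work in this line of research is commonly built around the average gradient outer product (AGOP): averaging the outer product of a fitted predictor's gradients and reading concept directions off the top eigenvectors. The sketch below is only a minimal illustration of that idea with a toy linear predictor, not the team's actual algorithm; all names and data here are invented.

```python
import numpy as np

def agop(grads):
    """Average gradient outer product: M = (1/n) * sum_i g_i g_i^T.
    Its top eigenvectors point along the directions the predictor
    relies on most."""
    grads = np.asarray(grads)
    return grads.T @ grads / len(grads)

# Toy check: a linear predictor f(x) = w @ x has gradient w everywhere,
# so the AGOP is w w^T and its top eigenvector recovers w (up to sign).
w = np.array([0.6, 0.0, 0.8])              # unit-norm "feature" direction
M = agop([w] * 10)
_, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
top = vecs[:, -1]                          # eigenvector of the largest one
print(np.allclose(np.abs(top), np.abs(w)))  # → True
```

For a real network the gradients vary from input to input, so the AGOP aggregates many of them; the linear case above just makes the recovery exact and easy to verify.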
In the new study, the researchers used those patterns as handles. Once they identified where a concept lived inside a model, they increased or decreased its influence on the model’s output using relatively straightforward math.
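To make the "volume knob" idea concrete, here is a generic activation-steering sketch: estimate a direction associated with a concept, then add a scaled copy of it to a model's hidden state. This is an assumption-laden toy, not the paper's method; the function names and the random "activations" are invented for illustration.

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Estimate a concept direction as the difference of mean activations,
    normalized to unit length (a simple difference-of-means probe)."""
    d = np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Shift a hidden state along the concept direction.
    alpha > 0 turns the concept up; alpha < 0 turns it down."""
    return hidden + alpha * direction

# Toy 8-dimensional "activations": concept present vs. absent.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(50, 8))   # concept present
neg = rng.normal(0.0, 0.1, size=(50, 8))   # concept absent
d = concept_direction(pos, neg)

h = rng.normal(size=8)                 # one hidden state to modify
up = steer(h, d, alpha=4.0)            # amplify the concept
down = steer(h, d, alpha=-4.0)         # suppress it
print(up @ d > h @ d > down @ d)       # → True
```

In a real LLM the same arithmetic would be applied to the residual-stream activations at some layer during generation, which is why the intervention is so cheap compared with retraining.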
“We found that we could mathematically modify these patterns with math that is surprisingly simple,” Belkin said in a news release. Belkin is a professor in UC San Diego’s Halıcıoğlu Data Science Institute, part of the School of Computing, Information and Data Sciences.
The researchers tested their steering approach on several of the largest open-source LLMs available today, including Llama and DeepSeek models. They were able to identify and influence 512 distinct concepts grouped into five broad classes, ranging from fears and moods to locations.
Crucially, the method worked across languages. The researchers showed they could steer concepts not only in English, but also in languages such as Chinese and Hindi, suggesting that the internal representations they are tapping into are robust and general.
By turning concepts up or down, the team could change how the models behaved in targeted ways.
On the positive side, steering improved performance on narrow, precise tasks. For example, the researchers showed that they could boost an LLM’s ability to translate computer code from Python to C++, a demanding task that requires careful attention to syntax and logic. They also used the method to help identify hallucinations, the confident but incorrect statements that have become a notorious weakness of modern AI.
The same technique, however, can be used to undermine safety systems.
One of the concepts the team located was refusal — the internal tendency of a model to decline harmful or inappropriate requests. By dialing down the importance of that concept, they were able to push models outside their built-in guardrails, a practice known as jailbreaking.
Under this manipulated setting, a model provided instructions on how to use cocaine. In another case, it returned Social Security numbers, though the researchers could not determine whether those numbers were real or fabricated.
The team also showed that they could amplify political bias and a conspiratorial mindset. In one experiment, a steered model claimed a satellite image of Earth was part of a NASA plot to hide that the planet is flat. In another, a model asserted that the COVID vaccine was poisonous.
These examples highlight a double-edged reality: the same fine-grained control that can make AI safer and more reliable can also be weaponized to make it more deceptive or extreme.
Beyond safety, the new method offers a practical advantage: it is far less computationally demanding than many existing techniques for probing or retraining large models.
Using a single Ampere-series NVIDIA A100 graphics processing unit, the researchers needed less than one minute and fewer than 500 training samples to identify the patterns associated with a concept of interest and steer the model accordingly. That efficiency suggests the approach could be integrated into standard LLM training and monitoring pipelines without massive hardware costs.
The team was not able to test commercial, closed-source models such as Claude, but they believe the same principles should apply broadly to open-source systems of different sizes.
“We observed that newer and larger LLMs were more steerable,” the researchers write, indicating that as models grow, their internal concepts may become even easier to isolate and control. They also suggest that the method could extend to smaller open-source models that can run on a laptop, potentially democratizing access to fine-grained AI steering tools.
Looking ahead, the researchers plan to refine their approach so that steering can adapt to specific inputs and applications. Instead of globally turning a concept up or down, future methods might adjust a model’s internal settings on the fly, depending on the context of a user’s request.
That kind of dynamic control could help AI systems better match human values, avoid harmful outputs, and reveal when they are uncertain or hallucinating — all key goals for making AI trustworthy in high-stakes settings like education, health care, law and public policy.
“These results suggest that the models know more than they express in responses and that understanding internal representations could lead to fundamental performance and safety improvements,” the researchers write.
The work points toward a future in which AI models are not just powerful text generators, but systems whose inner workings can be monitored, steered and audited.
For students and researchers entering the field, the study underscores a growing shift in AI: from building ever-larger black boxes to opening them up, concept by concept, in search of systems that are not only capable, but also controllable and accountable.
