A new training method from UC San Diego helps AI reason more like a careful student, not a guesser, especially on math problems that mix text and images. The approach could power safer AI tutors and more reliable analysis of charts, reports and scientific papers.
Artificial intelligence systems are getting better at answering tough questions, but they still have a bad habit: they can guess correctly without really understanding the problem.
Engineers at the University of California San Diego say they have built a smarter way to train AI so it has to show its work, especially on complex tasks that combine text and images, such as math word problems with charts and diagrams.
Their new method, presented at the NeurIPS conference in December 2025, pushed AI models to the top of widely used tests of visual mathematical reasoning. The researchers say the same ideas could lead to more trustworthy AI tutors, as well as tools that can reliably analyze business reports, complex charts and scientific papers with less risk of making things up.
Most current AI systems are trained and evaluated almost entirely on whether they land on the right final answer.
Study senior author Pengtao Xie, a professor in the Department of Electrical and Computer Engineering at the UC San Diego Jacobs School of Engineering, compared that approach to a familiar classroom experience.
“They are graded much like students taking a multiple-choice test,” he said in a news release. “If they select the right answer, they still receive full credit, even if they guessed.”
The UC San Diego team’s method flips that script. Instead of rewarding an AI model just for being right, the system scores how well the model reasons its way through a problem.
“It gets rewarded for thinking logically, step by step, rather than just guessing correctly,” Xie added. “If it gets the right answer using the wrong logic, it doesn’t get rewarded.”
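The news release does not spell out the exact reward formula, but the general idea can be sketched in a few lines of Python. In the sketch below, `step_is_sound` is a hypothetical stand-in for whatever verifier judges whether an individual reasoning step is valid, and the weighting between reasoning quality and the final answer is purely illustrative, not the researchers' actual design.

```python
# Minimal sketch of outcome-based vs. process-based scoring.
# The step-checking logic is a placeholder, not the UC San Diego system.

from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]      # the model's chain of reasoning
    final_answer: str     # the answer it lands on

def outcome_reward(sol: Solution, correct_answer: str) -> float:
    """Classic scoring: full credit for the right answer, even if it was a guess."""
    return 1.0 if sol.final_answer == correct_answer else 0.0

def process_reward(sol: Solution, correct_answer: str, step_is_sound) -> float:
    """Process-style scoring: each reasoning step is judged on its own,
    so a correct answer reached through flawed logic earns little."""
    if not sol.steps:
        return 0.0
    step_scores = [step_is_sound(s) for s in sol.steps]   # 1.0 if a step is logically valid
    reasoning_quality = sum(step_scores) / len(step_scores)
    answer_bonus = 1.0 if sol.final_answer == correct_answer else 0.0
    # Weight reasoning quality above the final answer, so lucky guesses score poorly.
    return 0.7 * reasoning_quality + 0.3 * answer_bonus
```

Under this kind of scoring, a solution that stumbles into the right answer through faulty steps earns far less than one that reasons soundly all the way through.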
That shift, from asking “Did the AI get it right?” to “Did the AI think it through?”, is more than a philosophical change. In high-stakes settings like medical diagnosis, financial analysis or engineering design, a confident but poorly reasoned answer can be dangerous. A system that is trained to value sound reasoning over lucky guesses is better positioned to flag uncertainty, avoid shortcuts and provide explanations that humans can check.
Until now, this kind of “process-based” training has mostly been applied to text-only models. Extending it to multimodal models — systems that must interpret both language and images — adds another layer of difficulty: the training data itself.
AI models learn from massive collections of example problems and solutions. But not all data are created equal. Some datasets are rich, detailed and challenging. Others are noisy, too simple or only loosely related to the task. If a model treats all of that material as equally useful, it can slow down or even confuse its learning.
Xie illustrated the problem with a vivid comparison.
“It’s like trying to learn calculus when half of your reading list consists of kindergarten coloring books,” he said.
To tackle this, the team built a training system that acts as a kind of smart curator. Instead of feeding the model every example with the same importance, the method learns to assign different weights to different datasets. High-quality, challenging examples count more; low-quality or irrelevant ones are downplayed.
The system then checks its own progress on a separate set of problems and uses that feedback to keep adjusting how it prioritizes training data over time.
“Our system doesn’t just learn from everything,” added Xie. “It learns what is worth learning from. It emphasizes quality over quantity.”
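The release describes this loop only at a high level: weight the datasets, check progress on held-out problems, adjust. The rough Python sketch below conveys that loop; `train_step` and `validation_score` are hypothetical callables, and the multiplicative-weights update is a simple stand-in for whatever optimization the researchers actually use.

```python
# Sketch of dataset reweighting driven by held-out validation feedback.
# The update rule is an illustrative heuristic, not the published method.

import math
import random

def reweight_datasets(datasets, train_step, validation_score, rounds=10, lr=0.5):
    """datasets: dict mapping dataset name -> list of training examples
    train_step(examples): trains the model on a batch of examples
    validation_score(): returns performance on held-out problems (higher is better)
    """
    weights = {name: 1.0 for name in datasets}                # start with equal trust
    for _ in range(rounds):
        for name, examples in datasets.items():
            before = validation_score()
            batch_size = max(1, int(32 * weights[name]))      # higher weight -> more examples drawn
            train_step(random.sample(examples, min(batch_size, len(examples))))
            gain = validation_score() - before
            # Datasets that improve held-out reasoning get upweighted; noisy or
            # too-easy ones that don't help are gradually downplayed.
            weights[name] *= math.exp(lr * gain)
        total = sum(weights.values())
        weights = {n: w / total for n, w in weights.items()}  # renormalize to a distribution
    return weights
```

The key design choice the sketch tries to capture is that the weights are not set by hand: they are continually adjusted based on how much each data source actually helps on a separate set of problems.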
This two-part strategy — grading the reasoning process and curating the training data — paid off in tests. When evaluated on multiple benchmarks that measure visual and mathematical reasoning, the team’s system consistently outperformed other training methods.
On MathVista, a widely used benchmark that tests how well AI can solve math word problems that include charts and diagrams, a model trained with the UC San Diego method achieved a top public score of 85.2%, according to the researchers. The result was verified by MathVista’s organizers.
Beyond raw scores, the team sees the work as a step toward making advanced reasoning AI more accessible. Many of today’s most capable models are huge, proprietary systems that require enormous computing resources to train and run. The new training approach helps smaller, open models narrow that gap, according to Xie.
“You don’t need a trillion-dollar computing cluster to get state-of-the-art reasoning,” he said.
That could open the door for schools, small companies and individual developers to build specialized AI tools that run on personal computers or modest servers, rather than relying entirely on tech giants’ cloud platforms.
For students, one promising application is AI tutors that can walk through a math or science problem line by line, checking each step for logical consistency instead of just spitting out an answer. For professionals, better multimodal reasoning could mean AI systems that can read a financial report, interpret its graphs and tables, and explain their implications in clear language — all while being less likely to misread a chart or invent a trend.
Next, the team plans to judge training data quality at a finer grain, moving from scoring entire datasets to evaluating individual questions and problems. The researchers are also working to make the training process faster and less computationally demanding.
As AI systems become more deeply embedded in education, business and research, methods like this — which push models to think carefully, not just answer quickly — may play a key role in making the technology both more powerful and more trustworthy.