A new AI model from Kaunas University of Technology helps computers move beyond simply “seeing” objects to understanding their meaning in real-world 3D scenes. The advance could make self-driving cars, drones and digital twins safer and more reliable.
Self-driving cars that can spot a partially hidden pedestrian at dusk. Drones that can safely weave through crowded city streets. Digital twins of entire cities that update in near real time.
A team at Kaunas University of Technology (KTU) in Lithuania has developed an artificial intelligence model that brings these scenarios closer to everyday reality by helping machines understand the 3D world more like humans do.
The new model tackles one of the toughest problems in computer vision: making sense of 3D point clouds, the millions of data points that laser sensors collect when they scan streets, forests or buildings.
KTU professor Rytis Maskeliūnas explained the basic idea behind this technology.
“Imagine taking millions of precise laser measurements of a physical space, like a street, a forest, or an entire city, and stitching them together to create a detailed three-dimensional map made up of individual points. This is known as a 3D point cloud. The technology used to analyse it focuses on helping computers understand the shapes of objects in the map and interpret their context within the scene,” Maskeliūnas said in a news release.
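To make that concrete, here is a minimal sketch, in Python with NumPy and not the KTU team's actual code, of how such a point cloud is typically represented: a large array of coordinates, optional per-point attributes such as laser intensity, and a label slot for each point that a segmentation model must fill in.

```python
import numpy as np

# A minimal sketch, not the KTU team's code: a lidar point cloud is
# commonly stored as an N x 3 array of x, y, z coordinates, often with
# extra per-point attributes such as laser return intensity.
rng = np.random.default_rng(0)

num_points = 1_000_000                                   # a street-scale scan
xyz = rng.uniform(0.0, 100.0, size=(num_points, 3)).astype(np.float32)
intensity = rng.uniform(0.0, 1.0, size=num_points).astype(np.float32)

# Semantic segmentation then assigns every single point a class label.
classes = ["road", "building", "vegetation", "vehicle", "pedestrian"]
labels = np.zeros(num_points, dtype=np.int64)            # filled in by a model

print(xyz.shape, intensity.shape, labels.shape)
```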
These 3D maps already underpin many tools we use every day, from driver-assistance systems in modern cars to detailed digital models of cities used for urban planning and infrastructure monitoring.
KTU researcher Sarmad Maqsood noted that most people are surrounded by this technology without realizing it.
“An average person regularly encounters the underlying 3D data and technologies similar to those described in our work without even realising it,” Maqsood said in the news release.
Cars, cities and digital twins
In today’s vehicles, sensors and 3D data help power features such as automatic emergency braking and adaptive cruise control. These systems need to distinguish between pedestrians, cyclists, vehicles and road edges, often in poor weather, low light or crowded environments.
Beyond transportation, 3D point clouds are used to build high-resolution digital models of urban areas. These models support so-called digital twins — virtual replicas of real-world environments that can be updated continuously to track changes in buildings, roads, vegetation and more.
But teaching computers to read these complex 3D scenes is far from simple.
“Computers face significant difficulties in analysing 3D point clouds primarily because this data type is inherently irregular, unstructured, and massive,” Maqsood added.
In a point cloud, nearby objects may be captured with many dense points, while distant ones are represented sparsely. Important elements such as pedestrians or small obstacles can appear far less frequently than dominant surfaces like roads or building facades. On top of that, real-world data is full of noise and occlusions, where objects block each other from view.
All of this makes it hard for algorithms to reliably identify and label each point as part of a road, tree, vehicle, person or other object — especially when decisions need to be made in real time for safety-critical systems.
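A hypothetical tally makes the imbalance vivid; the numbers below are invented for illustration, but the lopsidedness they show is typical of a single street scan.

```python
# Invented label counts for one street scan, illustrating the imbalance
# the researchers describe: dominant surfaces dwarf the safety-critical
# classes that matter most.
counts = {"road": 620_000, "building": 300_000, "vegetation": 70_000,
          "vehicle": 9_000, "pedestrian": 1_000}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n / total:7.3%}")   # pedestrians: ~0.1% of points
```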
A hybrid model that sees both detail and context
To overcome these challenges, the KTU team designed a model that blends several ways of analyzing 3D data into a single, unified system.
Traditional approaches often focus either on local details — the fine-grained shape of a curb or a car bumper — or on the global structure of a scene, such as the overall layout of a street. The new model is built to do both at once.
At its core is a transformer-based method, a type of AI architecture originally popularized in natural language processing. In this context, transformers help the system capture relationships across an entire 3D scene, rather than treating each region in isolation.
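Stripped to its core, and leaving out the learned query, key and value projections and the downsampling that real systems need before attending over millions of points, transformer-style self-attention looks something like this sketch:

```python
import numpy as np

def self_attention(features: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over per-point features.

    Every point attends to every other point, so each output feature
    mixes in context from the whole scene rather than just its local
    neighbourhood. (Illustrative only: real models add learned
    projections and never attend over millions of raw points at once.)
    """
    d = features.shape[-1]
    scores = features @ features.T / np.sqrt(d)      # pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ features                        # context-mixed features

rng = np.random.default_rng(1)
feats = rng.normal(size=(256, 64)).astype(np.float32)  # 256 sampled points
print(self_attention(feats).shape)                     # -> (256, 64)
```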
On top of that, the model includes mechanisms that deliberately emphasize rare but important features. That makes it better at handling imbalanced data, where small or less frequent objects might otherwise be overlooked.
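One common way to achieve this, shown here as an illustrative stand-in rather than the paper's actual mechanism, is to weight the training loss by inverse class frequency so that each rare point counts for more than a common one.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_counts):
    """Per-point cross-entropy with inverse-frequency class weights.

    An illustrative stand-in, not necessarily the KTU model's mechanism:
    up-weighting rare classes keeps a few pedestrian points from being
    drowned out by millions of road points during training.
    """
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = counts.sum() / (len(counts) * counts)      # rare class -> big weight
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_point = -log_probs[np.arange(len(labels)), labels]
    return (weights[labels] * per_point).mean()

rng = np.random.default_rng(3)
logits = rng.normal(size=(1000, 5))                      # 1000 points, 5 classes
labels = rng.integers(0, 5, size=1000)
print(weighted_cross_entropy(logits, labels,
                             [620_000, 300_000, 70_000, 9_000, 1_000]))
```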
Maskeliūnas compared the challenge to solving a chaotic 3D jigsaw puzzle.
“Imagine you have a massive, messy 3D puzzle made of millions of points that needs to be sorted into meaningful objects like roads, trees, and pedestrians. Our model acts like a highly intelligent and efficient puzzle-solver,” he said.
By learning how points relate to one another across the whole scene and by boosting the signal from underrepresented objects, the system improves detection of small, partially hidden or sparsely captured items that older methods might miss.
Seeing the person in the noise
This ability becomes crucial in real-world scenarios, such as an autonomous vehicle approaching an intersection at dusk. In that situation, a pedestrian might only appear as a handful of scattered points, partially obscured by a parked car or street furniture.
Instead of treating those few points as meaningless noise, the KTU model uses context to work out what is really there. It relates the sparse data to nearby structures like a pole, sidewalk or crosswalk and infers that those points likely belong to a person.
This contextual reasoning could have a direct impact on safety, according to Maskeliūnas.
“Instead of missing this information, the model interprets it in context – relating sparse signals to surrounding elements such as a pole or a crosswalk – and identifies the presence of a person even when the data is incomplete. This ability to interpret context from limited information could significantly improve safety in autonomous systems,” said Maskeliūnas.
Crucially, the system is designed to be efficient as well as accurate. According to the researchers, it can process complex 3D scenes in just over two seconds per frame while maintaining high accuracy.
Maqsood emphasized that the technical advance is not only about better segmentation — the task of assigning each point to a category like road, tree or pedestrian — but also about how the entire workflow is streamlined.
“Beyond segmentation accuracy, a key achievement is the demonstration of an efficient, unified pipeline,” he said.
The model integrates compression and transmission into the process, allowing large-scale 3D data to be handled and shared in near real time without losing essential detail. That is important for applications where data must be sent between vehicles, drones, servers or city management systems.
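As a rough illustration of the compression side, and not the scheme used in the paper, a simple voxel-grid filter shows how a dense scan can be shrunk for transmission while keeping the overall geometry a downstream model relies on.

```python
import numpy as np

def voxel_downsample(xyz: np.ndarray, voxel: float) -> np.ndarray:
    """Keep one representative point per voxel of side `voxel` metres.

    A deliberately simple compression sketch, not the paper's scheme:
    snapping points to a coarse grid shrinks a dense scan while
    preserving its large-scale shape.
    """
    keys = np.floor(xyz / voxel).astype(np.int64)        # voxel index per point
    _, first = np.unique(keys, axis=0, return_index=True)
    return xyz[np.sort(first)]

rng = np.random.default_rng(4)
cloud = rng.uniform(0.0, 5.0, size=(200_000, 3)).astype(np.float32)
compact = voxel_downsample(cloud, voxel=0.2)
print(len(cloud), "->", len(compact), "points")          # ~13x fewer points here
```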
What comes next
While the work is immediately relevant to autonomous driving and smart cities, the researchers see many other potential uses.
Delivery drones could navigate cluttered, unpredictable environments more safely. Robots in search-and-rescue missions could better interpret collapsed structures or debris fields from sparse sensor data. Archaeologists might reconstruct ruins from limited scans, and forensic investigators could analyze subtle spatial details at crime or accident scenes.
The same underlying capability — turning messy, incomplete 3D measurements into meaningful understanding — could also power more advanced augmented reality, where digital content is precisely anchored to complex real-world spaces.
As 3D sensing becomes cheaper and more widespread, from lidar-equipped phones to city-scale mapping projects, tools like the KTU model may become a key part of how machines perceive and manage the built and natural environment.
The team’s study, published in the journal Remote Sensing of Environment, suggests that machines are moving beyond simply capturing the world in three dimensions. They are beginning to interpret it in ways that are closer to how people see and understand their surroundings.
Source: Kaunas University of Technology
