University of Washington researchers unveil AI headphones that translate multiple speakers in real time while preserving their unique voice qualities. This groundbreaking technology could revolutionize communication across languages.
Researchers at the University of Washington (UW) have developed AI-powered headphones that can translate multiple speakers at once while preserving the distinct qualities and directions of their voices. The system, known as Spatial Speech Translation, promises a significant advance in real-time language translation.
Tuochao Chen, a UW doctoral student in the Paul G. Allen School of Computer Science & Engineering, recently ran into a familiar barrier during a museum tour in Mexico: with noise all around, the translation app on his phone could not make out the Spanish being spoken. The experience underscored the limitations of current translation apps, which are easily overwhelmed by background sound.
Inspired by this challenge, Chen and his team set out to create a solution that could transcend these limitations.
“Other translation tech is built on the assumption that only one person is speaking,” senior author Shyam Gollakota, a UW professor in the Allen School, said in a news release. “But in the real world, you can’t have just one robotic voice talking for multiple people in a room. For the first time, we’ve preserved the sound of each person’s voice and the direction it’s coming from.”
The Spatial Speech Translation system uses off-the-shelf noise-canceling headphones fitted with microphones. Its algorithms work like radar, scanning the environment in 360 degrees to detect and track multiple speakers and translating their speech with a short 2-4 second delay. This approach preserves each speaker's voice authentically, maintaining its expressive qualities and volume.
“Our algorithms work a little like radar,” added Chen. “So it’s scanning the space in 360 degrees and constantly determining and updating whether there’s one person or six or seven.”
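The article does not detail the team's actual algorithms, but the radar-like scan can be pictured with a toy delay-and-sum beamformer: steer a pair of microphone signals toward each candidate direction and score that direction by beamformed energy, so that peaks mark likely speakers. The two-channel setup, microphone spacing, and sample rate in the sketch below are illustrative assumptions, not details of the UW system.

```python
# A minimal sketch of a radar-like 360-degree scan using delay-and-sum
# beamforming on two microphone channels. This illustrates the general idea,
# not the UW implementation.
import numpy as np

SPEED_OF_SOUND = 343.0   # meters per second
MIC_SPACING = 0.15       # assumed distance between the two earcup mics, meters
SAMPLE_RATE = 16_000     # assumed audio sample rate, Hz


def scan_directions(left: np.ndarray, right: np.ndarray, n_angles: int = 360) -> np.ndarray:
    """Score every candidate azimuth (in degrees) by beamformed energy."""
    scores = np.zeros(n_angles)
    for deg in range(n_angles):
        # Expected time difference of arrival for a far-field source at this azimuth.
        tdoa = MIC_SPACING * np.cos(np.deg2rad(deg)) / SPEED_OF_SOUND
        shift = int(round(tdoa * SAMPLE_RATE))
        # Align the right channel to the left and sum (delay-and-sum beamforming).
        beam = left + np.roll(right, shift)
        scores[deg] = float(np.mean(beam ** 2))
    return scores


if __name__ == "__main__":
    # Synthetic check: broadband noise arriving from roughly 60 degrees.
    rng = np.random.default_rng(0)
    source = rng.standard_normal(SAMPLE_RATE)  # one second of noise
    true_shift = int(round(MIC_SPACING * np.cos(np.deg2rad(60)) / SPEED_OF_SOUND * SAMPLE_RATE))
    left, right = source, np.roll(source, -true_shift)
    print("strongest direction (degrees):", int(np.argmax(scan_directions(left, right))))
```

A real system would need more microphones and far more sophisticated separation and tracking to resolve front-back ambiguity, follow several people at once, and feed clean per-speaker audio into translation and voice rendering; the sketch only shows why an angular energy scan can reveal where voices are coming from.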
The research team presented their findings at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan. The code for the proof-of-concept device is open source, allowing others to build on and extend the work.
The system runs on devices with an Apple M2 chip, such as laptops and the Apple Vision Pro, and avoids cloud computing to address privacy concerns around voice cloning. In tests across 10 indoor and outdoor environments, users consistently favored the new system over earlier models that did not track speakers through space.
In one user test, participants preferred a 3-4 second delay because the system made fewer errors than it did with a 1-2 second delay.
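As a rough illustration of that trade-off, one can picture the incoming audio being buffered into chunks before each chunk is translated: a longer chunk gives the translation step more context to work with, at the cost of a longer wait. The chunk length and the translate_chunk() placeholder below are purely hypothetical stand-ins for whatever streaming translation model a system like this would use.

```python
# A toy sketch of buffering streamed audio into fixed-length chunks before
# translation. translate_chunk() is a hypothetical placeholder, not part of
# the UW system; longer chunks trade latency for more context per translation.
from collections import deque
from typing import Iterable, List

SAMPLE_RATE = 16_000     # assumed audio sample rate, Hz
CHUNK_SECONDS = 3.0      # roughly the delay participants preferred (3-4 s)


def translate_chunk(samples: List[float]) -> str:
    """Stand-in for a speech-translation model call on one chunk of audio."""
    return f"<translation of {len(samples) / SAMPLE_RATE:.1f} s of speech>"


def streaming_translate(frames: Iterable[List[float]]) -> None:
    """Accumulate incoming audio frames and translate once a full chunk is buffered."""
    chunk_size = int(CHUNK_SECONDS * SAMPLE_RATE)
    buffer: deque = deque()
    for frame in frames:
        buffer.extend(frame)
        while len(buffer) >= chunk_size:
            chunk = [buffer.popleft() for _ in range(chunk_size)]
            print(translate_chunk(chunk))


if __name__ == "__main__":
    # Feed ten seconds of silence in 20 ms frames to show the chunking behavior.
    frame = [0.0] * int(0.02 * SAMPLE_RATE)
    streaming_translate(frame for _ in range(500))
```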
The device currently handles everyday speech rather than technical jargon, and it has been tested successfully with Spanish, German and French. Prior work on translation models suggests it could eventually be trained to handle roughly 100 languages.
“This is a step toward breaking down the language barriers between cultures,” Chen added. “So if I’m walking down the street in Mexico, even though I don’t speak Spanish, I can translate all the people’s voices and know who said what.”
Source: University of Washington