xAI Launches Grok Speech-to-Text and Text-to-Speech APIs

xAI has released standalone speech-to-text and text-to-speech APIs built on the same infrastructure powering Grok Voice, Tesla vehicles and Starlink support. The pricing — $0.10/hour for transcription and $4.20 per million characters for voice — undercuts every major rival by a wide margin.

Peter Corrigan

xAI has launched two standalone audio APIs — Grok Speech to Text (STT) and Grok Text to Speech (TTS) — and made them immediately available to any developer with an xAI API key. No waitlist, no staged rollout. The announcement positions xAI as a direct challenger to OpenAI, ElevenLabs, Deepgram and AssemblyAI in a market that has been commoditizing fast but still carries steep price tags for many independent builders.

The headline differentiator isn’t a benchmark score; it’s the price. Grok TTS is priced at $4.20 per million characters. OpenAI currently charges $15 per million characters for its TTS product, ElevenLabs runs closer to $50 per million characters depending on subscription tier, and Cartesia and InWorld sit at $46.70 and $40, respectively. That puts xAI’s rate roughly 72% to 92% below those four competitors. On the transcription side, Grok STT comes in at $0.10 per hour for batch processing and $0.20 per hour for real-time streaming, compared with $0.22 and $0.39 at ElevenLabs, $0.21 and $0.45 at AssemblyAI, and $0.31 and $0.55 at Deepgram. OpenAI’s Whisper and GPT-4o Transcribe run $0.36 per hour, with a cheaper mini tier at $0.18.
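The 72% to 92% range follows directly from the per-million-character TTS rates quoted above. A minimal sketch of that arithmetic, using only the figures cited in this article (the names `RIVALS` and `percent_below` are illustrative, not part of any API):

```python
# TTS list prices in USD per million characters, as quoted in this article.
GROK_TTS = 4.20
RIVALS = {
    "OpenAI": 15.00,
    "ElevenLabs": 50.00,  # approximate; the article notes it varies by tier
    "Cartesia": 46.70,
    "InWorld": 40.00,
}


def percent_below(price: float, rival_price: float) -> float:
    """Return how far `price` sits below `rival_price`, as a percentage."""
    return (1 - price / rival_price) * 100


# Round to one decimal place so the comparison is easy to read.
discounts = {
    name: round(percent_below(GROK_TTS, rival), 1)
    for name, rival in RIVALS.items()
}

for name, pct in discounts.items():
    print(f"Grok TTS is about {pct}% below {name}")
```

The smallest gap is against OpenAI (about 72%) and the largest against ElevenLabs (just under 92%), which is where the quoted range comes from.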

What the APIs Actually Do

The STT API supports both batch (REST) and real-time streaming (WebSocket) modes from a single unified endpoint, an architectural choice that spares developers from maintaining separate integrations for pre-recorded and live audio. Features include word-level timestamps, speaker diarization for identifying who said what in both recorded and live audio, multichannel audio support for clean speaker separation, and inverse text normalization, which converts spoken language into properly formatted output: numbers, dates and currencies are rendered the way a human reader would expect (“$25” rather than “twenty-five dollars”) instead of as raw transcribed words. The API supports more than 25 languages.
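To show how word-level timestamps and speaker diarization are typically consumed together, here is a minimal sketch that folds a list of timestamped words into speaker turns. The input shape (dictionaries with `word`, `start`, `end` and `speaker` keys) is an assumption made for illustration; it is not the documented Grok STT response format, and `Turn` and `group_into_turns` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One uninterrupted stretch of speech by a single speaker."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def group_into_turns(words: list[dict]) -> list[Turn]:
    """Merge consecutive words from the same speaker into one turn.

    Assumes each item has 'word', 'start', 'end' and 'speaker' keys
    (a hypothetical shape, not the real API schema).
    """
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w["speaker"]:
            # Same speaker is still talking: extend the current turn.
            turns[-1].end = w["end"]
            turns[-1].text += " " + w["word"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w["speaker"], w["start"], w["end"], w["word"]))
    return turns


# Hand-written sample data, not real API output.
sample = [
    {"word": "Hi", "start": 0.0, "end": 0.2, "speaker": "A"},
    {"word": "there", "start": 0.2, "end": 0.5, "speaker": "A"},
    {"word": "Hello", "start": 0.7, "end": 1.0, "speaker": "B"},
]
turns = group_into_turns(sample)
for t in turns:
    print(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}")
```

A real integration would map whatever field names the API actually returns onto this structure before grouping.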

The TTS side emphasizes expressiveness. Developers can inject emotional cues and delivery instructions using inline and wrapping speech tags — options include [laugh], [sigh], [breath], <whisper>, <emphasis>, <slow>, and <pause>, among others. This addresses one of the longest-standing frustrations with TTS systems: technically accurate but emotionally flat output. ElevenLabs has been the prior benchmark for expressive AI voice, but its subscription-based pricing model — with separate character quotas, opaque overage mechanics, and features spread across tiers — adds friction before a developer can even start building.
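As a rough illustration of how inline and wrapping tags might be composed into a TTS input string, the sketch below builds a short script from the tag names listed above. The closing-tag form (`</whisper>`) and the helper names `cue` and `wrap` are assumptions made for illustration; the exact syntax should be checked against xAI’s documentation.

```python
# Tag names taken from the list in this article; the sets are not exhaustive.
INLINE_CUES = {"laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "emphasis", "slow"}


def cue(name: str) -> str:
    """Return an inline cue such as [laugh]."""
    if name not in INLINE_CUES:
        raise ValueError(f"unknown inline cue: {name}")
    return f"[{name}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a delivery tag such as <whisper>...</whisper>.

    The XML-style closing tag is an assumption, not confirmed syntax.
    """
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag}")
    return f"<{tag}>{text}</{tag}>"


script = " ".join([
    "I can't believe that worked.",
    cue("laugh"),
    wrap("whisper", "Don't tell anyone."),
])
print(script)
```

Validating tag names before sending text to the API keeps typos from being read aloud as literal words.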

Built on Production Infrastructure

xAI notes that these APIs are built on the same stack that powers Grok Voice on mobile, voice interfaces in Tesla vehicles, and Starlink customer support. That’s a meaningful credibility signal: the underlying infrastructure has already processed millions of real-world interactions across high-stakes environments, a scale many startups cannot replicate or easily audit in a third-party provider. xAI also says the platform supports SOC 2, HIPAA and GDPR compliance, which matters for developers building in health care, legal, or financial verticals who need to think ahead to production requirements.

One important competitive nuance on the STT side: OpenAI’s Whisper does not support real-time streaming out of the box. Developers who need live transcription with OpenAI’s stack must use the Realtime API, which only reached general availability in August 2025. Grok STT bundles both modes into one endpoint from day one.

Why This Matters for Students and Early-Career Developers

For students, recent grads and indie developers, the practical impact is significant. Building a voice-enabled app — a podcast transcription tool, an AI study assistant with spoken output, a health care accessibility tool for a capstone project — has historically meant either accepting limited free tiers or paying rates that made side-project budgets unsustainable at any real usage volume. At $4.20 per million characters for TTS and $0.10 per hour for transcription, those calculations change materially.
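To make the budget impact concrete, the sketch below estimates a monthly bill for a hypothetical side project. The workload (40 hours of batch transcription and 2 million characters of speech output) is an invented example, and the rates are the list prices quoted in this article; actual bills depend on each provider’s current pricing and billing rules.

```python
# Hypothetical monthly workload for a student project (invented numbers).
HOURS_TRANSCRIBED = 40
MILLION_CHARS_SPOKEN = 2

# Rates quoted in this article: (batch STT USD/hour, TTS USD/million chars).
RATES = {
    "xAI": (0.10, 4.20),
    "OpenAI": (0.36, 15.00),
}


def monthly_cost(stt_per_hour: float, tts_per_million: float) -> float:
    """Estimate one month's spend for the workload above, in USD."""
    total = HOURS_TRANSCRIBED * stt_per_hour + MILLION_CHARS_SPOKEN * tts_per_million
    return round(total, 2)


costs = {name: monthly_cost(stt, tts) for name, (stt, tts) in RATES.items()}
for name, usd in costs.items():
    print(f"{name}: ${usd:.2f} per month")
```

Under these assumptions the same workload comes to $12.40 at xAI’s rates versus $44.40 at OpenAI’s.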

The compliance coverage also means a student or early-stage founder building in a regulated domain won’t necessarily need to migrate providers the moment they move from prototype to production. That kind of continuity has real value when you’re trying to move fast.

For developers building voice products, the pricing alone makes Grok STT and TTS worth evaluating seriously, particularly for anyone who has been absorbing ElevenLabs or OpenAI rates on projects where voice is a core feature rather than an add-on. And for new grads building portfolios, experience with the Grok voice API stack is directly transferable: the same infrastructure runs in production across consumer apps, automotive and enterprise support at a scale few providers can point to.

The speech API market is crowded, and xAI is entering late relative to established players. But aggressive pricing combined with expressive TTS controls and a unified STT endpoint is a credible opening move. The open question is whether xAI sustains these rates as usage scales, something worth watching as the platform matures.

Source: xAI
