ElevenLabs, an AI startup recently valued at $3.3 billion after raising $180 million in funding, has moved beyond its well-known audio generation capabilities with the launch of Scribe, its first standalone speech-to-text model.

Previously recognized for providing text-to-speech services through its vast voice library, ElevenLabs is now aiming to compete in the speech recognition space, positioning itself alongside industry players such as Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.

Scribe supports over 99 languages at launch, with more than 25 languages falling under the “excellent accuracy” category, where the word error rate is under 5%. These include English (with an accuracy rate of 97%), French, German, Hindi, Japanese, Portuguese, Spanish, and Vietnamese. The remaining languages are grouped into accuracy tiers by word error rate, ranging from high accuracy (5% to 10% error rate) to moderate accuracy (25% to 50% error rate).
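For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's output, divided by the number of words in the reference. The Python sketch below illustrates that standard definition and the under-5% cutoff quoted above. It is not ElevenLabs' evaluation code; the function names and the simple lowercase-and-split normalization are assumptions made for illustration.

```python
# Illustrative sketch of the standard word error rate (WER) calculation.
# Function names and text normalization are assumptions, not ElevenLabs' code.

EXCELLENT_WER_THRESHOLD = 0.05  # "excellent accuracy" means WER under 5%


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def is_excellent(wer: float) -> bool:
    """True when the error rate falls in the under-5% category."""
    return wer < EXCELLENT_WER_THRESHOLD


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.3f}, excellent: {is_excellent(wer)}")
```

Under this definition, a 97% accuracy figure corresponds roughly to a 3% word error rate, which sits inside the under-5% category.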

According to the company, Scribe outperformed Google’s Gemini 2.0 Flash and OpenAI’s Whisper Large V3 in multiple languages on the FLEURS and Common Voice benchmarks.

While ElevenLabs previously developed a speech-to-text component for its conversational AI agent platform released last year, Scribe marks the company’s first standalone speech recognition model. In a recent conversation with TechCrunch, CEO Mati Staniszewski shared the company’s vision for improving speech recognition models.

“We want to better understand what’s being said in a conversation. We’re focusing on moving beyond just generating content to truly understanding and transcribing speech,” Staniszewski said. “Many people think speech-to-text is already solved, but for many languages, it’s still far from perfect. We believe we can create better models because we have in-house teams to annotate data and provide quick feedback.”

Scribe also features speaker diarization to identify who is speaking, word-level timestamps for accurate subtitles, and automatic tagging of sound events such as audience laughter. Customers can also transcribe video content for subtitles or captions directly in the company’s studio.
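To show why word-level timestamps matter for subtitles, the Python sketch below groups timed words into SubRip (SRT) cues. The input shape (a list of dictionaries with `text`, `start`, and `end` in seconds) is a hypothetical stand-in and not Scribe's documented response format, and the fixed words-per-cue rule is a simplification chosen for illustration.

```python
# Illustrative sketch: turning word-level timestamps into SRT subtitle cues.
# The word dictionaries below are a hypothetical format, not Scribe's API output.


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words each."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_srt_time(chunk[0]["start"])  # first word starts the cue
        end = format_srt_time(chunk[-1]["end"])     # last word ends the cue
        text = " ".join(word["text"] for word in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample_words = [
        {"text": "Welcome", "start": 0.0, "end": 0.4},
        {"text": "to", "start": 0.45, "end": 0.55},
        {"text": "the", "start": 0.6, "end": 0.7},
        {"text": "show", "start": 0.75, "end": 1.1},
    ]
    print(words_to_srt(sample_words, max_words=2))
```

Because each cue takes its start time from its first word and its end time from its last, the captions stay aligned with the audio without any manual timing.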

Currently, Scribe supports only pre-recorded audio, but ElevenLabs plans to release a low-latency, real-time version soon, which will be useful for meeting transcriptions and voice note-taking.

At $0.40 per hour of transcribed audio, Scribe offers a competitive pricing model, though some competitors currently offer lower rates with differing features.