# Telnyx Voice: STT - Full Documentation

> Complete page content for STT (Voice section) of the Telnyx developer docs (https://developers.telnyx.com).
> Root index: https://developers.telnyx.com/llms.txt · Lightweight index for this subsection: https://telnyx-openapi-ng.s3.us-east-1.amazonaws.com/llms/voice/stt.txt

### Overview

> Source: https://developers.telnyx.com/docs/voice/stt/overview.md

Telnyx STT transcribes audio to text in three ways:

- Stream audio over a persistent WebSocket connection. Real-time partial and final transcripts.
- Upload audio files via REST API. OpenAI-compatible endpoint.
- Enable transcription during live voice calls via Call Control or TeXML.

---

### Quickstart

> Source: https://developers.telnyx.com/docs/voice/stt/getting-started.md

## Prerequisites

- A Telnyx account: [sign up here](https://telnyx.com/sign-up)
- An API key: create one in the [portal](https://portal.telnyx.com/#/app/api-keys)

That's it. No other setup required.

Real-time transcription over a persistent connection. Send audio, get partial and final transcripts back as they happen.
**Install:**

```bash
pip install "websockets>=14"
```

**`main.py`:**

```python
import asyncio
import json
import urllib.request

import websockets

API_KEY = "YOUR_TELNYX_API_KEY"
STREAM_URL = "https://kexp-mp3-128.streamguys1.com/kexp128.mp3"


async def transcribe():
    url = (
        "wss://api.telnyx.com/v2/speech-to-text/transcription"
        "?transcription_engine=Deepgram"
        "&model=nova-3"
        "&input_format=mp3"
        "&interim_results=true"
    )
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(url, additional_headers=headers) as ws:

        # Print transcripts as they arrive
        async def listen():
            async for message in ws:
                data = json.loads(message)
                transcript = data.get("transcript", "")
                if not transcript:
                    continue
                prefix = "FINAL" if data.get("is_final") else "partial"
                print(f"[{prefix}] {transcript}")

        listener = asyncio.create_task(listen())

        # Stream audio from KEXP Radio. The HTTP read is blocking, so run
        # it in a worker thread to keep the event loop (and listener) free.
        with urllib.request.urlopen(STREAM_URL) as stream:
            while chunk := await asyncio.to_thread(stream.read, 4096):
                await ws.send(chunk)

        # Flush remaining audio, then wait for the server to send the last
        # transcripts and close the connection
        await ws.send(json.dumps({"type": "CloseStream"}))
        await listener


try:
    asyncio.run(transcribe())
except KeyboardInterrupt:
    pass  # Ctrl-C stops the live stream
```

**Run it:**

```bash
python main.py
```

**Install:**

```bash
npm install ws
```

**`index.js`:**

```javascript
const WebSocket = require("ws");
const https = require("https");

const API_KEY = "YOUR_TELNYX_API_KEY";
const STREAM_URL = "https://kexp-mp3-128.streamguys1.com/kexp128.mp3";

const url = new URL("wss://api.telnyx.com/v2/speech-to-text/transcription");
url.searchParams.set("transcription_engine", "Deepgram");
url.searchParams.set("model", "nova-3");
url.searchParams.set("input_format", "mp3");
url.searchParams.set("interim_results", "true");

const ws = new WebSocket(url.toString(), {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

ws.on("open", () => {
  console.log("Connected. Streaming KEXP Radio...\n");
  // Forward each audio chunk from the radio stream to the socket
  https.get(STREAM_URL, (stream) => {
    stream.on("data", (chunk) => {
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(chunk);
      }
    });
  });
});

ws.on("message", (data) => {
  const msg = JSON.parse(data);
  const transcript = msg.transcript || "";
  if (!transcript) return;
  const prefix = msg.is_final ? "FINAL" : "partial";
  console.log(`[${prefix}] ${transcript}`);
});

ws.on("error", (err) => console.error("Error:", err.message));
```

**Run it:**

```bash
node index.js
```

### Example output

```
Connected. Streaming KEXP Radio...

[partial] the latest news from
[partial] the latest news from the BBC
[FINAL] The latest news from the KEXP Radio.
[partial] tensions continue
[partial] tensions continue to rise in the
[FINAL] Tensions continue to rise in the region as diplomatic talks stall.
```

Upload an audio file and get the full transcript back. The endpoint is OpenAI SDK compatible: swap `base_url` and `api_key` and your existing code works.

**Install:**

```bash
pip install openai
```

**`main.py`:**

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TELNYX_API_KEY",
    base_url="https://api.telnyx.com/v2",
)

# Open the file in a context manager so the handle is closed afterwards
with open("audio.mp3", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio,
    )

print(result.text)
```

**Run it:**

```bash
python main.py
```

**Install:**

```bash
npm install openai
```

**`index.js`:**

```javascript
const OpenAI = require("openai");
const fs = require("fs");

const client = new OpenAI({
  apiKey: "YOUR_TELNYX_API_KEY",
  baseURL: "https://api.telnyx.com/v2",
});

(async () => {
  const result = await client.audio.transcriptions.create({
    model: "openai/whisper-large-v3-turbo",
    file: fs.createReadStream("audio.mp3"),
  });
  console.log(result.text);
})();
```

**Run it:**

```bash
node index.js
```

```bash
curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \
  -H "Authorization: Bearer YOUR_TELNYX_API_KEY" \
  -F model="openai/whisper-large-v3-turbo" \
  -F file=@audio.mp3
```

Or transcribe
from a URL (no file upload needed): ```bash curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \ -H "Authorization: Bearer YOUR_TELNYX_API_KEY" \ -F model="openai/whisper-large-v3-turbo" \ -F file_url="https://example.com/audio.mp3" ``` ### Example response ```json { "text": "The latest news from the KEXP Radio. Tensions continue to rise in the region as diplomatic talks stall." } ``` For segment- or word-level timestamps, use `model="deepgram/nova-3"` with `response_format=verbose_json`. The Whisper models (`openai/whisper-large-v3-turbo`, `openai/whisper-tiny`) return text only. ## What's next - [Models & Engines](/docs/voice/stt/models) — pick the right engine and model for your use case - [WebSocket Parameters](/docs/voice/stt/websocket-streaming/parameters) — interim results, keyword boosting, endpointing, redaction - [REST API Parameters](/docs/voice/stt/rest-api/parameters) — all request body fields for file-based transcription - [In-Call Transcription](/docs/voice/stt/in-call-transcription) — enable STT during live voice calls --- ### Models > Source: https://developers.telnyx.com/docs/voice/stt/models.md ## Comparison | Engine | Model (WebSocket) | Model (REST) | Latency | Languages | Best for | |--------|-------------------|--------------|---------|-----------|----------| | **Deepgram** | `nova-3` | `deepgram/nova-3` | Low | 40+ ([reference](https://developers.deepgram.com/docs/models-languages-overview)) | **Recommended.** Highest English accuracy, diarization, word timestamps | | **Deepgram** | `nova-2` | `deepgram/nova-2` | Low | 40+ | Legacy — use nova-3 unless you have a specific reason | | **Deepgram** | `flux` | — | **Lowest** | 10 languages | Voice agents — built-in end-of-turn detection (WebSocket only) | | **Telnyx** | `openai/whisper-large-v3-turbo` | `openai/whisper-large-v3-turbo` | Medium | 50+ ([reference](https://github.com/openai/whisper#available-models-and-languages)) | Multilingual transcription | | **Telnyx** | 
`openai/whisper-tiny` | `openai/whisper-tiny` | Low | 50+ | Lightweight, on-network | | **Google** | `latest_long` | — | Medium | 125+ ([reference](https://cloud.google.com/speech-to-text/docs/speech-to-text-supported-languages)) | Long-form multilingual audio (WebSocket only) | | **Azure** | `azure/fast` | — | Medium | 100+ ([reference](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt)) | Broad language and accent coverage (WebSocket only) | | **xAI** | `xai/grok-stt` | — | Low | 25 languages | Grok STT for real-time transcription (WebSocket and Voice API only) | | **AssemblyAI** | `assemblyai/universal-streaming` | — | Low | 6 languages | Universal-Streaming for voice agents with low latency and turn detection (WebSocket and Voice API only) | | **Speechmatics** | `speechmatics/standard` | — | Low | 17+ languages | High-accuracy real-time transcription with bilingual and multilingual packs (WebSocket and Voice API only) | | **Soniox** | `soniox/stt-rt-v4` | — | Low | Auto-detect | Real-time transcription with interim results and endpointing (WebSocket and Voice API only) | ## Engine Details The default WebSocket engine. Best English accuracy and the richest feature set. For REST, you must explicitly set `model="deepgram/nova-3"` — the REST default is `openai/whisper-large-v3-turbo`. **Models:** - **`nova-3`** — Latest and most accurate. Supports diarization, word-level timestamps, smart formatting, numerals, and punctuation via [`model_config`](/docs/voice/stt/rest-api/parameters/model-config). Use this unless you need the lowest possible latency. - **`nova-2`** — Previous generation. Still supported but nova-3 is better in all benchmarks. - **`flux`** — Purpose-built for voice agents. Lowest latency with built-in [end-of-turn detection](/docs/voice/stt/websocket-streaming/parameters/end-of-turn) — tells you when the speaker has finished so your agent can respond. WebSocket only. 
**Languages:** 40+ languages across Deepgram models. Nova-3 supports `multi` mode (10 languages with code-switching). Flux supports English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. See [Deepgram languages](https://developers.deepgram.com/docs/models-languages-overview). Telnyx runs Whisper models on-network. **Models:** - **`openai/whisper-large-v3-turbo`** — Multilingual (50+ languages, auto-detected). Returns text only — no timestamps regardless of response format. - **`openai/whisper-tiny`** — Lightweight, lowest resource usage. **Languages:** 50+ languages, auto-detected. Use `auto_detect` to skip the language hint. See the [Whisper language list](https://github.com/openai/whisper#available-models-and-languages). **Limitations:** No diarization. No word-level timestamps. Google Cloud Speech-to-Text integration. **Model:** `latest_long` **Languages:** 125+ languages/locales. See [Google Cloud STT languages](https://cloud.google.com/speech-to-text/docs/speech-to-text-supported-languages). Microsoft Azure Speech Services integration. **Model:** `azure/fast` **Languages:** 100+ languages/locales with strong accent and dialect coverage. See [Azure Speech languages](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt). xAI Grok STT integration for real-time transcription. **Model:** `xai/grok-stt` **Languages:** 25 languages, including Arabic, English, French, German, Hindi, Japanese, Korean, Portuguese, Spanish, and Vietnamese. AssemblyAI Universal-Streaming integration for real-time voice agent transcription. **Model:** `assemblyai/universal-streaming` **Languages:** English, Spanish, German, French, Portuguese, and Italian. Speechmatics real-time transcription integration with high accuracy and multilingual support including bilingual packs. 
**Model:** `speechmatics/standard` **Languages:** English, Spanish, plus bilingual/multilingual packs including Arabic–English, Mandarin–English, English–Malay, English–Tamil, Tagalog, and Spanish–English bilingual. Also supports Basque, Galician, Irish, Maltese, Mongolian, Swahili, Uyghur, and Welsh. **Features:** Supports interim results (partial transcripts) and graceful `CloseStream` shutdown. Soniox real-time transcription integration with automatic language detection. **Model:** `soniox/stt-rt-v4` **Languages:** Automatic detection — no language hint required. **Features:** Supports interim results (partial transcripts), endpointing, and graceful `CloseStream` shutdown. ## How to Choose **Need the highest accuracy for English?** → Deepgram `nova-3` — best WER (word error rate) across all English variants. **Building a voice agent that needs to know when the user stopped talking?** → Deepgram `flux` — lowest latency with built-in end-of-turn detection. **Need to transcribe files in 50+ languages?** → Telnyx `openai/whisper-large-v3-turbo` via REST API. **Need diarization (who said what)?** → Deepgram `nova-3` with `model_config.diarize: true`. **Need broad accent/dialect support?** → Azure `azure/fast` — strong coverage across regional accents. **Need Grok STT for real-time calls?** → xAI `xai/grok-stt` via WebSocket or Voice API. **Need low-latency streaming for voice agents?** → AssemblyAI `assemblyai/universal-streaming` via WebSocket or Voice API. **Need high-accuracy multilingual with bilingual packs?** → Speechmatics `speechmatics/standard` via WebSocket or Voice API. **Need real-time transcription with automatic language detection?** → Soniox `soniox/stt-rt-v4` via WebSocket or Voice API. 
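The decision list above can also be kept in code as a small lookup table. This is only a sketch: the `RECOMMENDATIONS` keys and the `recommend` helper are hypothetical names chosen for illustration and are not part of any Telnyx API, while the engine and model values are copied from the list.

```python
# Hypothetical lookup mirroring the "How to Choose" list.
# Keys are illustrative labels, not API values; each value is the
# (engine, model) pair recommended above.
RECOMMENDATIONS = {
    "english_accuracy": ("Deepgram", "nova-3"),
    "voice_agent_turn_detection": ("Deepgram", "flux"),
    "multilingual_files": ("Telnyx", "openai/whisper-large-v3-turbo"),
    "diarization": ("Deepgram", "nova-3"),
    "accent_coverage": ("Azure", "azure/fast"),
    "grok": ("xAI", "xai/grok-stt"),
    "low_latency_agent": ("AssemblyAI", "assemblyai/universal-streaming"),
    "bilingual_packs": ("Speechmatics", "speechmatics/standard"),
    "auto_language_detection": ("Soniox", "soniox/stt-rt-v4"),
}


def recommend(need: str) -> tuple[str, str]:
    """Return the (engine, model) pair suggested for a given need."""
    try:
        return RECOMMENDATIONS[need]
    except KeyError:
        raise ValueError(f"unknown need: {need!r}") from None


engine, model = recommend("voice_agent_turn_detection")
print(engine, model)
```

Keeping the mapping in one place makes it easy to switch engines later without touching connection code. Note that the multilingual file recommendation applies to the REST API rather than the WebSocket endpoint.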
## Specifying the Engine and Model **WebSocket** — set via query parameters: ``` wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-3 ``` **REST API** — set via the `model` body parameter: ```bash curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \ -H "Authorization: Bearer YOUR_TELNYX_API_KEY" \ -F model="deepgram/nova-3" \ -F file=@audio.mp3 ``` --- ### Migration > Source: https://developers.telnyx.com/docs/voice/stt/migration.md Most migrations require changing 2–3 lines of code. Pick your current provider below. ### WebSocket ```diff - wss://api.deepgram.com/v1/listen?model=nova-2&language=en - Authorization: Token DEEPGRAM_KEY + wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-2&language=en + Authorization: Bearer TELNYX_KEY ``` The wire protocol is the same — send binary audio frames, receive JSON transcripts. **What changes** | | Deepgram | Telnyx | |---|---|---| | Auth scheme | `Token` | `Bearer` | | Engine | implicit | `transcription_engine=Deepgram` | | Model name | `nova-2`, `nova-3`, `flux` | `nova-2`, `nova-3`, `flux` | **Response field mapping** | Deepgram | → Telnyx | |---|---| | `results.channels[0].alternatives[0].transcript` | `transcript` | | `is_final` | `is_final` | | `speech_final` | `is_final` | ### REST ```bash # Before (Deepgram) curl -X POST https://api.deepgram.com/v1/listen?model=nova-2 \ -H "Authorization: Token DEEPGRAM_KEY" \ -H "Content-Type: audio/wav" \ --data-binary @audio.wav # After (Telnyx) curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \ -H "Authorization: Bearer TELNYX_KEY" \ -F model="deepgram/nova-3" \ -F file=@audio.wav ``` **What changes** | | Deepgram | Telnyx | |---|---|---| | Auth scheme | `Token` | `Bearer` | | Body | raw binary | `multipart/form-data` | | Model name | `nova-2` | `deepgram/nova-3` | **Response field mapping** | Deepgram | → Telnyx | |---|---| | 
`results.channels[0].alternatives[0].transcript` | `text` | | `results.channels[0].alternatives[0].words` | available via `model_config.diarize` / `model_config.smart_format` | ### WebSocket ```diff - wss://api.elevenlabs.io/v1/speech-to-text/realtime?model_id=scribe_v1&language_code=en - xi-api-key: ELEVENLABS_KEY + wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-3&language=en + Authorization: Bearer TELNYX_KEY ``` The wire protocol is the same — send binary audio frames, receive JSON transcripts. **Config mapping** | ElevenLabs | Telnyx query parameter | |---|---| | `model_id` | `transcription_engine` + `model` | | `language_code` | `language` | | `keywords` | `keyterm` (Nova-3/Flux) — see [Keyword Boosting](/docs/voice/stt/websocket-streaming/parameters/keyword-boosting) | **Response field mapping** | ElevenLabs | → Telnyx | |---|---| | `text` | `transcript` | | `is_final` | `is_final` | ### REST ```bash # Before (ElevenLabs) curl -X POST https://api.elevenlabs.io/v1/speech-to-text \ -H "xi-api-key: ELEVENLABS_KEY" \ -F "audio=@audio.mp3" \ -F "model_id=scribe_v1" # After (Telnyx) curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \ -H "Authorization: Bearer TELNYX_KEY" \ -F model="openai/whisper-large-v3-turbo" \ -F file=@audio.mp3 ``` **What changes** | | ElevenLabs | Telnyx | |---|---|---| | Auth header | `xi-api-key` | `Authorization: Bearer` | | File field | `audio` | `file` | | Model field | `model_id` | `model` | **Response shape:** identical. Telnyx returns the same `{"text": "..."}` shape — no parsing changes needed. ### REST The easiest migration — Telnyx REST is OpenAI SDK compatible. Change the API key and base URL, everything else stays the same. 
```python Python from openai import OpenAI client = OpenAI( - api_key="sk-OPENAI_KEY", + api_key="YOUR_TELNYX_API_KEY", + base_url="https://api.telnyx.com/v2", ) result = client.audio.transcriptions.create( model="openai/whisper-large-v3-turbo", file=open("audio.mp3", "rb"), ) ``` ```javascript JavaScript import OpenAI from "openai"; import fs from "fs"; const client = new OpenAI({ - apiKey: "sk-OPENAI_KEY", + apiKey: "YOUR_TELNYX_API_KEY", + baseURL: "https://api.telnyx.com/v2", }); const result = await client.audio.transcriptions.create({ model: "openai/whisper-large-v3-turbo", file: fs.createReadStream("audio.mp3"), }); ``` **What changes** | | OpenAI | Telnyx | |---|---|---| | `api_key` | `sk-...` | Telnyx API key | | `base_url` | (default) | `https://api.telnyx.com/v2` | | Method | `client.audio.transcriptions.create(...)` | unchanged | | Response | `result.text` | unchanged | **Response shape:** Telnyx returns the same `{"text": "..."}` shape — no parsing changes needed for the default `json` response format. **Note on `verbose_json`:** OpenAI's Whisper API returns segments with timestamps when you set `response_format=verbose_json`. Telnyx's Whisper models (`openai/whisper-large-v3-turbo`, `openai/whisper-tiny`) return text only — no segments. If you need timestamps, switch to `model="deepgram/nova-3"` which supports segment- and word-level timestamps via [`model_config`](/docs/voice/stt/rest-api/parameters/model-config). ### WebSocket Google uses gRPC with protobuf. Telnyx uses WebSocket with JSON — no protobuf compilation, no service account credentials. ```python # Before (Google Cloud) from google.cloud import speech client = speech.SpeechClient() config = speech.RecognitionConfig( encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code="en-US", enable_automatic_punctuation=True, ) streaming_config = speech.StreamingRecognitionConfig( config=config, interim_results=True, ) # ... 
gRPC streaming setup ``` ``` # After (Telnyx) wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Google&language=en-US&interim_results=true&input_format=linear16&sample_rate=16000 Authorization: Bearer TELNYX_KEY ``` See the [Quickstart](/docs/voice/stt/getting-started) for full WebSocket code. **Config mapping** | Google gRPC | Telnyx query parameter | |---|---| | `language_code` | `language` | | `encoding` | `input_format` | | `sample_rate_hertz` | `sample_rate` | | `interim_results` | `interim_results` | | `enable_automatic_punctuation` | enabled by default | **What you drop:** protobuf definitions, gRPC client setup, service account credentials. **Response field mapping** | Google Cloud | → Telnyx | |---|---| | `results[0].alternatives[0].transcript` | `transcript` | | `results[0].is_final` | `is_final` | ### WebSocket AWS Transcribe Streaming uses HTTP/2 with event streams via the [`amazon-transcribe-streaming-sdk`](https://github.com/awslabs/amazon-transcribe-streaming-sdk). Telnyx uses a plain WebSocket — no AWS SDK, no SigV4 signing, no IAM credentials. ```python # Before (AWS) — amazon-transcribe-streaming-sdk from amazon_transcribe.client import TranscribeStreamingClient from amazon_transcribe.handlers import TranscriptResultStreamHandler class Handler(TranscriptResultStreamHandler): async def handle_transcript_event(self, event): for result in event.transcript.results: for alt in result.alternatives: print(alt.transcript) client = TranscribeStreamingClient(region="us-east-1") stream = await client.start_stream_transcription( language_code="en-US", media_sample_rate_hz=16000, media_encoding="pcm", ) # Send audio chunks via stream.input_stream.send_audio_event(...) 
# Receive results via Handler ``` ``` # After (Telnyx) — plain WebSocket wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-3&language=en-US&input_format=linear16&sample_rate=16000 Authorization: Bearer TELNYX_KEY ``` Send binary audio frames, receive JSON transcripts. No AWS SDK, no SigV4 signing, no IAM. See the [Quickstart](/docs/voice/stt/getting-started) for full code. **Config mapping** | AWS Transcribe Streaming | Telnyx query parameter | |---|---| | `language_code` | `language` | | `media_encoding` | `input_format` | | `media_sample_rate_hz` | `sample_rate` | | `enable_partial_results_stabilization` | `interim_results` | | `vocabulary_name` | `keyterm` (Nova-3/Flux) — see [Keyword Boosting](/docs/voice/stt/websocket-streaming/parameters/keyword-boosting) | **Response field mapping** | AWS Transcribe Streaming | → Telnyx | |---|---| | `transcript.results[].alternatives[].transcript` | `transcript` | | `transcript.results[].is_partial` | `is_final` (inverted) | ### WebSocket The [Azure Speech SDK](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-sdk) wraps a region-specific WebSocket with SDK abstractions and event handlers. Telnyx uses a plain WebSocket — no SDK install, no region routing. 
```python # Before (Azure Speech SDK) import azure.cognitiveservices.speech as speechsdk speech_config = speechsdk.SpeechConfig( subscription="AZURE_KEY", region="eastus", ) speech_config.speech_recognition_language = "en-US" audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True) recognizer = speechsdk.SpeechRecognizer( speech_config=speech_config, audio_config=audio_config, ) recognizer.recognizing.connect(lambda evt: print(f"partial: {evt.result.text}")) recognizer.recognized.connect(lambda evt: print(f"FINAL: {evt.result.text}")) recognizer.start_continuous_recognition() ``` ``` # After (Telnyx) — plain WebSocket wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Azure&language=en-US&interim_results=true Authorization: Bearer TELNYX_KEY ``` Send binary audio frames, receive JSON transcripts. No SDK install, no region routing. See the [Quickstart](/docs/voice/stt/getting-started) for full code. **Config mapping** | Azure Speech SDK | Telnyx query parameter | |---|---| | `speech_recognition_language` | `language` | | `recognizing` event | partial result (`is_final: false`) | | `recognized` event | final result (`is_final: true`) | | `region` | not needed — single global endpoint | **Response field mapping** | Azure Speech SDK | → Telnyx | |---|---| | `evt.result.text` (recognizing/recognized) | `transcript` | | `recognizing` event | `is_final: false` | | `recognized` event | `is_final: true` | **What you drop:** Azure Speech SDK install, region routing, event-handler boilerplate, subscription key management. 
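If existing code is organized around Azure's `recognizing` and `recognized` callbacks, one way to keep that shape is a small dispatcher over the Telnyx JSON messages. The sketch below assumes only the `transcript` and `is_final` fields from the mapping table; the `dispatch` function and its callback parameter names are hypothetical.

```python
import json
from typing import Callable


def dispatch(
    raw: str,
    on_recognizing: Callable[[str], None],
    on_recognized: Callable[[str], None],
) -> None:
    """Route one Telnyx transcript message to Azure-style callbacks."""
    msg = json.loads(raw)
    text = msg.get("transcript", "")
    if not text:
        return  # skip empty or non-transcript messages
    if msg.get("is_final"):
        on_recognized(text)   # was: recognizer.recognized
    else:
        on_recognizing(text)  # was: recognizer.recognizing


# Usage: feed each text frame received from the WebSocket to dispatch()
dispatch(
    '{"transcript": "Hello", "is_final": false}',
    lambda t: print(f"partial: {t}"),
    lambda t: print(f"FINAL: {t}"),
)
```

The callbacks receive plain strings, so the existing handler bodies can usually be reused by reading the text argument instead of `evt.result.text`.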
### REST ```bash # Before (Azure) curl -X POST \ "https://eastus.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US" \ -H "Ocp-Apim-Subscription-Key: AZURE_KEY" \ -H "Content-Type: audio/wav" \ --data-binary @audio.wav # After (Telnyx) curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \ -H "Authorization: Bearer TELNYX_KEY" \ -F model="openai/whisper-large-v3-turbo" \ -F file=@audio.wav ``` **What changes** | | Azure | Telnyx | |---|---|---| | Auth | `Ocp-Apim-Subscription-Key` header | `Authorization: Bearer` | | URL | region-specific | single global endpoint | | Body | raw binary | `multipart/form-data` | | Language | required query param | auto-detected or optional | **Response field mapping** | Azure | → Telnyx | |---|---| | `DisplayText` | `text` | | `NBest[0].Lexical` | `text` | --- ## WebSocket ### Lifecycle > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming.md Real-time speech-to-text over a persistent WebSocket connection. Send audio, receive transcripts. ## Endpoint ``` wss://api.telnyx.com/v2/speech-to-text/transcription ``` ## Connection Lifecycle ### 1. Handshake The connection starts as an HTTP GET with `Upgrade: websocket`. The server responds with `101 Switching Protocols`, then the connection upgrades to WebSocket frames. ``` GET /v2/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-3&input_format=wav HTTP/1.1 Host: api.telnyx.com Upgrade: websocket Connection: Upgrade Authorization: Bearer YOUR_API_KEY Sec-WebSocket-Version: 13 Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ== ``` ``` HTTP/1.1 101 Switching Protocols Upgrade: websocket Connection: Upgrade Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo= ``` You can also connect directly to the WebSocket endpoint without an HTTP upgrade: ``` wss://transcription.telnyx.com/public/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-3&input_format=wav ``` The same query parameters apply. 
Once connected, the message protocol is identical. All configuration is set at connect time via query parameters — engine, model, format, language, options. Cannot be changed mid-session. See [Parameters](/docs/voice/stt/websocket-streaming/parameters) for the full list. Invalid parameters return a JSON error and the connection closes. ### 2. Streaming Once connected, audio and transcription flow concurrently — no request/response pairing. **Client → Server** | Frame type | Content | |-----------|---------| | binary | Audio data — raw bytes, chunked. No base64 or JSON wrapping. | | text | `{"type": "Finalize"}` — flush buffer, force final transcript (Deepgram only) | | text | `{"type": "CloseStream"}` — flush remaining transcription and close the stream gracefully (Deepgram, Speechmatics, Soniox) | | text | `{"type": "KeepAlive"}` — reset idle timeout (Deepgram only) | **Server → Client** | Message | Description | |---------|-------------| | Transcription result | `{"transcript": "...", "is_final": true, "confidence": 0.98}` | | Utterance end | `{"transcript": "", "is_final": true, "utterance_end": true}` (Deepgram) | | Error | `{"errors": [...]}` — connection closes after | See [Messages](/docs/voice/stt/websocket-streaming/responses) for the complete wire protocol reference. ``` Client → Server binary: audio chunk Client → Server binary: audio chunk Client ← Server {"transcript":"Hello","is_final":false} Client → Server binary: audio chunk Client ← Server {"transcript":"Hello, how are you?","is_final":true} ``` ### 3. Teardown Send `{"type": "CloseStream"}` (Deepgram, Speechmatics, and Soniox) to flush remaining audio and close gracefully. The server finishes processing, sends any remaining transcripts, then closes the WebSocket. ``` Client → Server {"type":"CloseStream"} Client ← Server final transcript Client ← Server [connection closed] ``` For other engines, close the WebSocket connection directly. 
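The control frames and server messages described in this lifecycle can be handled with two small helpers. This is a minimal sketch under the assumptions stated in the tables above; `closing_frame` and `classify` are hypothetical names, and the engine set is taken from the teardown description.

```python
import json
from typing import Optional

# Engines the teardown section lists as supporting a graceful CloseStream.
CLOSE_STREAM_ENGINES = {"Deepgram", "Speechmatics", "Soniox"}


def closing_frame(engine: str) -> Optional[str]:
    """Return the text frame to send before closing, or None."""
    if engine in CLOSE_STREAM_ENGINES:
        return json.dumps({"type": "CloseStream"})
    return None  # other engines: close the WebSocket directly


def classify(raw: str) -> str:
    """Label a server message: 'error', 'utterance_end', 'final' or 'partial'."""
    msg = json.loads(raw)
    if "errors" in msg:
        return "error"  # the connection closes after an error
    if msg.get("utterance_end"):
        return "utterance_end"
    return "final" if msg.get("is_final") else "partial"


print(closing_frame("Deepgram"))
print(classify('{"transcript": "Hello", "is_final": false}'))
```

Checking `utterance_end` before `is_final` matters because the utterance-end message also carries `is_final: true` with an empty transcript.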
Dropping the connection without `CloseStream` works but may lose buffered audio on Deepgram, Speechmatics, and Soniox. See [Examples](/docs/voice/stt/websocket-streaming/examples) for complete code samples.

---

### Overview

> Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters.md

All parameters are passed as query string values on the WebSocket URL. They are locked at connection time and cannot be changed mid-session.

- `transcription_engine`: one of `Deepgram`, `Telnyx`, `Google`, `Azure`, `xAI`, `AssemblyAI`, `Speechmatics`, `Soniox`.
- `model`: See [Engines & Models](/docs/voice/stt/websocket-streaming/parameters/engines-and-models).
- `input_format`: See [Audio Formats](/docs/voice/stt/websocket-streaming/parameters/audio-formats).
- `sample_rate`: Sample rate in Hz. Required for raw encodings (`linear16`, `mulaw`, `alaw`). Ignored for container formats. Invalid value returns [error 40005](/docs/voice/stt/websocket-streaming/errors).
- `language`: Language code or auto-detect; behavior varies by engine. See [Language](/docs/voice/stt/websocket-streaming/parameters/language) for details.
- `interim_results`: `"true"` for partial transcripts. See [Interim Results](/docs/voice/stt/websocket-streaming/parameters/interim-results) for details.
- Endpointing: Silence detection in ms for engines that support endpointing. `"false"` to disable. See [Endpointing](/docs/voice/stt/websocket-streaming/parameters/endpointing) for details.
- Redaction: PII redaction. Deepgram only. See [Redaction](/docs/voice/stt/websocket-streaming/parameters/redaction) for details.
- `keyterm`: Comma-separated boost terms. Deepgram Nova-3/Flux. See [Keyword Boosting](/docs/voice/stt/websocket-streaming/parameters/keyword-boosting) for details.
- Legacy keyword boosting with intensifiers. Deepgram Nova. See [Keyword Boosting](/docs/voice/stt/websocket-streaming/parameters/keyword-boosting) for details.
- Region: Azure Speech Services region.
- `eot_threshold`: Flux only. Confidence threshold for `EndOfTurn` events. See [End-of-Turn Detection](/docs/voice/stt/websocket-streaming/parameters/end-of-turn) for details.
- Flux only. Max silence before forcing `EndOfTurn`.
See [End-of-Turn Detection](/docs/voice/stt/websocket-streaming/parameters/end-of-turn) for details. Flux only. Speculative `EagerEndOfTurn` threshold. Disabled by default. See [End-of-Turn Detection](/docs/voice/stt/websocket-streaming/parameters/end-of-turn) for details. ## Example ``` wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Deepgram&model=nova-3&input_format=wav&language=en-US&interim_results=true ``` --- ### Audio Formats > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/audio-formats.md Set via the `input_format` query parameter. Audio is sent as binary WebSocket frames -- chunked bytes, no base64, no JSON wrapping. Container formats (mp3, webm, etc.) are self-describing: the server demuxes the byte stream and extracts encoding/sample rate from headers. Raw formats have no metadata, so you must set `sample_rate` explicitly. Works for both real-time capture (microphone, MediaRecorder, telephony bridge) and file streaming (read a file in chunks, push through the socket). ## Browser Capture Output from `MediaRecorder` or similar browser APIs. Container headers carry sample rate. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?input_format=webm_opus ``` | Format | Sample rate | Notes | |--------|------------|-------| | `webm` | from header | WebM container | | `webm_opus` | from header | WebM + Opus. Valid: 8000–48000. Alias: `webm-opus` | | `ogg_opus` | from header | Ogg + Opus. Valid: 8000–48000. Alias: `ogg-opus` | | `ogg` | from header | Ogg container (Vorbis or other) | ## Telephony Codecs from voice networks. Raw frames, `sample_rate` required. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?input_format=mulaw&sample_rate=8000 ``` | Format | Sample rate | Notes | |--------|------------|-------| | `mulaw` | any | G.711 µ-law. North America. Default: 8000 Hz. | | `alaw` | any | G.711 A-law. EU/international. Default: 8000 Hz. | | `g729` | 8000 | G.729. Fixed. 
| | `amr_nb` | 8000 | AMR narrowband. Fixed. Alias: `amr-nb` | | `amr_wb` | 16000 | AMR wideband. Fixed. Alias: `amr-wb` | | `speex` | 8000, 16000, 32000 | Google: 16000 only. | Invalid sample rate returns [error 40005](/docs/voice/stt/websocket-streaming/errors). ## Raw PCM Uncompressed audio from microphones, processing pipelines, or SDKs. `sample_rate` required. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?input_format=linear16&sample_rate=16000 ``` | Format | Sample rate | Notes | |--------|------------|-------| | `linear16` | any | 16-bit signed PCM, little-endian (s16le). Default: 16000 Hz. | | `linear32` | any | 32-bit float PCM, little-endian (f32le). Default: 16000 Hz. | | `opus` | 8000, 12000, 16000, 24000, 48000 | Raw Opus frames, no container. Deepgram also: 44100. | Invalid sample rate returns [error 40005](/docs/voice/stt/websocket-streaming/errors). ## Recorded File Pre-recorded files read in chunks and streamed through the socket. Container headers carry sample rate. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?input_format=mp3 ``` | Format | Sample rate | Notes | |--------|------------|-------| | `mp3` | from header | Default for most engines | | `wav` | from header | Uncompressed. Default for Flux model. | | `flac` | from header | Lossless compression | ## Engine Compatibility Unsupported format/engine combination returns [error 40002](/docs/voice/stt/websocket-streaming/errors). Unsupported Flux format returns [error 40006](/docs/voice/stt/websocket-streaming/errors). Deepgram has three model generations with different format support. Flux is the most restrictive — it drops `mp3`, `flac`, `webm_opus`, `amr_nb`, `amr_wb`, `g729`, and `speex` compared to Nova. 
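The Flux restriction just described can be checked client-side before opening a connection. The sketch below is an assumption-laden illustration: the Nova format set is read off the compatibility table in this section, the dropped set is the list in the sentence above, and `deepgram_supports` is a hypothetical helper, not a Telnyx API.

```python
# Formats the compatibility table marks as supported by Deepgram Nova.
NOVA_FORMATS = {
    "mp3", "wav", "webm", "ogg", "flac", "ogg_opus", "webm_opus",
    "linear16", "linear32", "mulaw", "alaw", "opus",
    "amr_nb", "amr_wb", "g729", "speex",
}

# Formats the text says Flux drops compared to Nova.
FLUX_DROPPED = {"mp3", "flac", "webm_opus", "amr_nb", "amr_wb", "g729", "speex"}

FLUX_FORMATS = NOVA_FORMATS - FLUX_DROPPED


def deepgram_supports(model: str, input_format: str) -> bool:
    """Return True if the Deepgram model accepts the given input_format."""
    allowed = FLUX_FORMATS if model == "flux" else NOVA_FORMATS
    return input_format in allowed


print(sorted(FLUX_FORMATS))
print(deepgram_supports("flux", "mp3"))
```

A local check like this only avoids an obvious round trip; the server remains the source of truth and still returns error 40006 for an unsupported Flux format.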
| Format | Deepgram Nova | Deepgram Flux | Telnyx | Google | Azure | Speechmatics | Soniox | |--------|:------------:|:------------:|:------:|:------:|:-----:|:------------:|:------:| | mp3 | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | wav | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | webm | ✓ | ✓ | | | | | ✓ | | ogg | ✓ | ✓ | | | | ✓ | ✓ | | flac | ✓ | | | ✓ | | ✓ | ✓ | | ogg_opus | ✓ | ✓ | | ✓ | | | | | webm_opus | ✓ | | | ✓ | | | | | linear16 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | linear32 | ✓ | ✓ | ✓ | | | ✓ | ✓ | | mulaw | ✓ | ✓ | | ✓ | | ✓ | ✓ | | alaw | ✓ | ✓ | | | | | ✓ | | opus | ✓ | ✓ | | | | | | | amr_nb | ✓ | | | ✓ | | | | | amr_wb | ✓ | | | ✓ | | | | | g729 | ✓ | | | | | | | | speex | ✓ | | | ✓ | | | | Universal formats (all engines and models): `wav`, `linear16`. --- ### Engines & Models > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/engines-and-models.md Set via `transcription_engine` and `model` query parameters. ## Engines | Engine | Default model | Other models | Notes | |--------|---------------|-------------|-------| | **Deepgram** | `nova-3` | `nova-2`, `flux` | Default engine. Broadest format support. | | **Telnyx** | `openai/whisper-tiny` | — | On-network, lightweight | | **Google** | `latest_long` | — | Multilingual, long-form | | **Azure** | `azure/fast` | — | Broad language/accent coverage | | **xAI** | `xai/grok-stt` | — | Grok STT for real-time transcription | | **AssemblyAI** | `assemblyai/universal-streaming` | — | Universal-Streaming for low-latency voice agents | | **Speechmatics** | `speechmatics/standard` | — | High-accuracy real-time transcription with multilingual and bilingual packs | | **Soniox** | `soniox/stt-rt-v4` | — | Real-time transcription with automatic language detection | ## Flux Model Deepgram's lowest-latency model with built-in [end-of-turn detection](/docs/voice/stt/websocket-streaming/parameters/end-of-turn). Designed for real-time voice agents. 
See [Audio Formats](/docs/voice/stt/websocket-streaming/parameters/audio-formats) for supported formats. --- ### End-of-Turn Detection > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/end-of-turn.md **Deepgram Flux only.** These parameters return a 400 error on non-Flux models. Flux uses a confidence-based system to decide when a speaker has finished their turn. **`eot_threshold`**: Confidence threshold (`0.5`–`0.9`) for triggering an `EndOfTurn` event. Higher values require more certainty that the speaker is done: fewer false positives but slightly more latency. Lower values respond faster but may cut speakers off mid-thought. **`eager_eot_threshold`**: Confidence threshold (`0.3`–`0.9`) for triggering an early `EagerEndOfTurn` event. Not set by default; setting it enables eager mode. When the event fires, your agent can start generating a response speculatively. If the speaker resumes, a `TurnResumed` event cancels it. Must be ≤ `eot_threshold`. Lower values mean earlier triggers and more false starts. Typical range: `0.3`–`0.5` for ~150–250 ms latency savings at the cost of ~50–70% more LLM calls. **`eot_timeout_ms`**: Maximum silence in ms (`500`–`10000`) before forcing `EndOfTurn` regardless of confidence. The timer resets when speech resumes. Increase for speakers who pause frequently; decrease for rapid-fire Q&A. 
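The ranges above can be checked on the client before opening a socket. The following is a minimal sketch, not an official helper: `build_flux_url` is a hypothetical function name, the default values simply mirror the "Default" profile shown in this section, and `input_format=wav` is chosen only because it is Flux's default format.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.telnyx.com/v2/speech-to-text/transcription"


def build_flux_url(eot_threshold=0.7, eot_timeout_ms=5000, eager_eot_threshold=None):
    """Validate end-of-turn settings and return a Flux streaming URL.

    Hypothetical helper: the ranges come from this page, the function does not
    exist in any Telnyx SDK.
    """
    if not 0.5 <= eot_threshold <= 0.9:
        raise ValueError("eot_threshold must be between 0.5 and 0.9")
    if not 500 <= eot_timeout_ms <= 10000:
        raise ValueError("eot_timeout_ms must be between 500 and 10000")

    params = {
        "transcription_engine": "Deepgram",
        "model": "flux",
        "input_format": "wav",
        "eot_threshold": eot_threshold,
        "eot_timeout_ms": eot_timeout_ms,
    }

    # Eager mode is enabled only when a threshold is supplied.
    if eager_eot_threshold is not None:
        if not 0.3 <= eager_eot_threshold <= 0.9:
            raise ValueError("eager_eot_threshold must be between 0.3 and 0.9")
        if eager_eot_threshold > eot_threshold:
            raise ValueError("eager_eot_threshold must be <= eot_threshold")
        params["eager_eot_threshold"] = eager_eot_threshold

    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_flux_url(eager_eot_threshold=0.4))
```

Failing early on the client avoids a round trip that would otherwise end in a 400 response; the server remains the source of truth for what is accepted.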
## Event Flow Without eager mode (`eot_threshold` only): ``` Speech → silence → confidence ≥ eot_threshold → EndOfTurn Speech → silence → timeout (eot_timeout_ms) → EndOfTurn ``` With eager mode (`eager_eot_threshold` set): ``` Speech → silence → confidence ≥ eager_eot_threshold → EagerEndOfTurn → speaker stays silent → confidence ≥ eot_threshold → EndOfTurn → speaker resumes → TurnResumed (cancel speculative work) ``` ## Configuration Profiles **Default** — balanced for general use: ``` ?eot_threshold=0.7&eot_timeout_ms=5000 ``` **Low-latency** — fast response, more false starts: ``` ?eager_eot_threshold=0.4&eot_threshold=0.7&eot_timeout_ms=6000 ``` **High-reliability** — fewer interruptions, more latency: ``` ?eot_threshold=0.85&eot_timeout_ms=8000 ``` --- ### Language > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/language.md BCP-47 language code. Default: `en-US`. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?language=es ``` ## Auto-Detection Pass `multi` to enable automatic language detection. The aliases `auto` and `auto_detect` are silently mapped to `multi`. ``` ?language=multi ?language=auto # → multi ?language=auto_detect # → multi ``` ## Engine Support | Engine | Behavior | |--------|----------| | Deepgram | BCP-47 codes. `multi` for multi-language mode. | | Telnyx | Whisper-based. `auto_detect` disables language hint entirely. | | Google | BCP-47 codes. | | Azure | BCP-47 codes. | | xAI | Language codes such as `en`, `fr`, `de`, and `ja`. | | AssemblyAI | Automatic multilingual detection and code switching across supported languages. | | Speechmatics | Language codes such as `en`, `es`. Bilingual packs use Telnyx shorthand (e.g. `ar_en`) — mapped internally. Does not support `auto`; defaults to `en` if unrecognized. | | Soniox | Automatic language detection. The `language` parameter is ignored — Soniox detects the language from the audio stream. 
| ## Supported Languages For most engines, Telnyx passes the language code directly without validation. The supported set depends on which engine you use. Speechmatics is the exception — Telnyx accepts shorthand codes for bilingual/multilingual packs and maps them to the provider's `language` + `domain` configuration internally. | Engine | Languages | Reference | |--------|-----------|-----------| | Deepgram | 40+ languages across Deepgram models. Nova-3 supports `multi` (10 languages in code-switching mode). Flux supports English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. | [Deepgram languages](https://developers.deepgram.com/docs/models-languages-overview) | | Telnyx | Whisper-based. 50+ languages. `auto_detect` to skip language hint. | — | | Google | 125+ languages/locales. | [Google Cloud STT languages](https://cloud.google.com/speech-to-text/docs/speech-to-text-supported-languages) | | Azure | 100+ languages/locales. | [Azure Speech languages](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt) | | xAI | 25 languages, including Arabic, English, French, German, Hindi, Japanese, Korean, Portuguese, Spanish, and Vietnamese. | [xAI Voice API](https://docs.x.ai/developers/rest-api-reference/inference/voice) | | AssemblyAI | 6 languages with native multilingual code switching: English, Spanish, German, French, Portuguese, and Italian. | [AssemblyAI supported languages](https://www.assemblyai.com/docs/streaming/universal-3-pro/supported-languages) | | Speechmatics | 17+ languages. Standard codes (`en`, `es`, `cy`, `sw`, etc.) plus Telnyx shorthand for bilingual/multilingual packs (`ar_en`, `cmn_en`, `en_ms`, `en_ta`, `cmn_en_ms_ta`). Telnyx maps these internally to Speechmatics `language` + `domain` params — do not pass them raw to the provider. 
| [Speechmatics languages](https://docs.speechmatics.com/introduction/supported-languages) | | Soniox | Automatic language detection — no language hint required. The `language` parameter is ignored. | [Soniox docs](https://soniox.com/docs) | ## Common Codes | Code | Language | |------|----------| | `en` or `en-US` | English (US) | | `en-GB` | English (UK) | | `es` | Spanish | | `fr` | French | | `de` | German | | `pt-BR` | Portuguese (Brazil) | | `it` | Italian | | `ja` | Japanese | | `zh` | Chinese (Mandarin) | | `hi` | Hindi | | `ar` | Arabic | | `ko` | Korean | | `multi` | Multi-language / auto-detect (Deepgram) | --- ### Interim Results > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/interim-results.md **Deepgram, Speechmatics, and Soniox.** Other engines ignore this parameter. Controls whether the server sends partial (non-final) transcripts as speech is processed. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?interim_results=true ``` Default: `false`. Pass `"true"` as a string. **`interim_results=false`** (default) — Server sends only final transcripts. Each message has `is_final: true`. Lower message volume, higher latency per result. ```json {"transcript": "Hello, how are you?", "is_final": true, "confidence": 0.98} ``` **`interim_results=true`** — Server sends evolving partial transcripts as audio is processed, followed by a final. Partials have `is_final: false` and are replaced by the next message. ```json {"transcript": "Hello", "is_final": false, "confidence": 0.0} {"transcript": "Hello, how", "is_final": false, "confidence": 0.0} {"transcript": "Hello, how are you?", "is_final": true, "confidence": 0.98} ``` Partial transcripts have `confidence: 0.0` — confidence is only meaningful on final results. 
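Because each partial is replaced by the next message, a client usually keeps the current partial separate from committed finals. The sketch below illustrates one way to do that; `TranscriptBuffer` is a hypothetical class, and it assumes only the `transcript` and `is_final` fields shown in the examples above.

```python
import json


class TranscriptBuffer:
    """Keeps committed final segments separate from the current partial."""

    def __init__(self):
        self.finals = []   # stable segments, never rewritten
        self.partial = ""  # latest interim text, replaced on every update

    def feed(self, raw_message):
        data = json.loads(raw_message)
        if "transcript" not in data:
            return  # not a transcription result (for example an error frame)
        text = data["transcript"]
        if data.get("is_final"):
            if text:
                self.finals.append(text)
            self.partial = ""  # the final supersedes any pending partial
        else:
            self.partial = text

    def display(self):
        """Text to render: all finals followed by the current partial, if any."""
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


if __name__ == "__main__":
    buf = TranscriptBuffer()
    for msg in (
        '{"transcript": "Hello", "is_final": false, "confidence": 0.0}',
        '{"transcript": "Hello, how", "is_final": false, "confidence": 0.0}',
        '{"transcript": "Hello, how are you?", "is_final": true, "confidence": 0.98}',
    ):
        buf.feed(msg)
        print(buf.display())
```

Clearing the partial when a final arrives is what prevents the same words from appearing twice in the rendered text.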
--- ### Endpointing > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/endpointing.md **Deepgram, xAI, Google, Speechmatics, and Soniox.** Other engines ignore this parameter. Controls how long the engine waits after silence before finalizing an utterance. **Soniox has a different valid range.** When `transcription_engine=Soniox`, this parameter maps to `max_endpoint_delay_ms` and must be between **500 and 3000 ms**. Values outside that range are rejected. The default (100 ms) and the low-value examples below apply to Deepgram, xAI, Google, and Speechmatics only. ``` # Deepgram / xAI / Google / Speechmatics wss://api.telnyx.com/v2/speech-to-text/transcription?endpointing=300 # Soniox (500–3000 ms) wss://api.telnyx.com/v2/speech-to-text/transcription?transcription_engine=Soniox&endpointing=1000 ``` Default: `100` ms (not applicable to Soniox — Soniox endpointing is disabled unless a value in the 500–3000 ms range is provided). ## Values | Value | Behavior | |-------|----------| | Integer (ms) | Finalize after this many ms of silence. Lower = faster but more splits. | | `"false"` | Disable endpointing entirely. No automatic utterance boundaries. | ## Trade-offs **Low values (50–100 ms)** — Fast response. Utterances may split mid-sentence on short pauses. *(Deepgram, xAI, Google, Speechmatics only — below Soniox minimum.)* **High values (300–1000 ms)** — More complete sentences. Higher latency before finalization. **Soniox range (500–3000 ms)** — Minimum 500 ms. Use 500–800 ms for responsive turn detection, 1000–3000 ms for longer utterances with natural pauses. **Disabled (`"false"`)** — No automatic splits. Use `Finalize` control messages to manually trigger boundaries, or rely on `CloseStream` for a single final transcript. ## Interaction With Utterance End When endpointing triggers, Deepgram sends the final transcript followed by an utterance end event (if `utterance_end_ms` is configured server-side — currently 1000 ms). 
```json {"transcript": "Hello, how are you?", "is_final": true} {"transcript": "", "is_final": true, "utterance_end": true} ``` The utterance end marker signals "this speaker turn is done." See [Messages](/docs/voice/stt/websocket-streaming/responses) for details. --- ### Keyword Boosting > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/keyword-boosting.md **Deepgram only.** Other engines ignore these parameters. Two parameters control keyword boosting. They target different Deepgram model generations. ## `keyterm` — Nova-3 and Flux Comma-separated list of terms to boost. Simple — no intensifiers. ``` ?keyterm=Telnyx,WebRTC,SIP ``` Deepgram Nova-3 and Flux only. Ignored on older models. ## `keywords` — Nova (Legacy) Terms with optional intensity scores. Format: `keyword:intensifier`. ``` ?keywords=Telnyx:2 ``` Deepgram Nova only. Not supported on Flux (silently ignored). ## Which To Use | Model | Parameter | |-------|-----------| | Flux | `keyterm` | | Nova-3 | `keyterm` | | Nova | `keywords` | | Nova-2 | `keywords` | ## Examples Boost multiple terms on Nova-3: ``` ?transcription_engine=Deepgram&model=nova-3&keyterm=Telnyx,SIP,RTP,WebRTC ``` Boost with intensifiers on legacy Nova: ``` ?transcription_engine=Deepgram&model=nova&keywords=Telnyx:2&keywords=telephony:1 ``` --- ### Redaction > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/parameters/redaction.md **Deepgram only.** Other engines ignore this parameter. Replaces sensitive data in transcripts with placeholder text. ``` wss://api.telnyx.com/v2/speech-to-text/transcription?redact=pci ``` ## Values | Value | Redacts | |-------|---------| | `pci` | Credit card numbers | | `ssn` | Social Security numbers | | `numbers` | All numeric sequences | Multiple values can be passed as comma-separated: ``` ?redact=pci,ssn ``` ## Output Redacted content is replaced in the transcript text. The exact replacement format depends on the Deepgram model. 
```json {"transcript": "My card number is [REDACTED]", "is_final": true} ``` Redaction applies to final and interim results. There is no way to get the un-redacted version once redaction is enabled for a session. --- ### Messages > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/responses.md The WebSocket carries two frame types: binary frames (audio) from client to server, and JSON text frames in both directions. ## Client → Server ### Audio Data Binary WebSocket frames containing raw audio bytes. No base64, no JSON wrapping. Recommended chunk size: 2048–8192 bytes. Smaller chunks reduce latency; larger chunks reduce round trips. ``` [binary frame: audio bytes] ``` ### Control Messages JSON text frames with a `type` field. ```json Finalize {"type": "Finalize"} ``` ```json CloseStream {"type": "CloseStream"} ``` ```json KeepAlive {"type": "KeepAlive"} ``` | Type | Effect | Engine support | |------|--------|---------------| | `Finalize` | Flush audio buffer, force a final transcript | Deepgram only | | `CloseStream` | End session, close connection gracefully | Deepgram, Speechmatics, Soniox | | `KeepAlive` | Reset idle timeout | Deepgram only | Unknown text frames are silently ignored. --- ## Server → Client All server messages are JSON text frames. ### Transcription Result Emitted for each recognized speech segment (partial or final). ```json { "transcript": "Hello, how are you today?", "is_final": true, "speech_final": true, "confidence": 0.98 } ``` | Field | Type | Present | Description | |-------|------|---------|-------------| | `transcript` | string | Always | Transcribed text | | `is_final` | boolean | Always | `true` = finalized segment. `false` = interim (may revise). 
| | `speech_final` | boolean | Deepgram | `true` = speaker stopped talking | | `confidence` | float | When available | 0.0–1.0 confidence score | | `utterance_end` | boolean | Deepgram | `true` = silence-triggered utterance boundary | ### Utterance End Emitted on speaker pause (Deepgram). Empty transcript, `is_final: true`. ```json { "transcript": "", "is_final": true, "utterance_end": true } ``` ### Error Emitted on validation or connection errors. Connection closes shortly after. ```json { "errors": [ { "code": "40002", "title": "Unsupported format", "detail": "Format 'flac' is not supported by engine 'Azure'", "source": {"parameter": "input_format"} } ] } ``` | Field | Type | Description | |-------|------|-------------| | `errors` | array | One or more error objects | | `errors[].code` | string | Error code (see [Errors](/docs/voice/stt/websocket-streaming/errors)) | | `errors[].title` | string | Short description | | `errors[].detail` | string | Human-readable explanation | | `errors[].source.parameter` | string | Query parameter that caused the error | --- ## Message Flow **`interim_results=false`** (default) — server sends only final transcripts: ``` Client: [binary audio frames] Server: {"transcript": "Hello, how are you today?", "is_final": true, "speech_final": true, "confidence": 0.98} Client: [binary audio frames] Server: {"transcript": "I'm doing well.", "is_final": true, "speech_final": true, "confidence": 0.95} Client: {"type": "CloseStream"} [connection closed] ``` **`interim_results=true`** — server sends partials, then final: ``` Client: [binary audio frames] Server: {"transcript": "Hello", "is_final": false, "speech_final": false} Server: {"transcript": "Hello, how are", "is_final": false, "speech_final": false} Server: {"transcript": "Hello, how are you today?", "is_final": true, "speech_final": true, "confidence": 0.98} ``` Partials are best-effort and may revise. Only `is_final: true` results are stable. 
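The server message shapes described above can be told apart with a small dispatcher. This is a sketch under the assumption that the three shapes on this page (transcription result, utterance end, error) are the only ones a client needs to distinguish; `classify` is a hypothetical name.

```python
import json


def classify(raw_message):
    """Return (kind, payload) for one JSON text frame from the server.

    kind is one of "error", "utterance_end", "final", or "partial".
    """
    data = json.loads(raw_message)

    # Error frames carry an "errors" array; the connection closes shortly after.
    if "errors" in data:
        return "error", [err.get("code") for err in data["errors"]]

    # Utterance end markers have an empty transcript and must not be rendered.
    if data.get("utterance_end"):
        return "utterance_end", None

    kind = "final" if data.get("is_final") else "partial"
    return kind, data.get("transcript", "")


if __name__ == "__main__":
    print(classify('{"transcript": "Hello", "is_final": false, "speech_final": false}'))
    print(classify('{"transcript": "", "is_final": true, "utterance_end": true}'))
```

Checking `errors` and `utterance_end` before `is_final` matters, because an utterance end marker also has `is_final: true`.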
--- ### Errors > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/errors.md If you send an invalid parameter (unsupported engine, format, or format/engine combination), the server responds with a JSON error and closes the connection. ## Error Response Format ```json { "errors": [{ "code": "40001", "title": "Invalid Parameter", "detail": "Unsupported input_format 'aac'. Supported formats: mp3, wav, webm, ogg, flac, ogg_opus, webm_opus, linear16, linear32, mulaw, alaw, opus, amr_nb, amr_wb, g729, speex", "source": { "parameter": "input_format" } }] } ``` ## Error Codes | Code | Meaning | |------|---------| | `40001` | Invalid `input_format` value | | `40002` | Format not supported by the chosen engine | | `40003` | `sample_rate` required but missing (raw encoding or Google with non-WAV/FLAC) | | `40004` | `sample_rate` is not a valid positive integer | | `40005` | Invalid sample rate for the codec (e.g., `amr_nb` only supports 8000) | | `40006` | Format not supported by Flux model | | `40007` | Invalid `transcription_engine` value | --- ### Examples > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/examples.md Stream a WAV file and print transcripts. 
```python Python import asyncio import json import websockets API_KEY = "YOUR_API_KEY" AUDIO_FILE = "audio.wav" async def transcribe(): url = ( "wss://api.telnyx.com/v2/speech-to-text/transcription" "?transcription_engine=Deepgram" "&model=nova-3" "&input_format=wav" "&interim_results=true" ) headers = {"Authorization": f"Bearer {API_KEY}"} async with websockets.connect(url, additional_headers=headers) as ws: async def listen(): async for message in ws: data = json.loads(message) prefix = "FINAL" if data.get("is_final") else "partial" print(f"[{prefix}] {data.get('transcript', '')}") listener = asyncio.create_task(listen()) with open(AUDIO_FILE, "rb") as f: while chunk := f.read(4096): await ws.send(chunk) await asyncio.sleep(0.05) await asyncio.sleep(3) await ws.send(json.dumps({"type": "CloseStream"})) listener.cancel() asyncio.run(transcribe()) ``` ```javascript JavaScript const WebSocket = require("ws"); const fs = require("fs"); const API_KEY = "YOUR_API_KEY"; const AUDIO_FILE = "audio.wav"; const url = new URL("wss://api.telnyx.com/v2/speech-to-text/transcription"); url.searchParams.set("transcription_engine", "Deepgram"); url.searchParams.set("model", "nova-3"); url.searchParams.set("input_format", "wav"); url.searchParams.set("interim_results", "true"); const ws = new WebSocket(url.toString(), { headers: { Authorization: `Bearer ${API_KEY}` }, }); ws.on("open", () => { const audio = fs.readFileSync(AUDIO_FILE); for (let i = 0; i < audio.length; i += 4096) { ws.send(audio.slice(i, i + 4096)); } setTimeout(() => { ws.send(JSON.stringify({ type: "CloseStream" })); ws.close(); }, 3000); }); ws.on("message", (data) => { const msg = JSON.parse(data); const prefix = msg.is_final ? 
"FINAL" : "partial"; console.log(`[${prefix}] ${msg.transcript || ""}`); }); ws.on("error", (err) => console.error("Error:", err.message)); ``` --- ### Production Patterns > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/production-patterns.md Use these patterns when running the standalone WebSocket STT endpoint in production. ## Connection Recovery Treat the WebSocket session as disposable. Reconnect on network failure, server close, idle timeout, and process restart. | Event | Action | |-------|--------| | Connection fails before `open` | Retry with backoff. Do not send audio until the connection is open. | | Connection closes unexpectedly | Stop sending audio, preserve buffered audio, reconnect, then resume streaming. | | Error message received | Log `errors[].code`, `errors[].title`, and `errors[].source.parameter`. Reconnect only after fixing parameter errors. | | Graceful shutdown | Send `{"type": "CloseStream"}` and wait for final transcripts before closing the socket. | Set all query parameters on every reconnect. STT configuration cannot be changed mid-session. ## Backoff Use bounded exponential backoff with jitter. | Attempt | Base delay | |---------|------------| | 1 | 250 ms | | 2 | 500 ms | | 3 | 1 s | | 4 | 2 s | | 5+ | 5 s max | Add random jitter of 0-500 ms per attempt. Reset the attempt counter after a stable connection. Do not retry immediately on authentication or validation errors. Fix the API key, query parameters, engine, model, or format first. ## Partials Enable `interim_results=true` when the application needs live captions or low-latency UI updates. | Message | Handling | |---------|----------| | `is_final: false` | Display as temporary text. Replace it when a newer partial arrives. Do not persist it as final transcript. | | `is_final: true` | Commit to the transcript. Do not replace it with later partials. | | `utterance_end: true` | Treat as a segment boundary. Do not render an empty transcript as text. 
| Store final transcript segments separately from the current partial. This prevents duplicate text when a final result arrives after one or more interim results. ## Audio Buffering Buffer audio at the producer boundary, not inside the WebSocket send loop. | Control | Recommendation | |---------|----------------| | Chunk size | Send 2048-8192 byte binary frames. | | Queue size | Set a maximum buffered duration, such as 5-10 seconds. | | Backpressure | Pause or drop low-priority audio when the queue is full. | | Reconnect | Keep a short rolling buffer only if retranscription after reconnect is required. | Avoid unbounded queues. A slow or disconnected socket should not grow memory usage indefinitely. For live audio, prefer dropping stale buffered audio over sending it late. Late audio increases transcript delay and can make captions appear out of sync. ## Keepalive For Deepgram sessions, send `{"type": "KeepAlive"}` during long silence periods. Keep sending audio as binary frames when audio is available. For other engines, use the WebSocket client's ping/pong support when available and reconnect on missed heartbeats. ## Monitoring Track connection, latency, transcript, and buffer metrics. | Metric | Purpose | |--------|---------| | Connection attempts | Detect retry loops and regional network issues. | | Connection duration | Detect unstable sessions and idle timeout patterns. | | Close code and reason | Separate expected closes from failures. | | Error codes | Identify invalid parameters and engine compatibility issues. | | Audio queue depth | Detect send-loop backpressure. | | Partial-to-final latency | Measure caption freshness. | | Final transcript count | Detect stalled recognition. | | Empty final count | Detect silence segmentation behavior. | Log the selected `transcription_engine`, `model`, `input_format`, `sample_rate`, and `interim_results` value with each session. Redact API keys and user audio. ## Shutdown Use graceful shutdown for planned stops. 1. 
Drain the audio queue. 2. Send `{"type": "CloseStream"}`. 3. Wait for final transcript messages. 4. Close the WebSocket. Set a shutdown timeout. If final messages do not arrive before the timeout, close the socket and mark the transcript as incomplete. --- ### Pricing > Source: https://developers.telnyx.com/docs/voice/stt/websocket-streaming/pricing.md Pricing for WebSocket STT varies by engine and model. Contact [sales](https://telnyx.com/contact-us) or check the [pricing page](https://telnyx.com/pricing/speech-to-text) for current rates. --- ## REST API ### Overview > Source: https://developers.telnyx.com/docs/voice/stt/rest-api.md `POST /v2/ai/audio/transcriptions` Synchronous file transcription. Upload audio or pass a URL, get text back. ## Feature Support If you're coming from alternative providers: | Feature | Status | |---|---| | OpenAI SDK compatible | **Yes** — swap `base_url` and `api_key`, existing code works | | Multi-engine selection | **Yes** — 3 models behind one endpoint | | File upload | **Yes** | | URL transcription | **Yes** (`file_url`) | | Timestamps (segment) | **Yes** (`verbose_json`) | | Timestamps (word-level) | **Deepgram only** (via `model_config`) | | Diarization | **Deepgram only** (via `model_config`) | | Smart formatting | **Deepgram only** (via `model_config`) | | Multilingual | **Model-dependent** — whisper-turbo: 80+ languages, whisper-tiny: 50+ languages, Deepgram models support language coverage based on the selected model | | Async / webhooks | No | | Multichannel | No (forced mono) | | Export formats (SRT/VTT) | No | | Audio event tagging | No | | YouTube/TikTok URL | No | | Transcript retrieval | No | | File size limit | 100 MB | ## Quick Start ```python Python from openai import OpenAI client = OpenAI( api_key="YOUR_TELNYX_API_KEY", base_url="https://api.telnyx.com/v2", ) result = client.audio.transcriptions.create( model="openai/whisper-large-v3-turbo", file=open("audio.mp3", "rb"), ) ``` ```javascript JavaScript import 
OpenAI from "openai"; import fs from "fs"; const client = new OpenAI({ apiKey: "YOUR_TELNYX_API_KEY", baseURL: "https://api.telnyx.com/v2", }); const result = await client.audio.transcriptions.create({ model: "openai/whisper-large-v3-turbo", file: fs.createReadStream("audio.mp3"), }); ``` --- ### Overview > Source: https://developers.telnyx.com/docs/voice/stt/rest-api/parameters.md All parameters are sent as `multipart/form-data`. **`model`**: Model to use for transcription. See [Models](/docs/voice/stt/rest-api/parameters/models) for details. **Values:** `openai/whisper-large-v3-turbo` (default), `openai/whisper-tiny`, `deepgram/nova-3` **`file`**: Audio file to transcribe. Mutually exclusive with `file_url`. See [Audio Formats](/docs/voice/stt/rest-api/parameters/audio-formats) for supported formats, size limits, and per-model restrictions. **`file_url`**: Publicly accessible URL to an audio file. Mutually exclusive with `file`. See [Audio Formats](/docs/voice/stt/rest-api/parameters/audio-formats) for details on how `file` and `file_url` differ. **`language`**: Language hint. Behavior varies by model; see [Language](/docs/voice/stt/rest-api/parameters/language). **`response_format`**: Output shape. See [Response Format](/docs/voice/stt/rest-api/response). **Values:** `json` (default), `verbose_json` **`timestamp_granularities[]`**: Timestamp detail level. Only valid with `response_format=verbose_json`; returns 400 otherwise. **Values:** `segment` **`model_config`**: Deepgram-specific options. Only valid with `deepgram/nova-3`; returns 400 for other models. See [Model Config](/docs/voice/stt/rest-api/parameters/model-config). --- ### Models > Source: https://developers.telnyx.com/docs/voice/stt/rest-api/parameters/models.md Your choice of `model` determines which audio formats are accepted, what `language` values are valid, and what response fields are available. 
| | `openai/whisper-large-v3-turbo` | `openai/whisper-tiny` | `deepgram/nova-3` | |---|---|---|---| | **Default** | Yes | | | | **Audio formats** | All 10 | All 10 | mp3, wav only | | **Language** | 80+ languages, auto-detected | 50+ languages, auto-detected | English variants only (`en`, `en-US`, `en-GB`, `en-AU`, `en-NZ`, `en-IN`) | | **Timestamps** | No | No | Word-level (via `model_config`) | | **Diarization** | No | No | Yes (via `model_config`) | | **Smart formatting** | No | No | Yes (via `model_config`) | | **`model_config`** | Returns 400 | Returns 400 | [Deepgram pass-through](/docs/voice/stt/rest-api/parameters/model-config) | ### `openai/whisper-large-v3-turbo` Default model. Multilingual. Auto-detected if `language` omitted. See [Whisper docs](https://github.com/openai/whisper#available-models-and-languages) for the full language list. Returns text only — no timestamps regardless of `response_format`. ### `openai/whisper-tiny` Lightweight, lowest resource usage. Multilingual (50+ languages, auto-detected). Returns text only — no timestamps. ### `deepgram/nova-3` Highest accuracy for English. Advanced features (diarization, word timestamps, smart formatting, numerals, punctuation) available via [`model_config`](/docs/voice/stt/rest-api/parameters/model-config). Defaults `language` to `en` if omitted. Can also set `language` inside `model_config` — top-level field takes precedence. See [Deepgram language docs](https://developers.deepgram.com/docs/models-languages-overview) for details. --- ### Audio Formats > Source: https://developers.telnyx.com/docs/voice/stt/rest-api/parameters/audio-formats.md Applies to both `file` (multipart upload) and `file_url` (URL download). ## Common - **Max size:** 100 MB - **Processing:** All audio is decoded, resampled to 16kHz, and mixed to mono via ffmpeg before transcription. Container format doesn't matter as long as ffmpeg can decode it — the validated extension list is the actual restriction. 
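A client-side pre-check can catch the size and extension restrictions before an upload is attempted. The sketch below is illustrative only: `preflight` is a hypothetical helper, the extension sets are copied from the Models and Supported Formats tables on this page, and reading "100 MB" as 100,000,000 bytes is an assumption (the server may count differently).

```python
import os

# Assumption: "100 MB" is interpreted as decimal megabytes.
MAX_BYTES = 100 * 1000 * 1000

WHISPER_FORMATS = {
    "flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "oga", "ogg", "wav", "webm",
}
FORMATS_BY_MODEL = {
    "openai/whisper-large-v3-turbo": WHISPER_FORMATS,
    "openai/whisper-tiny": WHISPER_FORMATS,
    "deepgram/nova-3": {"mp3", "wav"},
}


def preflight(path, model, size_bytes=None):
    """Return a list of problems; an empty list means the file looks acceptable."""
    allowed = FORMATS_BY_MODEL.get(model)
    if allowed is None:
        return [f"unknown model: {model}"]

    problems = []
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in allowed:
        problems.append(f"extension '{ext}' is not accepted by {model}")

    # Allow callers to pass a known size; otherwise read it from disk.
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the 100 MB limit")

    return problems


if __name__ == "__main__":
    print(preflight("call-recording.flac", "deepgram/nova-3", size_bytes=2_000_000))
```

The check looks only at the file extension, so it is a convenience rather than a guarantee; the API response is still authoritative.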
## Supported Formats | Format | `whisper-turbo` | `whisper-tiny` | `deepgram/nova-3` | |---|---|---|---| | flac | Yes | Yes | No | | m4a | Yes | Yes | No | | mp3 | Yes | Yes | **Yes** | | mp4 | Yes | Yes | No | | mpeg | Yes | Yes | No | | mpga | Yes | Yes | No | | oga | Yes | Yes | No | | ogg | Yes | Yes | No | | wav | Yes | Yes | **Yes** | | webm | Yes | Yes | No | ## `file` vs `file_url` | | `file` | `file_url` | |---|---|---| | Delivery | Multipart upload in request body | Server downloads from URL before transcription | | Timeout | Request timeout | 15s download timeout | | Auth | N/A | URL must be publicly accessible (no auth headers forwarded) | | Validation | Same format and size checks | Same format and size checks | One of `file` or `file_url` is required. Sending both returns 400. --- ### Model Config > Source: https://developers.telnyx.com/docs/voice/stt/rest-api/parameters/model-config.md Deepgram only. Returns 400 if used with other models. Pass-through to [Deepgram's pre-recorded API](https://developers.deepgram.com/docs/pre-recorded-audio) query parameters. Every key-value pair in `model_config` is forwarded directly — Telnyx does not validate individual options. ## Commonly Used Options | Option | Type | Description | |---|---|---| | `smart_format` | boolean | Capitalization, punctuation, dates, numbers, currency | | `punctuate` | boolean | Add punctuation | | `diarize` | boolean | Speaker identification. Adds `speakers` array to `verbose_json` segments. | | `utterance` | boolean | Segment transcript into utterances | | `numerals` | boolean | Convert spoken numbers to digits | | `language` | string | Override language. Top-level `language` param takes precedence. 
| ## Example ```bash curl -X POST https://api.telnyx.com/v2/ai/audio/transcriptions \ -H "Authorization: Bearer $TELNYX_API_KEY" \ -F "file=@call-recording.mp3" \ -F "model=deepgram/nova-3" \ -F "response_format=verbose_json" \ -F 'model_config={"smart_format": true, "diarize": true, "punctuate": true}' ``` `model_config` can be sent as a JSON string in the multipart form field. The server parses it before forwarding. ## Unvalidated Pass-Through Any Deepgram query parameter can be passed. If Deepgram adds new options, they work immediately without a Telnyx API update. Conversely, invalid keys are forwarded and may cause Deepgram to return an error. Refer to [Deepgram's API reference](https://developers.deepgram.com/reference/listen-file) for the full list of supported parameters. --- ### Response Format > Source: https://developers.telnyx.com/docs/voice/stt/rest-api/parameters/response.md Controlled by the `response_format` parameter. ## `json` (Default) Text only. ```json { "text": "The quick brown fox jumps over the lazy dog." } ``` ## `verbose_json` Adds `duration` (seconds) and timestamped `segments` — **only when using `deepgram/nova-3`**. The Whisper models (`openai/whisper-large-v3-turbo`, `openai/whisper-tiny`) return text only regardless of `response_format`. See the [Timestamp Availability](#timestamp-availability-by-model) table below. Example response with `model=deepgram/nova-3`: ```json { "text": "The quick brown fox jumps over the lazy dog.", "duration": 3.42, "segments": [ { "id": 0, "text": "The quick brown fox jumps over the lazy dog.", "start": 0.0, "end": 3.42 } ] } ``` Set `timestamp_granularities[]=segment` alongside `response_format=verbose_json`. Using `timestamp_granularities` without `verbose_json` returns 400. 
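Since `segments` and `duration` are present only for some models, client code should treat them as optional. The following sketch parses either response shape; `Segment` and `parse_transcription` are hypothetical names, and only the `id`, `text`, `start`, and `end` fields from the example above are assumed.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    id: int
    text: str
    start: float  # seconds
    end: float    # seconds


def parse_transcription(body):
    """Return (text, duration, segments) from a json or verbose_json response.

    duration is None and segments is empty when the model returns text only.
    """
    data = json.loads(body)
    segments = [
        Segment(s["id"], s["text"], s["start"], s["end"])
        for s in data.get("segments", [])
    ]
    return data["text"], data.get("duration"), segments


if __name__ == "__main__":
    verbose = (
        '{"text": "The quick brown fox jumps over the lazy dog.", "duration": 3.42, '
        '"segments": [{"id": 0, "text": "The quick brown fox jumps over the lazy dog.", '
        '"start": 0.0, "end": 3.42}]}'
    )
    text, duration, segments = parse_transcription(verbose)
    print(text, duration, len(segments))
```

Using `data.get` for the optional fields lets the same function handle a Whisper response, which carries only `text`.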
## Segment Fields

| Field | Type | Description |
|---|---|---|
| `id` | integer | Zero-indexed segment number |
| `text` | string | Segment transcript |
| `start` | float | Start time in seconds |
| `end` | float | End time in seconds |
| `words` | array | Word-level timestamps (present when the backend provides them; Deepgram only) |
| `speakers` | array | Speaker labels (present when `diarize=true` in `model_config`; Deepgram only) |

## Timestamp Availability by Model

| Model | `verbose_json` timestamps |
|---|---|
| `openai/whisper-large-v3-turbo` | **No timestamps**: backend returns text only |
| `openai/whisper-tiny` | **No timestamps**: backend returns text only |
| `deepgram/nova-3` | Segment-level + word-level (from Deepgram response) |

## Streaming Response (Undocumented)

Sending `Accept: application/stream+json` returns newline-delimited JSON chunks as segments are transcribed. Each line:

```json
{"text": "segment text", "start": 0.0, "end": 3.42}
```

This is used internally but not in the public OAS spec.

---

### Pricing

> Source: https://developers.telnyx.com/docs/voice/stt/rest-api/pricing.md

Pricing for REST API STT varies by engine and model. Contact [sales](https://telnyx.com/contact-us) or check the [pricing page](https://telnyx.com/pricing/speech-to-text) for current rates.

---

## In-Call

### In-Call Transcription

> Source: https://developers.telnyx.com/docs/voice/stt/in-call-transcription.md

In-call transcription enables real-time STT on live voice calls. The audio codec is managed by the Telnyx platform, so no format configuration is needed.

Two integration paths:

- **Voice API**: `transcription_start` / `transcription_stop` commands on active calls. See the [Voice API STT guide](/docs/voice/programmable-voice/speech-to-text#voice-api).
- **TeXML**: XML-based call flow with transcription directives. See the [TeXML transcription guide](/docs/voice/programmable-voice/speech-to-text#texml).
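As a rough sketch of the Voice API path, the snippet below builds (but does not send) the HTTP request for a transcription command. Several details are assumptions not stated on this page: the `/calls/{call_control_id}/actions/...` URL layout, the `transcription_engine` body field (borrowed from the WebSocket query parameter in the quickstart), and the helper name `build_transcription_command`. Check the Voice API STT guide for the authoritative request shape.

```python
import json
import urllib.request

API_BASE = "https://api.telnyx.com/v2"
ALLOWED_ACTIONS = ("transcription_start", "transcription_stop")


def build_transcription_command(call_control_id, api_key,
                                action="transcription_start", engine=None):
    """Build a POST request for a transcription command on an active call.

    Hypothetical helper: the URL path and body field are assumptions.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")

    body = {}
    # Engine selection only makes sense when starting transcription.
    if engine is not None and action == "transcription_start":
        body["transcription_engine"] = engine  # assumed field name

    return urllib.request.Request(
        f"{API_BASE}/calls/{call_control_id}/actions/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the command. Keeping construction separate from sending makes the request easy to inspect or log first.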
Engine selection (Telnyx, Google, Deepgram, Azure, xAI, AssemblyAI, Speechmatics, Soniox) is specified as a parameter on the transcription command. Same engines as [WebSocket streaming](/docs/voice/stt/websocket-streaming/parameters/engines-and-models).

---

## API Reference (STT)

### Audio

- [Transcribe speech to text](https://developers.telnyx.com/api-reference/audio/transcribe-speech-to-text.md): Transcribe speech to text. This endpoint is consistent with the OpenAI Transcription API and may be used with the OpenAI JS or Python SDK.