Audio Generation

Mesh API provides a full suite of audio endpoints — convert text to speech, transcribe audio files, stream TTS/STT in real time, and browse available voices — all through a single API key.

All endpoints share the same base URL: https://api.meshapi.ai/v1/audio

Auth: send Authorization: Bearer rsk_<your-key> on all REST requests. WebSocket endpoints accept the key via the Sec-WebSocket-Protocol: Bearer <rsk_...> header or the ?api_key=<rsk_...> query parameter.

Text-to-Speech

POST /v1/audio/speech

Convert a text string into audio. The request is routed automatically to the brand behind the model you pass (ElevenLabs, Sarvam, etc.); models such as hexgrad/kokoro-82m and cartesia/sonic-2 are also available. Streaming is enabled by default (stream: true).

Request body

Field	Type	Default	Description
`model`	string	`eleven_flash_v2_5`	Model ID. Determines which provider handles the request.
`input`	string	—	Text to synthesize. Required.
`voice`	string	—	Voice ID/name for the selected model’s brand (an ElevenLabs voice ID, or a Kokoro/Cartesia voice). Browse valid IDs with `GET /v1/audio/voices` (filter by `brand`, `model`, or `search`). Required for ElevenLabs models.
`speaker`	string	`anushka`	Speaker name for Sarvam models. Ignored for ElevenLabs.
`stream`	boolean	`true`	Stream audio chunks as they are generated.
`response_format`	string	provider default	Output audio format (e.g. `mp3_44100_128`, `pcm_22050`, `wav_44100`).
`language_code`	string	—	BCP-47 language code (e.g. `en-US`).
`voice_settings`	object	—	Fine-tune ElevenLabs voice: `stability`, `similarity_boost`, `style`, `use_speaker_boost`, `speed`.
`seed`	integer	—	Reproducible generation seed.
`previous_text`	string	—	Text that came before `input` — used for better prosody.
`next_text`	string	—	Text that comes after `input` — used for better prosody.
`apply_text_normalization`	string	—	`auto`, `on`, or `off`. Controls ElevenLabs text normalizer.
`enable_logging`	boolean	—	Pass `false` to opt out of ElevenLabs request logging.
`optimize_streaming_latency`	integer	—	ElevenLabs latency-quality trade-off level (0–4).
`pitch`	float	—	Sarvam pitch adjustment.
`pace`	float	—	Sarvam speaking pace.
`loudness`	float	—	Sarvam output loudness.
`target_language_code`	string	`hi-IN`	Sarvam target language.
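The field rules in the table above can be sketched as a small payload builder. This is a minimal illustration, not an official client: the function name `build_speech_payload` is hypothetical, and the check that ElevenLabs model IDs start with `eleven` is an assumption based on the IDs shown in this document.

```python
import json


def build_speech_payload(text, model="eleven_flash_v2_5", voice=None,
                         speaker=None, stream=True, **extra):
    """Assemble the JSON body for POST /v1/audio/speech (hypothetical helper)."""
    if not text:
        raise ValueError("input is required")
    # Assumption: ElevenLabs model IDs start with "eleven" (e.g. eleven_flash_v2_5).
    if model.startswith("eleven") and not voice:
        raise ValueError("voice is required for ElevenLabs models")

    payload = {"model": model, "input": text, "stream": stream}
    if voice:
        payload["voice"] = voice
    if speaker:
        payload["speaker"] = speaker
    # Pass through optional fields (response_format, seed, ...), dropping unset ones.
    payload.update({k: v for k, v in extra.items() if v is not None})
    return payload


payload = build_speech_payload(
    "Hello from Mesh API.",
    voice="JBFqnCBsd6RMkjVDRZzb",
    response_format="mp3_44100_128",
)
print(json.dumps(payload, indent=2))
```

Validating locally before sending keeps an avoidable error (such as a missing `voice`) from costing a round trip; the server remains the source of truth for what each model accepts.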

Supported output formats

Streaming (stream: true): mp3_22050_32, mp3_24000_48, mp3_44100_32/64/96/128/192, pcm_8000/16000/22050/24000/32000/44100/48000, ulaw_8000, alaw_8000, opus_48000_32/64/96/128/192

Non-streaming (stream: false): All of the above, plus wav_8000/16000/22050/24000/32000/44100/48000
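As an illustration of the two lists above, the sketch below expands the shorthand into full format names and checks a `response_format` against the `stream` flag. The helper names are hypothetical, and the document does not say whether every provider accepts every listed format, so treat this as a client-side sanity check only.

```python
def expand(prefix, variants):
    """Expand shorthand such as pcm_8000/16000 into full format names."""
    return {f"{prefix}_{v}" for v in variants}


SAMPLE_RATES = ["8000", "16000", "22050", "24000", "32000", "44100", "48000"]
BITRATES = ["32", "64", "96", "128", "192"]

STREAMING_FORMATS = (
    {"mp3_22050_32", "mp3_24000_48", "ulaw_8000", "alaw_8000"}
    | expand("mp3_44100", BITRATES)
    | expand("pcm", SAMPLE_RATES)
    | expand("opus_48000", BITRATES)
)
# Non-streaming requests accept everything above plus the WAV variants.
NON_STREAMING_FORMATS = STREAMING_FORMATS | expand("wav", SAMPLE_RATES)


def is_valid_format(response_format, stream=True):
    """Return True if the format is listed for the given stream mode."""
    allowed = STREAMING_FORMATS if stream else NON_STREAMING_FORMATS
    return response_format in allowed


print(is_valid_format("wav_44100", stream=True))   # False: WAV is non-streaming only
print(is_valid_format("wav_44100", stream=False))  # True
```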

Response

The response body is raw audio bytes with the Content-Type matching the requested format (e.g. audio/mpeg, audio/wav).

Examples

curl (streaming)

curl -X POST https://api.meshapi.ai/v1/audio/speech \
  -H "Authorization: Bearer rsk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "eleven_flash_v2_5",
    "input": "Hello! This is a test of the Mesh API text-to-speech endpoint.",
    "voice": "JBFqnCBsd6RMkjVDRZzb",
    "response_format": "mp3_44100_128"
  }' \
  --output speech.mp3
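A rough Python equivalent of the curl call, using only the standard library, might look like the following. The helper names `build_request` and `synthesize_to_file` are hypothetical; the URL, headers, and body fields come from this page. The non-streaming WAV variant simply sets `stream` to `false` and picks a `wav_*` format.

```python
import json
import urllib.request

API_URL = "https://api.meshapi.ai/v1/audio/speech"


def build_request(api_key, payload):
    """Build the POST request without sending it (hypothetical helper)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key, payload, path):
    """Send the request and write the raw audio bytes to `path`."""
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        # Read in chunks so streamed audio is written as it arrives.
        while chunk := response.read(8192):
            out.write(chunk)


wav_payload = {
    "model": "eleven_flash_v2_5",
    "input": "Hello! This is a test of the Mesh API text-to-speech endpoint.",
    "voice": "JBFqnCBsd6RMkjVDRZzb",
    "stream": False,
    "response_format": "wav_44100",
}
# Usage (performs a network call, so it is left commented out):
# synthesize_to_file("rsk_...", wav_payload, "speech.wav")
```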

WebSocket TTS Streaming

WS /v1/audio/speech/stream/{voice_id}

Stream text-to-speech in real time. You send text chunks as they become available (e.g. as an LLM streams tokens) and receive audio back chunk by chunk, which reduces latency compared to the REST endpoint. The voice_id is part of the URL path. The streaming frame protocol depends on the model family:

Standard streaming models — e.g. hexgrad/kokoro-82m, cartesia/sonic-2, cartesia/sonic-3, canopylabs/orpheus-3b-0.1-ft — use the text-buffer protocol.
ElevenLabs models (elevenlabs/*) use ElevenLabs’ native stream-input frames.

Authentication

Pass your Mesh API key in one of two ways:

Sec-WebSocket-Protocol: Bearer rsk_... header
?api_key=rsk_... query parameter

Query parameters

Parameter	Default	Description
`model_id`	`eleven_flash_v2_5`	Model to use. Determines the streaming frame protocol.
`output_format`	`pcm_22050`	Audio format for streamed audio. See supported formats below.
`language_code`	—	BCP-47 language code.

ElevenLabs models additionally accept the following query parameters:

Parameter	Description
`enable_logging`	`true`/`false`. Logging opt-out.
`enable_ssml_parsing`	Enable SSML in the input text.
`inactivity_timeout`	Seconds of inactivity before the session closes (1–180).
`sync_alignment`	Return word-level alignment data with each audio chunk.
`auto_mode`	Optimise for low-latency, fully-automated generation.
`apply_text_normalization`	`auto`, `on`, or `off`.
`seed`	Reproducible seed (0–4294967295).
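To show how the path and query parameters fit together, here is a minimal sketch of a URL builder. The function name `build_tts_stream_url` is hypothetical; the defaults mirror the query parameter table. The API key is deliberately left out of the URL on the assumption that you pass it through the Sec-WebSocket-Protocol header instead, though `api_key` can be supplied as an extra parameter.

```python
from urllib.parse import quote, urlencode

WS_BASE = "wss://api.meshapi.ai/v1/audio/speech/stream"


def build_tts_stream_url(voice_id, model_id="eleven_flash_v2_5",
                         output_format="pcm_22050", language_code=None, **extra):
    """Build the WebSocket URL: voice_id in the path, options in the query."""
    params = {"model_id": model_id, "output_format": output_format}
    if language_code:
        params["language_code"] = language_code
    for name, value in extra.items():
        if value is None:
            continue
        # Booleans are sent as lowercase true/false.
        params[name] = str(value).lower() if isinstance(value, bool) else value
    # Keep "/" readable in model IDs such as hexgrad/kokoro-82m.
    return f"{WS_BASE}/{quote(voice_id, safe='')}?{urlencode(params, safe='/')}"


print(build_tts_stream_url("af_bella", model_id="hexgrad/kokoro-82m",
                           output_format="pcm_24000"))
```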

Message protocol — standard streaming models

Client → server (JSON frames)

Frame type	Fields	Description
`input_text_buffer.append`	`text: string`	Append a chunk of text to the buffer.
`input_text_buffer.commit`	`{}`	Flush the buffer and finish synthesis.

Server → client (JSON frames)

Frame type	Fields	Description
`conversation.item.audio_output.delta`	`delta: string` (base64)	Audio chunk. Decode from base64 to get raw audio bytes.
`conversation.item.audio_output.done`	`{}`	Signals all audio has been sent.
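The frame tables above translate into a few small helpers. This sketch only builds and parses frames and does not open a socket, so it works with any WebSocket client; the function names are hypothetical, while the frame types and fields are the ones listed above.

```python
import base64
import json


def append_frame(text):
    """Client frame: append a chunk of text to the buffer."""
    return json.dumps({"type": "input_text_buffer.append", "text": text})


def commit_frame():
    """Client frame: flush the buffer and finish synthesis."""
    return json.dumps({"type": "input_text_buffer.commit"})


def handle_frame(raw, audio):
    """Process one server frame.

    Decoded audio from delta frames is appended to `audio` (a bytearray).
    Returns True once the done frame has been received.
    """
    frame = json.loads(raw)
    kind = frame.get("type")
    if kind == "conversation.item.audio_output.delta":
        audio.extend(base64.b64decode(frame["delta"]))
    return kind == "conversation.item.audio_output.done"
```

A receive loop would call `handle_frame` on every incoming message and close the socket when it returns `True`; the accumulated `audio` bytes are in whatever `output_format` was requested.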

Example

const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/speech/stream/af_bella" +
    "?model_id=hexgrad/kokoro-82m&output_format=pcm_24000&api_key=rsk_..."
);

ws.onopen = () => {
  // 1. Append text chunks
  ws.send(
    JSON.stringify({
      type: "input_text_buffer.append",
      text: "Hello, this is streamed text-to-speech.",
    })
  );

  // 2. Flush + finish
  ws.send(JSON.stringify({ type: "input_text_buffer.commit" }));
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.type === "conversation.item.audio_output.delta") {
    // Decode base64 into raw bytes, then play / write to file
    const audioBytes = Uint8Array.from(atob(frame.delta), (c) => c.charCodeAt(0));
    console.log(`Received ${audioBytes.length} bytes of audio`);
  }
  if (frame.type === "conversation.item.audio_output.done") {
    console.log("Stream complete");
    ws.close();
  }
};

Message protocol — ElevenLabs models

Client → server (JSON frames)

Frame type	Fields	Description
`initializeConnection`	`text: " "`, optional `voice_settings`, `generation_config`, `pronunciation_dictionary_locators`	Must be the first message sent.
`sendText`	`text: string`, optional `try_trigger_generation`, `voice_settings`, `flush`	Send a chunk of text to synthesize.
`closeConnection`	`text: ""`	Signal end of input and close the session cleanly.

Server → client (JSON frames)

Frame type	Fields	Description
`AudioOutput`	`audio: string` (base64), optional `alignment`	Audio chunk. Decode from base64 to get raw audio bytes.
`FinalOutput`	`isFinal: true`	Signals all audio has been sent.
`error`	`message: string`	Sent before closing on auth, format, or upstream errors.

Credential fields in client frames (xi-api-key, authorization, api_key) are stripped before the frames are forwarded upstream, and the upstream provider credentials are never exposed to the client.
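The ElevenLabs frames listed above can be sketched the same way. The function names are hypothetical. Because these frames are identified by their fields rather than a type key, the error check below (a `message` field with no `audio`) is an assumption about how error frames look on the wire.

```python
import base64
import json


def init_frame(voice_settings=None):
    """First client frame: a single space, plus optional voice settings."""
    frame = {"text": " "}
    if voice_settings:
        frame["voice_settings"] = voice_settings
    return json.dumps(frame)


def text_frame(text, flush=False):
    """Client frame carrying a chunk of text to synthesize."""
    frame = {"text": text}
    if flush:
        frame["flush"] = True
    return json.dumps(frame)


def close_frame():
    """Final client frame: empty text signals end of input."""
    return json.dumps({"text": ""})


def handle_elevenlabs_frame(raw, audio):
    """Append decoded audio to `audio`; return True on the final frame."""
    frame = json.loads(raw)
    # Assumption: error frames carry "message" and no "audio".
    if frame.get("message") and not frame.get("audio"):
        raise RuntimeError(frame["message"])
    if frame.get("audio"):
        audio.extend(base64.b64decode(frame["audio"]))
    return bool(frame.get("isFinal"))
```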

Example

const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/speech/stream/JBFqnCBsd6RMkjVDRZzb" +
    "?model_id=eleven_flash_v2_5&output_format=mp3_44100_128&api_key=rsk_..."
);

ws.onopen = () => {
  // 1. Initialize the connection
  ws.send(JSON.stringify({ text: " " }));

  // 2. Stream text chunks
  ws.send(JSON.stringify({ text: "Hello, this is streamed text-to-speech." }));

  // 3. Close when done
  ws.send(JSON.stringify({ text: "" }));
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.audio) {
    // Decode base64 into raw bytes, then play / write to file
    const audioBytes = Uint8Array.from(atob(frame.audio), (c) => c.charCodeAt(0));
    console.log(`Received ${audioBytes.length} bytes of audio`);
  }
  if (frame.isFinal) {
    console.log("Stream complete");
    ws.close();
  }
};

Supported output formats

ElevenLabs models: mp3_22050_32, mp3_44100_32/64/96/128/192, pcm_16000/22050/24000/44100, ulaw_8000

Standard streaming models: pcm, mp3, wav, opus, aac, flac, each with an optional sample rate (e.g. pcm_24000).
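A client-side check for the two format families might look like the sketch below. The names are hypothetical, and the regular expression encodes one reading of "optional sample rate": a codec name optionally followed by a single numeric suffix. The server may accept or reject more than this check does.

```python
import re

ELEVENLABS_WS_FORMATS = {
    "mp3_22050_32",
    "mp3_44100_32", "mp3_44100_64", "mp3_44100_96", "mp3_44100_128", "mp3_44100_192",
    "pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100",
    "ulaw_8000",
}

# Codec name, optionally followed by "_<sample rate>", e.g. "pcm" or "pcm_24000".
STANDARD_WS_FORMAT = re.compile(r"(pcm|mp3|wav|opus|aac|flac)(_\d+)?")


def is_valid_ws_format(output_format, elevenlabs):
    """Check output_format for a WebSocket TTS session (client-side only)."""
    if elevenlabs:
        return output_format in ELEVENLABS_WS_FORMATS
    return STANDARD_WS_FORMAT.fullmatch(output_format) is not None


print(is_valid_ws_format("pcm_24000", elevenlabs=False))  # True
print(is_valid_ws_format("pcm_24000", elevenlabs=True))   # True
print(is_valid_ws_format("flac", elevenlabs=True))        # False
```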