Audio Generation

Mesh API provides a full suite of audio endpoints — convert text to speech, transcribe audio files, stream TTS/STT in real time, and browse available voices — all through a single API key.

All endpoints share the same base URL: https://api.meshapi.ai/v1/audio

Auth: Authorization: Bearer rsk_<your-key> on all REST requests. WebSocket endpoints accept the key via Sec-WebSocket-Protocol: Bearer <rsk_...> or ?api_key=<rsk_...>.


Text-to-Speech

POST /v1/audio/speech

Convert a text string into audio. The provider (ElevenLabs, Sarvam, etc.) is selected automatically based on the model you pass. Streaming is enabled by default.

Request body

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | sarvam/bulbul:v3 | Model ID. Determines which provider handles the request. |
| input | string | | Text to synthesize. Required. |
| voice | string | | ElevenLabs voice ID. Required for ElevenLabs models. |
| speaker | string | anushka | Speaker name for Sarvam models. Ignored for ElevenLabs. |
| stream | boolean | true | Stream audio chunks as they are generated. |
| response_format | string | provider default | Output audio format (e.g. mp3_44100_128, pcm_22050, wav_44100). |
| language_code | string | | BCP-47 language code (e.g. en-US). |
| voice_settings | object | | Fine-tune ElevenLabs voice: stability, similarity_boost, style, use_speaker_boost, speed. |
| seed | integer | | Reproducible generation seed. |
| previous_text | string | | Text that came before input, used for better prosody. |
| next_text | string | | Text that comes after input, used for better prosody. |
| apply_text_normalization | string | | auto, on, or off. Controls ElevenLabs text normalizer. |
| enable_logging | boolean | | Pass false to opt out of ElevenLabs request logging. |
| optimize_streaming_latency | integer | | ElevenLabs latency-quality trade-off level (0–4). |
| pitch | float | | Sarvam pitch adjustment. |
| pace | float | | Sarvam speaking pace. |
| loudness | float | | Sarvam output loudness. |
| target_language_code | string | hi-IN | Sarvam target language. |
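
As a rough illustration of how the fields above fit together, the sketch below assembles a request body on the client side. The helper name buildSpeechBody is hypothetical, and the rule that Sarvam model IDs start with "sarvam/" is an assumption inferred from the default model ID; the API itself remains the authority on validation.

```javascript
// Hypothetical helper: assembles a JSON body for POST /v1/audio/speech.
// Assumption: Sarvam model IDs start with "sarvam/" (inferred from the
// documented default "sarvam/bulbul:v3").
function buildSpeechBody({ model = "sarvam/bulbul:v3", input, ...options }) {
  if (typeof input !== "string" || input.length === 0) {
    throw new Error("input is required");
  }
  const isSarvam = model.startsWith("sarvam/");
  // Client-side sanity check only: `voice` is an ElevenLabs voice ID,
  // while Sarvam models take `speaker`.
  if (isSarvam && options.voice !== undefined && options.speaker === undefined) {
    throw new Error("Sarvam models use `speaker`, not `voice`");
  }
  const body = { model, input, ...options };
  // Drop undefined fields so the documented server-side defaults apply.
  for (const key of Object.keys(body)) {
    if (body[key] === undefined) delete body[key];
  }
  return body;
}

const body = buildSpeechBody({
  input: "Hello from Mesh API.",
  speaker: "anushka",
  target_language_code: "hi-IN",
  pace: undefined, // omitted from the final body
});
console.log(JSON.stringify(body));
```

Leaving optional fields out entirely, rather than sending null or empty values, keeps the request aligned with the defaults listed in the table.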

Supported output formats

Streaming (stream: true): mp3_22050_32, mp3_24000_48, mp3_44100_32/64/96/128/192, pcm_8000/16000/22050/24000/32000/44100/48000, ulaw_8000, alaw_8000, opus_48000_32/64/96/128/192

Non-streaming (stream: false): All of the above, plus wav_8000/16000/22050/24000/32000/44100/48000
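
The two lists above can be turned into a small client-side check, which may help catch a wav_* format being requested while streaming is still enabled. This is only a sketch: isFormatAllowed is a hypothetical helper, the lists are copied from this section, and individual providers may accept a narrower set.

```javascript
// Format lists copied from the "Supported output formats" section.
const expand = (prefix, values) => values.map((v) => `${prefix}_${v}`);

const STREAMING_FORMATS = new Set([
  "mp3_22050_32",
  "mp3_24000_48",
  ...expand("mp3_44100", [32, 64, 96, 128, 192]),
  ...expand("pcm", [8000, 16000, 22050, 24000, 32000, 44100, 48000]),
  "ulaw_8000",
  "alaw_8000",
  ...expand("opus_48000", [32, 64, 96, 128, 192]),
]);

// wav_* formats are listed only for non-streaming requests.
const NON_STREAMING_ONLY = new Set(
  expand("wav", [8000, 16000, 22050, 24000, 32000, 44100, 48000])
);

// Hypothetical client-side check; `stream` defaults to true, as in the API.
function isFormatAllowed(format, stream = true) {
  if (STREAMING_FORMATS.has(format)) return true;
  return !stream && NON_STREAMING_ONLY.has(format);
}

console.log(isFormatAllowed("mp3_44100_128"));    // true
console.log(isFormatAllowed("wav_44100"));        // false, streaming is on by default
console.log(isFormatAllowed("wav_44100", false)); // true
```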

Response

The response body is raw audio bytes with the Content-Type matching the requested format (e.g. audio/mpeg, audio/wav).

Examples

curl -X POST https://api.meshapi.ai/v1/audio/speech \
  -H "Authorization: Bearer rsk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam/bulbul:v3",
    "input": "Hello! This is a test of the Mesh API text-to-speech endpoint.",
    "speaker": "anushka",
    "response_format": "mp3_44100_128"
  }' \
  --output speech.mp3
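
The same request can be sketched in JavaScript. The function name synthesizeSpeech and the injectable fetchImpl parameter are illustrative choices, not part of the API. The sketch assumes a runtime with a global fetch (for example Node.js 18+ or a browser) and assumes that failures surface as non-2xx status codes, since the error body shape is not described here.

```javascript
// Sketch: POST /v1/audio/speech and collect the raw audio bytes.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function synthesizeSpeech(apiKey, body, fetchImpl = fetch) {
  const res = await fetchImpl("https://api.meshapi.ai/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    // Assumption: errors are reported through a non-2xx HTTP status.
    throw new Error(`Speech request failed with HTTP ${res.status}`);
  }
  return {
    // Expected to match the requested format, e.g. audio/mpeg or audio/wav.
    contentType: res.headers.get("content-type"),
    // arrayBuffer() waits for the full body, even when the server streams chunks.
    audio: new Uint8Array(await res.arrayBuffer()),
  };
}

// Usage (the key is a placeholder):
//   const { audio } = await synthesizeSpeech("rsk_...", {
//     model: "sarvam/bulbul:v3",
//     input: "Hello!",
//     speaker: "anushka",
//   });
//   require("node:fs").writeFileSync("speech.mp3", audio);
```

Buffering the whole response is the simplest option for saving a file; to start playback before synthesis finishes, reading res.body incrementally would be the alternative.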

WebSocket TTS Streaming

WS /v1/audio/speech/stream/{voice_id}

Stream text-to-speech in real time. You send text chunks as they become available (e.g. as an LLM streams tokens), and receive audio back chunk by chunk — minimising latency compared to the REST endpoint.

This proxies ElevenLabs’ stream-input WebSocket. The voice_id is part of the URL path.

Authentication

Pass your Mesh API key in one of two ways:

  • Sec-WebSocket-Protocol: Bearer rsk_... header
  • ?api_key=rsk_... query parameter

Query parameters

| Parameter | Default | Description |
| --- | --- | --- |
| model_id | sarvam/bulbul:v3 | ElevenLabs model to use. |
| output_format | pcm_22050 | Audio format for streamed audio. See supported formats below. |
| language_code | | BCP-47 language code. |
| enable_logging | | true/false. ElevenLabs logging opt-out. |
| enable_ssml_parsing | | Enable SSML in the input text. |
| inactivity_timeout | | Seconds of inactivity before the session closes (1–180). |
| sync_alignment | | Return word-level alignment data with each audio chunk. |
| auto_mode | | Optimise for low-latency, fully-automated generation. |
| apply_text_normalization | | auto, on, or off. |
| seed | | Reproducible seed (0–4294967295). |

Supported output formats: mp3_22050_32, mp3_44100_32/64/96/128/192, pcm_16000/22050/24000/44100, ulaw_8000
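
One way to assemble the connection URL from these parameters is sketched below. buildTtsStreamUrl is a hypothetical helper; it passes the key via the ?api_key= query parameter and applies the documented ranges for inactivity_timeout and seed as client-side checks. URLSearchParams percent-encodes values such as the "/" and ":" in a model ID.

```javascript
// Hypothetical helper: builds the URL for
// WS /v1/audio/speech/stream/{voice_id}, authenticating via ?api_key=.
function buildTtsStreamUrl(voiceId, apiKey, params = {}) {
  const { inactivity_timeout: timeout, seed } = params;
  if (timeout !== undefined && (timeout < 1 || timeout > 180)) {
    throw new RangeError("inactivity_timeout must be between 1 and 180 seconds");
  }
  if (
    seed !== undefined &&
    (!Number.isInteger(seed) || seed < 0 || seed > 4294967295)
  ) {
    throw new RangeError("seed must be an integer between 0 and 4294967295");
  }
  const url = new URL(
    `wss://api.meshapi.ai/v1/audio/speech/stream/${encodeURIComponent(voiceId)}`
  );
  // Only parameters that were actually provided end up in the query string.
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined) url.searchParams.set(name, String(value));
  }
  url.searchParams.set("api_key", apiKey);
  return url.toString();
}

// The voice ID and key below are placeholders.
console.log(
  buildTtsStreamUrl("JBFqnCBsd6RMkjVDRZzb", "rsk_...", {
    output_format: "mp3_44100_128",
    inactivity_timeout: 60,
  })
);
```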

Message protocol

Client → server (JSON frames)

| Frame type | Fields | Description |
| --- | --- | --- |
| initializeConnection | text: " ", optional voice_settings, generation_config, pronunciation_dictionary_locators | Must be the first message sent. |
| sendText | text: string, optional try_trigger_generation, voice_settings, flush | Send a chunk of text to synthesize. |
| closeConnection | text: "" | Signal end of input and close the session cleanly. |

Server → client (JSON frames)

| Frame type | Fields | Description |
| --- | --- | --- |
| AudioOutput | audio: string (base64), optional alignment | Audio chunk. Decode from base64 to get raw audio bytes. |
| FinalOutput | isFinal: true | Signals all audio has been sent. |
| error | message: string | Sent before closing on auth, format, or upstream errors. |

Any xi-api-key or authorization fields you include in client frames are stripped before being forwarded upstream — your ElevenLabs credentials are never exposed to the client.
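
The frame shapes in this section can be wrapped in a few small helpers. The names below (initFrame, textFrame, closeFrame, parseServerFrame) are hypothetical. In particular, recognising an error frame by a string message field is an assumption based on the table; the exact error shape may differ. The sketch relies on the global atob, available in browsers and in Node.js 16+.

```javascript
// Client -> server frames, following the "Message protocol" tables.
const initFrame = (voiceSettings) =>
  JSON.stringify(
    voiceSettings ? { text: " ", voice_settings: voiceSettings } : { text: " " }
  );
const textFrame = (text, flush = false) =>
  JSON.stringify(flush ? { text, flush: true } : { text });
const closeFrame = () => JSON.stringify({ text: "" });

// Server -> client: classify one raw JSON frame.
// Assumption: error frames are identified by a string `message` field.
function parseServerFrame(raw) {
  const frame = JSON.parse(raw);
  if (typeof frame.audio === "string") {
    // Decode base64 into real bytes rather than a binary string.
    const bytes = Uint8Array.from(atob(frame.audio), (ch) => ch.charCodeAt(0));
    return { kind: "audio", bytes, alignment: frame.alignment ?? null };
  }
  if (frame.isFinal === true) return { kind: "final" };
  if (typeof frame.message === "string") {
    return { kind: "error", message: frame.message };
  }
  return { kind: "unknown", frame };
}

console.log(initFrame());               // {"text":" "}
console.log(textFrame("Hello.", true)); // {"text":"Hello.","flush":true}
console.log(closeFrame());              // {"text":""}
```

Returning an "unknown" kind instead of throwing keeps a client tolerant of frame types that are not covered in this section.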

Example

const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/speech/stream/JBFqnCBsd6RMkjVDRZzb" +
    "?model_id=sarvam/bulbul:v3&output_format=mp3_44100_128&api_key=rsk_..."
);

ws.onopen = () => {
  // 1. Initialize the connection
  ws.send(JSON.stringify({ text: " " }));

  // 2. Stream text chunks
  ws.send(JSON.stringify({ text: "Hello, this is streamed text-to-speech." }));

  // 3. Close when done
  ws.send(JSON.stringify({ text: "" }));
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.audio) {
    // Decode base64 audio and play / write to file
    const audioBytes = atob(frame.audio);
    console.log(`Received ${audioBytes.length} bytes of audio`);
  }
  if (frame.isFinal) {
    console.log("Stream complete");
    ws.close();
  }
};