# Text-to-Speech

> Text-to-speech, speech-to-text, voice management, and real-time streaming audio APIs.

# Audio Generation

Mesh API provides a full suite of audio endpoints — convert text to speech, transcribe audio files, stream TTS/STT in real time, and browse available voices — all through a single API key.

All endpoints share the same base URL: `https://api.meshapi.ai/v1/audio`

**Auth:** Send `Authorization: Bearer rsk_<your-key>` on every REST request. WebSocket endpoints accept the key either in a `Sec-WebSocket-Protocol: Bearer rsk_<your-key>` header or as an `?api_key=rsk_<your-key>` query parameter.

***

## Text-to-Speech

`POST /v1/audio/speech`

Convert a text string into audio. The provider (ElevenLabs, Sarvam, etc.) is selected automatically based on the model you pass. Streaming is enabled by default.

### Request body

| Field                        | Type    | Default            | Description                                                                                         |
| :--------------------------- | :------ | :----------------- | :-------------------------------------------------------------------------------------------------- |
| `model`                      | string  | `sarvam/bulbul:v3` | Model ID. Determines which provider handles the request.                                            |
| `input`                      | string  | —                  | Text to synthesize. **Required.**                                                                   |
| `voice`                      | string  | —                  | ElevenLabs voice ID. Required for ElevenLabs models.                                                |
| `speaker`                    | string  | `anushka`          | Speaker name for Sarvam models. Ignored for ElevenLabs.                                             |
| `stream`                     | boolean | `true`             | Stream audio chunks as they are generated.                                                          |
| `response_format`            | string  | provider default   | Output audio format (e.g. `mp3_44100_128`, `pcm_22050`, `wav_44100`).                               |
| `language_code`              | string  | —                  | BCP-47 language code (e.g. `en-US`).                                                                |
| `voice_settings`             | object  | —                  | Fine-tune ElevenLabs voice: `stability`, `similarity_boost`, `style`, `use_speaker_boost`, `speed`. |
| `seed`                       | integer | —                  | Reproducible generation seed.                                                                       |
| `previous_text`              | string  | —                  | Text that came before `input` — used for better prosody.                                            |
| `next_text`                  | string  | —                  | Text that comes after `input` — used for better prosody.                                            |
| `apply_text_normalization`   | string  | —                  | `auto`, `on`, or `off`. Controls ElevenLabs text normalizer.                                        |
| `enable_logging`             | boolean | —                  | Pass `false` to opt out of ElevenLabs request logging.                                              |
| `optimize_streaming_latency` | integer | —                  | ElevenLabs latency-quality trade-off level (0–4).                                                   |
| `pitch`                      | float   | —                  | Sarvam pitch adjustment.                                                                            |
| `pace`                       | float   | —                  | Sarvam speaking pace.                                                                               |
| `loudness`                   | float   | —                  | Sarvam output loudness.                                                                             |
| `target_language_code`       | string  | `hi-IN`            | Sarvam target language.                                                                             |
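
The provider-specific fields in the table can be kept apart with a small helper. The sketch below is illustrative only: `build_speech_payload` is a hypothetical name, and treating any model ID that starts with `sarvam/` as a Sarvam model (and every other model as one that needs `voice`) is an assumption drawn from the default `sarvam/bulbul:v3`, not a documented rule.

```python
from typing import Any, Optional


def build_speech_payload(
    text: str,
    model: str = "sarvam/bulbul:v3",
    voice: Optional[str] = None,
    speaker: Optional[str] = None,
    **extra: Any,
) -> dict[str, Any]:
    """Build a JSON body for POST /v1/audio/speech (hypothetical helper)."""
    if not text:
        raise ValueError("`input` is required")

    payload: dict[str, Any] = {"model": model, "input": text}

    # Assumption: Sarvam model IDs start with "sarvam/"; any other model is
    # treated here as one that requires an ElevenLabs `voice` ID.
    if model.startswith("sarvam/"):
        if speaker is not None:
            payload["speaker"] = speaker  # omitted -> server default "anushka"
    else:
        if voice is None:
            raise ValueError("`voice` is required for ElevenLabs models")
        payload["voice"] = voice

    # Pass through optional fields such as `stream` or `response_format`,
    # dropping any left as None so the server-side defaults apply.
    payload.update({k: v for k, v in extra.items() if v is not None})
    return payload
```

The returned dictionary can be passed as the `json=` argument of an HTTP client call, for example `build_speech_payload("Hello", speaker="anushka", stream=False)`.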

### Supported output formats

**Streaming (`stream: true`):** `mp3_22050_32`, `mp3_24000_48`, `mp3_44100_32/64/96/128/192`, `pcm_8000/16000/22050/24000/32000/44100/48000`, `ulaw_8000`, `alaw_8000`, `opus_48000_32/64/96/128/192`

**Non-streaming (`stream: false`):** All of the above, plus `wav_8000/16000/22050/24000/32000/44100/48000`
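
The `/`-separated shorthand above expands to one format name per value, so `mp3_44100_32/64` stands for `mp3_44100_32` and `mp3_44100_64`. A minimal sketch of expanding both lists and checking a format against the `stream` flag follows. The helper names are hypothetical, and the check is a client-side convenience rather than part of the API.

```python
def expand(shorthand: str) -> set[str]:
    """Expand e.g. 'pcm_8000/16000' into {'pcm_8000', 'pcm_16000'}."""
    prefix, _, variants = shorthand.rpartition("_")
    return {f"{prefix}_{v}" for v in variants.split("/")}


STREAMING_FORMATS: set[str] = set().union(
    *map(expand, [
        "mp3_22050_32",
        "mp3_24000_48",
        "mp3_44100_32/64/96/128/192",
        "pcm_8000/16000/22050/24000/32000/44100/48000",
        "ulaw_8000",
        "alaw_8000",
        "opus_48000_32/64/96/128/192",
    ])
)

# WAV output is listed only for non-streaming requests.
NON_STREAMING_FORMATS: set[str] = STREAMING_FORMATS | expand(
    "wav_8000/16000/22050/24000/32000/44100/48000"
)


def is_supported(response_format: str, stream: bool = True) -> bool:
    """Return True if the format appears in the list for the given mode."""
    allowed = STREAMING_FORMATS if stream else NON_STREAMING_FORMATS
    return response_format in allowed
```

Because `stream` defaults to `true`, a request for `wav_44100` also needs `"stream": false`.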

### Response

The response body is raw audio bytes with the `Content-Type` matching the requested format (e.g. `audio/mpeg`, `audio/wav`).
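
Since the body is raw bytes, the `Content-Type` header is the only indication of what was returned. The sketch below picks a file extension from it before writing to disk. `save_audio` is a hypothetical helper, and only the two media types named above are mapped; rejecting everything else (for instance a JSON error body) is a design choice of this example, not documented API behavior.

```python
from pathlib import Path

# Only the two media types named in this section are mapped; anything else
# is rejected rather than guessed at.
EXTENSIONS = {"audio/mpeg": ".mp3", "audio/wav": ".wav"}


def save_audio(body: bytes, content_type: str, stem: str) -> Path:
    """Write `body` to `<stem><ext>`, choosing ext from the Content-Type."""
    # Drop any parameters, e.g. "audio/mpeg; charset=binary" -> "audio/mpeg".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in EXTENSIONS:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    path = Path(f"{stem}{EXTENSIONS[media_type]}")
    path.write_bytes(body)
    return path
```

With an `httpx` response this could be called as `save_audio(response.content, response.headers["Content-Type"], "speech")`.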

### Examples

```bash
curl -X POST https://api.meshapi.ai/v1/audio/speech \
  -H "Authorization: Bearer rsk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam/bulbul:v3",
    "input": "Hello! This is a test of the Mesh API text-to-speech endpoint.",
    "voice": "JBFqnCBsd6RMkjVDRZzb",
    "response_format": "mp3_44100_128"
  }' \
  --output speech.mp3
```

```bash
curl -X POST https://api.meshapi.ai/v1/audio/speech \
  -H "Authorization: Bearer rsk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam/bulbul:v3",
    "input": "Hello! This is a test.",
    "voice": "JBFqnCBsd6RMkjVDRZzb",
    "stream": false,
    "response_format": "wav_44100"
  }' \
  --output speech.wav
```

```python
import httpx

response = httpx.post(
    "https://api.meshapi.ai/v1/audio/speech",
    headers={"Authorization": "Bearer rsk_..."},
    json={
        "model": "sarvam/bulbul:v3",
        "input": "Hello! This is a test.",
        "voice": "JBFqnCBsd6RMkjVDRZzb",
        "response_format": "mp3_44100_128",
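    },
    timeout=60.0,  # httpx defaults to 5 s, which longer inputs can exceed
)
# Fail loudly on an HTTP error instead of writing the error body to disk.
response.raise_for_status()

with open("speech.mp3", "wb") as f:
    f.write(response.content)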
```

```ts
import fs from "fs";

const res = await fetch("https://api.meshapi.ai/v1/audio/speech", {
  method: "POST",
  headers: {
    Authorization: "Bearer rsk_...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "sarvam/bulbul:v3",
    input: "Hello! This is a test.",
    voice: "JBFqnCBsd6RMkjVDRZzb",
    response_format: "mp3_44100_128",
  }),
});

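// Fail loudly on an HTTP error instead of writing the error body to disk.
if (!res.ok) {
  throw new Error(`Speech request failed: ${res.status} ${res.statusText}`);
}

const buffer = await res.arrayBuffer();
fs.writeFileSync("speech.mp3", Buffer.from(buffer));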
```

***

## WebSocket TTS Streaming

`WS /v1/audio/speech/stream/{voice_id}`

Stream text-to-speech in real time. You send text chunks as they become available (e.g. as an LLM streams tokens), and receive audio back chunk by chunk — minimising latency compared to the REST endpoint.

This endpoint proxies the ElevenLabs stream-input WebSocket API. The `voice_id` is part of the URL path.

### Authentication

Pass your Mesh API key in one of two ways:

* `Sec-WebSocket-Protocol: Bearer rsk_...` header
* `?api_key=rsk_...` query parameter

### Query parameters

| Parameter                  | Default            | Description                                                   |
| :------------------------- | :----------------- | :------------------------------------------------------------ |
| `model_id`                 | `sarvam/bulbul:v3` | ElevenLabs model to use.                                      |
| `output_format`            | `pcm_22050`        | Audio format for streamed audio. See supported formats below. |
| `language_code`            | —                  | BCP-47 language code.                                         |
| `enable_logging`           | —                  | Pass `false` to opt out of ElevenLabs request logging.        |
| `enable_ssml_parsing`      | —                  | Enable SSML in the input text.                                |
| `inactivity_timeout`       | —                  | Seconds of inactivity before the session closes (1–180).      |
| `sync_alignment`           | —                  | Return word-level alignment data with each audio chunk.       |
| `auto_mode`                | —                  | Optimise for low-latency, fully-automated generation.         |
| `apply_text_normalization` | —                  | `auto`, `on`, or `off`.                                       |
| `seed`                     | —                  | Reproducible seed (0–4294967295).                             |

**Supported output formats:** `mp3_22050_32`, `mp3_44100_32/64/96/128/192`, `pcm_16000/22050/24000/44100`, `ulaw_8000`
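
A connection URL combines the `voice_id` path segment, the query parameters above, and (if you use query-string auth) `api_key`. The following sketch builds one and applies the documented ranges client-side. `build_stream_url` is a hypothetical helper, and serializing booleans as lowercase `true`/`false` is an assumption based on the `enable_logging` row.

```python
from urllib.parse import quote, urlencode

WS_BASE = "wss://api.meshapi.ai/v1/audio/speech/stream"
WS_FORMATS = {
    "mp3_22050_32",
    "mp3_44100_32", "mp3_44100_64", "mp3_44100_96",
    "mp3_44100_128", "mp3_44100_192",
    "pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100",
    "ulaw_8000",
}


def build_stream_url(voice_id: str, api_key: str, **params) -> str:
    """Return the WebSocket URL for a voice (hypothetical helper)."""
    fmt = params.get("output_format", "pcm_22050")
    if fmt not in WS_FORMATS:
        raise ValueError(f"unsupported output_format: {fmt}")
    timeout = params.get("inactivity_timeout")
    if timeout is not None and not 1 <= timeout <= 180:
        raise ValueError("inactivity_timeout must be between 1 and 180")
    seed = params.get("seed")
    if seed is not None and not 0 <= seed <= 4294967295:
        raise ValueError("seed must be between 0 and 4294967295")

    # Assumption: booleans are sent as lowercase "true"/"false".
    query = {
        k: str(v).lower() if isinstance(v, bool) else v
        for k, v in params.items()
        if v is not None
    }
    query["api_key"] = api_key
    # urlencode percent-escapes characters such as "/" and ":" in values.
    return f"{WS_BASE}/{quote(voice_id, safe='')}?{urlencode(query)}"
```

Parameters left unset are omitted from the URL, so the server-side defaults in the table apply.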

### Message protocol

**Client → server (JSON frames)**

| Frame type             | Fields                                                                                           | Description                                        |
| :--------------------- | :----------------------------------------------------------------------------------------------- | :------------------------------------------------- |
| `initializeConnection` | `text: " "`, optional `voice_settings`, `generation_config`, `pronunciation_dictionary_locators` | Must be the first message sent.                    |
| `sendText`             | `text: string`, optional `try_trigger_generation`, `voice_settings`, `flush`                     | Send a chunk of text to synthesize.                |
| `closeConnection`      | `text: ""`                                                                                       | Signal end of input and close the session cleanly. |

**Server → client (JSON frames)**

| Frame type    | Fields                                         | Description                                              |
| :------------ | :--------------------------------------------- | :------------------------------------------------------- |
| `AudioOutput` | `audio: string` (base64), optional `alignment` | Audio chunk. Decode from base64 to get raw audio bytes.  |
| `FinalOutput` | `isFinal: true`                                | Signals all audio has been sent.                         |
| `error`       | `message: string`                              | Sent before closing on auth, format, or upstream errors. |

Any `xi-api-key` or `authorization` fields you include in client frames are stripped before the frame is forwarded upstream. The client authenticates with a Mesh API key only, and ElevenLabs credentials are never exposed to it.
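
The frame tables above can be sketched as two small functions, one producing the client frames in protocol order and one consuming server frames. Both function names are hypothetical, and recognizing an `error` frame by the presence of its `message` field is an assumption, since the tables do not show a separate type discriminator.

```python
import base64
import json
from typing import Iterable, Iterator


def client_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the JSON frames a client sends, in protocol order."""
    # initializeConnection: must be the first message.
    yield json.dumps({"text": " "})
    for chunk in chunks:
        # Skip empty chunks: an empty string would close the session early.
        if chunk:
            yield json.dumps({"text": chunk})  # sendText
    # closeConnection: signals end of input.
    yield json.dumps({"text": ""})


def handle_server_frame(raw: str, sink: bytearray) -> bool:
    """Append decoded audio to `sink`; return True once the stream is final."""
    frame = json.loads(raw)
    if "message" in frame and "audio" not in frame:
        # Assumption: an error frame is recognizable by its `message` field.
        raise RuntimeError(frame["message"])
    if frame.get("audio"):
        sink.extend(base64.b64decode(frame["audio"]))
    return bool(frame.get("isFinal"))
```

Each string from `client_frames` would be sent as one WebSocket text message, and each incoming message passed to `handle_server_frame` until it returns `True`.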

### Example

```js
const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/speech/stream/JBFqnCBsd6RMkjVDRZzb" +
    "?model_id=sarvam/bulbul:v3&output_format=mp3_44100_128&api_key=rsk_..."
);

ws.onopen = () => {
  // 1. Initialize the connection
  ws.send(JSON.stringify({ text: " " }));

  // 2. Stream text chunks
  ws.send(JSON.stringify({ text: "Hello, this is streamed text-to-speech." }));

  // 3. Close when done
  ws.send(JSON.stringify({ text: "" }));
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.audio) {
    // Decode base64 into raw bytes, ready to play or write to a file
    const audioBytes = Uint8Array.from(atob(frame.audio), (c) => c.charCodeAt(0));
    console.log(`Received ${audioBytes.length} bytes of audio`);
  }
  if (frame.isFinal) {
    console.log("Stream complete");
    ws.close();
  }
};
```

***