# Speech-to-Text

> Text-to-speech, speech-to-text, voice management, and real-time streaming audio APIs.

# Audio Generation

Mesh API provides a full suite of audio endpoints — convert text to speech, transcribe audio files, stream TTS/STT in real time, and browse available voices — all through a single API key.

All endpoints share the same base URL: `https://api.meshapi.ai/v1/audio`

**Auth:** Send `Authorization: Bearer rsk_<your-key>` on all REST requests. WebSocket endpoints accept the key via the `Sec-WebSocket-Protocol: Bearer rsk_...` header, or via the `?api_key=rsk_...` or `?token=rsk_...` query parameter.

***

## Speech-to-Text

`POST /v1/audio/transcriptions`

Transcribe an audio file. The provider is resolved automatically from the model name. You can supply audio as a file upload, a public URL, or a cloud storage URL.

This endpoint uses `multipart/form-data` — not JSON.

### Form fields

| Field                    | Type      | Description                                                                                |
| :----------------------- | :-------- | :----------------------------------------------------------------------------------------- |
| `model`                  | string    | **Required.** Model ID (e.g. `elevenlabs/scribe_v2`, `sarvam/saaras:v2`).                  |
| `file`                   | file      | Audio file to transcribe. Required unless `source_url` or `cloud_storage_url` is provided. |
| `source_url`             | string    | Public URL to an audio file. Alternative to uploading a file.                              |
| `cloud_storage_url`      | string    | Cloud storage URL (S3, GCS). Alternative to uploading a file.                              |
| `language_code`          | string    | BCP-47 language code (e.g. `en-US`). Leave blank for auto-detection.                       |
| `diarize`                | boolean   | Identify and label different speakers.                                                     |
| `num_speakers`           | integer   | Expected number of speakers (1–32). Helps diarization accuracy.                            |
| `timestamps_granularity` | string    | `word` or `character` — granularity of returned timestamps.                                |
| `tag_audio_events`       | boolean   | Detect non-speech events like laughter, applause, etc.                                     |
| `diarization_threshold`  | float     | Speaker change sensitivity (0.1–0.4).                                                      |
| `additional_formats`     | string    | JSON-encoded list of extra output formats (ElevenLabs-specific).                           |
| `file_format`            | string    | Explicit audio format hint if the file extension is ambiguous.                             |
| `temperature`            | float     | Sampling temperature (0–2).                                                                |
| `seed`                   | integer   | Reproducible seed (0–2147483647).                                                          |
| `use_multi_channel`      | boolean   | Process each audio channel independently.                                                  |
| `webhook`                | boolean   | Return the result via webhook instead of inline.                                           |
| `webhook_id`             | string    | ID of a pre-configured ElevenLabs webhook.                                                 |
| `webhook_metadata`       | string    | Custom metadata string to include with the webhook callback.                               |
| `keyterms`               | string\[] | Domain-specific terms to boost recognition accuracy.                                       |
| `entity_detection`       | string    | Enable named entity detection.                                                             |
| `entity_redaction`       | string    | Redact specific entity types from the transcript.                                          |
| `entity_redaction_mode`  | string    | How redacted entities are shown: `replace` or `remove`.                                    |
| `no_verbatim`            | boolean   | Clean up filler words and disfluencies.                                                    |
| `detect_speaker_roles`   | boolean   | Classify speakers by role (e.g. interviewer vs. interviewee).                              |
| `enable_logging`         | boolean   | Default `true`. Pass `false` to opt out of ElevenLabs request logging.                     |
| `with_timestamps`        | boolean   | Sarvam-specific: include word-level timestamps.                                            |
| `debug_mode`             | boolean   | Sarvam-specific: return extra debug information.                                           |
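The constraints in the table can be checked on the client before uploading. The following is a minimal sketch, not part of the Mesh API: `build_transcription_fields` and its `has_file` flag are hypothetical names, and the only rules it applies are the ones stated above (an audio source is required, and four fields have numeric ranges). Booleans are encoded as `"true"`/`"false"`, matching the request examples on this page. List-valued fields such as `keyterms` are rejected because their wire encoding is not specified here.

```python
from typing import Any, Optional

# Numeric ranges copied from the form-field table.
FIELD_RANGES = {
    "num_speakers": (1, 32),
    "diarization_threshold": (0.1, 0.4),
    "temperature": (0, 2),
    "seed": (0, 2147483647),
}


def build_transcription_fields(
    model: str,
    *,
    has_file: bool = False,  # hypothetical flag: True when a file part will be attached
    source_url: Optional[str] = None,
    cloud_storage_url: Optional[str] = None,
    **options: Any,
) -> dict[str, str]:
    """Validate documented constraints and return text form fields."""
    if not model:
        raise ValueError("model is required")
    if not (has_file or source_url or cloud_storage_url):
        raise ValueError("provide a file, source_url, or cloud_storage_url")

    fields = {"model": model}
    if source_url:
        fields["source_url"] = source_url
    if cloud_storage_url:
        fields["cloud_storage_url"] = cloud_storage_url

    for name, value in options.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            # The encoding of list fields is not documented, so do not guess.
            raise TypeError(f"{name}: list values are not handled by this sketch")
        if name in FIELD_RANGES:
            low, high = FIELD_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
        # Multipart form values are text, so booleans become "true"/"false".
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields
```

The returned dictionary can be passed as the `data` argument of an HTTP client's multipart request, with the audio file attached separately.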

### Response

```json
{ "text": "The transcribed text goes here." }
```

### Examples

```bash
curl -X POST https://api.meshapi.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer rsk_..." \
  -F "model=elevenlabs/scribe_v2" \
  -F "file=@recording.mp3" \
  -F "language_code=en-US" \
  -F "diarize=true"
```

```bash
curl -X POST https://api.meshapi.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer rsk_..." \
  -F "model=elevenlabs/scribe_v2" \
  -F "source_url=https://example.com/meeting.mp3" \
  -F "diarize=true" \
  -F "num_speakers=3"
```

```python
import httpx

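with open("recording.mp3", "rb") as f:
    response = httpx.post(
        "https://api.meshapi.ai/v1/audio/transcriptions",
        headers={"Authorization": "Bearer rsk_..."},
        # Form values are sent as text, so booleans are passed as "true"/"false".
        data={
            "model": "elevenlabs/scribe_v2",
            "language_code": "en-US",
            "diarize": "true",
        },
        files={"file": ("recording.mp3", f, "audio/mpeg")},
        # httpx defaults to a 5-second timeout, which long recordings can exceed.
        timeout=60.0,
    )

# Raise on 4xx/5xx instead of failing later with a KeyError on "text".
response.raise_for_status()
print(response.json()["text"])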
```

```ts
import fs from "fs";

const form = new FormData();
form.append("model", "elevenlabs/scribe_v2");
form.append("language_code", "en-US");
form.append("file", new Blob([fs.readFileSync("recording.mp3")]), "recording.mp3");

const res = await fetch("https://api.meshapi.ai/v1/audio/transcriptions", {
  method: "POST",
  headers: { Authorization: "Bearer rsk_..." },
  body: form,
});

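// Surface HTTP errors (401, 402, 422, 429) instead of reading `text` from an error body.
if (!res.ok) {
  throw new Error(`Transcription failed: ${res.status} ${await res.text()}`);
}

const { text } = await res.json();
console.log(text);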
```

***

## Transcribe and Translate

`POST /v1/audio/transcriptions/translate`

Transcribe audio and translate the result directly to English in a single step. Uses Sarvam models by default.

This endpoint uses `multipart/form-data`.

### Form fields

| Field    | Type   | Default            | Description                                   |
| :------- | :----- | :----------------- | :-------------------------------------------- |
| `model`  | string | `sarvam/saaras:v2` | Model to use. Must support translation.       |
| `file`   | file   | —                  | Audio file. **Required.**                     |
| `prompt` | string | —                  | Optional context prompt to guide translation. |

### Response

```json
{ "text": "The English translation of the transcribed audio." }
```

### Example

```bash
curl -X POST https://api.meshapi.ai/v1/audio/transcriptions/translate \
  -H "Authorization: Bearer rsk_..." \
  -F "model=sarvam/saaras:v2" \
  -F "file=@hindi_recording.wav"
```

If the selected model doesn't support translation, the API returns a `422` error. Check `GET /v1/models` to confirm a model's capabilities.

***

## WebSocket Real-Time STT

`WS /v1/audio/transcriptions/realtime`

Transcribe audio in real time. Send raw audio chunks as they are captured (e.g. from a microphone) and receive partial and final transcripts as they are produced.

This proxies ElevenLabs' Scribe v2 Realtime endpoint.

### Authentication

Pass your Mesh API key in one of these ways:

* `Sec-WebSocket-Protocol: Bearer rsk_...` header
* `?api_key=rsk_...` query parameter
* `?token=rsk_...` query parameter

### Query parameters

| Parameter                    | Default     | Description                                                    |
| :--------------------------- | :---------- | :------------------------------------------------------------- |
| `model`                      | —           | **Required.** Model ID (e.g. `scribe_v1`).                     |
| `audio_format`               | `pcm_16000` | Format of the audio chunks you will send.                      |
| `language_code`              | —           | BCP-47 language code. Omit for auto-detection.                 |
| `commit_strategy`            | `manual`    | `manual` — you commit explicitly. `auto` — VAD-driven commits. |
| `include_timestamps`         | `false`     | Include word-level timestamps in transcripts.                  |
| `include_language_detection` | `false`     | Include detected language in each transcript event.            |
| `keyterms`                   | —           | Domain-specific terms to boost recognition.                    |
| `no_verbatim`                | `false`     | Remove filler words and disfluencies.                          |
| `vad_silence_threshold_secs` | —           | Silence duration before VAD triggers a commit (0.3–3s).        |
| `vad_threshold`              | —           | VAD speech probability threshold (0.1–0.9).                    |
| `min_speech_duration_ms`     | —           | Minimum speech segment length (50–2000 ms).                    |
| `min_silence_duration_ms`    | —           | Minimum silence to trigger a split (50–2000 ms).               |
| `enable_logging`             | `true`      | Pass `false` to opt out of ElevenLabs logging.                 |

**Supported audio formats:** `pcm_8000`, `pcm_16000`, `pcm_22050`, `pcm_24000`, `pcm_44100`, `pcm_48000`, `ulaw_8000`
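Building the connection URL by hand is error-prone once several query parameters are involved. The sketch below assembles it with the standard library and rejects values outside the documented sets. `build_realtime_url` is a hypothetical helper, and the `wss://api.meshapi.ai` host is taken from the example on this page. It passes the key as `api_key`. The header-based option may be preferable where URLs end up in logs.

```python
from typing import Any
from urllib.parse import urlencode

# Values listed under "Supported audio formats".
SUPPORTED_AUDIO_FORMATS = {
    "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000",
    "pcm_44100", "pcm_48000", "ulaw_8000",
}
REALTIME_URL = "wss://api.meshapi.ai/v1/audio/transcriptions/realtime"


def build_realtime_url(
    model: str,
    api_key: str,
    audio_format: str = "pcm_16000",
    commit_strategy: str = "manual",
    **params: Any,
) -> str:
    """Return the WebSocket URL with validated, URL-encoded query parameters."""
    if audio_format not in SUPPORTED_AUDIO_FORMATS:
        raise ValueError(f"unsupported audio_format: {audio_format}")
    if commit_strategy not in {"manual", "auto"}:
        raise ValueError("commit_strategy must be 'manual' or 'auto'")

    query = {
        "model": model,
        "audio_format": audio_format,
        "commit_strategy": commit_strategy,
    }
    for name, value in params.items():
        if value is None:
            continue
        # Render booleans as lowercase text, as in the defaults column.
        query[name] = str(value).lower() if isinstance(value, bool) else str(value)
    query["api_key"] = api_key
    return f"{REALTIME_URL}?{urlencode(query)}"
```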

### Message protocol

**Client → server (JSON frames)**

Send `input_audio_chunk` frames with base64-encoded audio. This is the only message type the server forwards upstream — any other message type is silently dropped.

```json
{
  "message_type": "input_audio_chunk",
  "audio_base_64": "<base64-encoded PCM audio>",
  "sample_rate": 16000,
  "commit": false
}
```

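Set `"commit": true` to commit the buffered audio and request a final transcript when using `commit_strategy=manual`. With `commit_strategy=auto`, voice activity detection (VAD) decides when to commit.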
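As an illustration of the client frame format, the sketch below splits a PCM buffer into fixed-duration chunks and serializes each one as an `input_audio_chunk` frame. The helper names are hypothetical, and the chunk-size arithmetic assumes 16-bit mono PCM (2 bytes per sample), which this page does not state explicitly.

```python
import base64
import json


def audio_chunk_frame(pcm: bytes, sample_rate: int = 16000, commit: bool = False) -> str:
    """Serialize raw audio bytes as an input_audio_chunk JSON frame."""
    return json.dumps(
        {
            "message_type": "input_audio_chunk",
            "audio_base_64": base64.b64encode(pcm).decode("ascii"),
            "sample_rate": sample_rate,
            "commit": commit,
        }
    )


def split_pcm(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> list[bytes]:
    """Split a buffer into chunks of roughly chunk_ms milliseconds each."""
    # Assumption: 16-bit mono PCM, so each sample occupies 2 bytes.
    step = sample_rate * 2 * chunk_ms // 1000
    return [pcm[i : i + step] for i in range(0, len(pcm), step)]


def frames_for(pcm: bytes, sample_rate: int = 16000) -> list[str]:
    """Build frames for a whole buffer, committing on the last chunk."""
    chunks = split_pcm(pcm, sample_rate)
    return [
        audio_chunk_frame(chunk, sample_rate, commit=(i == len(chunks) - 1))
        for i, chunk in enumerate(chunks)
    ]
```

Each returned string can be sent as one text message over the WebSocket connection.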

**Server → client (JSON frames)**

| Frame type                             | Description                                        |
| :------------------------------------- | :------------------------------------------------- |
| `session_started`                      | Sent once after connection is established.         |
| `partial_transcript`                   | In-progress transcript — text may still change.    |
| `committed_transcript`                 | Final transcript for the committed audio segment.  |
| `committed_transcript_with_timestamps` | Final transcript including word-level timing data. |
| `error`                                | Sent before closing on auth or upstream errors.    |
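A client usually routes incoming frames by type. The following is a minimal sketch of such a dispatcher. It relies on the `message_type` and `text` fields used in the JavaScript example on this page. The layout of `error` frames is not documented here, so the sketch surfaces the whole frame rather than guessing at field names. Both committed frame types are treated as final. If a session emits both for the same segment, a real client would need to keep only one.

```python
import json

FINAL_TYPES = ("committed_transcript", "committed_transcript_with_timestamps")


def handle_frame(raw: str, state: dict) -> None:
    """Update transcript state from one server frame.

    `state` is a plain dict with keys "partial" (str) and "final" (list of str).
    """
    frame = json.loads(raw)
    kind = frame.get("message_type")
    if kind == "partial_transcript":
        # Partial text may still change, so overwrite rather than append.
        state["partial"] = frame.get("text", "")
    elif kind in FINAL_TYPES:
        state["final"].append(frame.get("text", ""))
        state["partial"] = ""
    elif kind == "error":
        # Error frame fields are not documented, so report the raw frame.
        raise RuntimeError(f"server error frame: {frame}")
    # session_started and unknown frame types are ignored in this sketch.
```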

### Example

```js
const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/transcriptions/realtime" +
    "?model=scribe_v1&audio_format=pcm_16000&commit_strategy=manual&api_key=rsk_..."
);

ws.onopen = () => {
  console.log("Connected. Start sending audio chunks.");
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.message_type === "partial_transcript") {
    process.stdout.write("\r" + frame.text);
  } else if (frame.message_type === "committed_transcript") {
    console.log("\nFinal:", frame.text);
  }
};

// Send a PCM audio chunk (from a microphone, for example)
function sendAudioChunk(pcmBuffer) {
  const b64 = Buffer.from(pcmBuffer).toString("base64");
  ws.send(
    JSON.stringify({
      message_type: "input_audio_chunk",
      audio_base_64: b64,
      sample_rate: 16000,
    })
  );
}

// Commit the buffered audio to get a final transcript
function commit() {
  ws.send(
    JSON.stringify({
      message_type: "input_audio_chunk",
      audio_base_64: "",
      sample_rate: 16000,
      commit: true,
    })
  );
}
```

***

## Voice Management

### List voices

`GET /v1/audio/voices`

Browse voices available on your ElevenLabs account. Supports pagination, full-text search, and filtering.

| Query parameter       | Description                                                          |
| :-------------------- | :------------------------------------------------------------------- |
| `search`              | Full-text search across voice name and description.                  |
| `voice_type`          | Filter by type: `premade`, `cloned`, `generated`, or `professional`. |
| `category`            | Filter by category.                                                  |
| `voice_ids`           | Comma-separated list of specific voice IDs to fetch.                 |
| `page_size`           | Number of results per page (1–100).                                  |
| `next_page_token`     | Token from the previous response to fetch the next page.             |
| `sort`                | Field to sort by.                                                    |
| `sort_direction`      | `asc` or `desc`.                                                     |
| `include_total_count` | Include the total number of matching voices in the response.         |

```bash
curl "https://api.meshapi.ai/v1/audio/voices?search=rachel&voice_type=premade&page_size=10" \
  -H "Authorization: Bearer rsk_..."
```
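To walk through every page of results, a client can feed `next_page_token` back into the next request until no token is returned. The sketch below shows that loop with the HTTP call injected as a hypothetical `fetch_page` callable, which keeps it independent of any particular client library. The response keys `voices` and `next_page_token` are assumptions, since the response body is not shown on this page.

```python
from typing import Any, Callable, Iterator


def iter_voices(
    fetch_page: Callable[[dict], dict],  # hypothetical: performs GET /v1/audio/voices
    page_size: int = 100,
    **filters: Any,
) -> Iterator[dict]:
    """Yield voices across pages by following next_page_token."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")

    params = {"page_size": page_size, **filters}
    while True:
        page = fetch_page(params)
        # Assumed response shape: {"voices": [...], "next_page_token": "..."}.
        yield from page.get("voices", [])
        token = page.get("next_page_token")
        if not token:
            return
        params = {**params, "next_page_token": token}
```

In practice `fetch_page` would send `params` as the query string with the `Authorization` header and return the decoded JSON body.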

### Get a single voice

`GET /v1/audio/voices/{voice_id}`

Fetch metadata for a specific voice by its ElevenLabs voice ID.

```bash
curl https://api.meshapi.ai/v1/audio/voices/JBFqnCBsd6RMkjVDRZzb \
  -H "Authorization: Bearer rsk_..."
```

***

## Error handling

All endpoints use standard HTTP status codes. Common cases:

| Code  | Meaning                                                 |
| :---- | :------------------------------------------------------ |
| `401` | Missing or invalid API key.                             |
| `402` | Insufficient account balance.                           |
| `422` | Validation error. Check the response body for details.  |
| `429` | Rate limit exceeded.                                    |

WebSocket sessions send a JSON `error` frame and then close the connection with code `1000`.
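For the REST status codes in the table, a client typically needs to decide whether a retry is worthwhile. The sketch below maps each documented code to its meaning and a retry flag. Treating only `429` as retryable and the exponential backoff schedule are client-side policy assumptions, not behavior specified by the API.

```python
# Meanings copied from the status code table.
ERROR_MEANINGS = {
    401: "Missing or invalid API key.",
    402: "Insufficient account balance.",
    422: "Validation error.",
    429: "Rate limit exceeded.",
}

# Assumption: only rate limiting is worth retrying without changing the request.
RETRYABLE = {429}


def describe_error(status_code: int) -> tuple[str, bool]:
    """Return (meaning, should_retry) for an HTTP status code."""
    meaning = ERROR_MEANINGS.get(status_code, "Unexpected status.")
    return meaning, status_code in RETRYABLE


def backoff_delays(attempts: int = 4, base: float = 1.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * 2**i for i in range(attempts)]
```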