Audio Generation

Mesh API provides a full suite of audio endpoints — convert text to speech, transcribe audio files, stream TTS/STT in real time, and browse available voices — all through a single API key.

All endpoints share the same base URL: https://api.meshapi.ai/v1/audio

Auth: send Authorization: Bearer rsk_<your-key> on all REST requests. WebSocket endpoints accept the key via a Sec-WebSocket-Protocol: Bearer rsk_... header, or via the ?api_key=rsk_... or ?token=rsk_... query parameter.

Speech-to-Text

POST /v1/audio/transcriptions

Transcribe an audio file. The provider is resolved automatically from the model name. You can supply audio as a file upload, a public URL, or a cloud storage URL.

This endpoint uses multipart/form-data — not JSON.

Multiple STT providers are supported, e.g. elevenlabs/scribe_v1 (default), elevenlabs/scribe_v2, and sarvam/saaras:v2. To list transcription-capable models, use the search endpoint filtered by modality: GET /v1/models/search?input_modality=audio&output_modality=text.
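The modality filter above can be sketched in Node.js (18+, for the global fetch). The helper names modelSearchUrl and listTranscriptionModels are hypothetical, and because the response shape of the search endpoint is not documented on this page, the sketch simply returns whatever JSON comes back.

```javascript
// Hypothetical helper: build the model-search URL with the two modality filters.
function modelSearchUrl(inputModality, outputModality) {
  const url = new URL("https://api.meshapi.ai/v1/models/search");
  url.searchParams.set("input_modality", inputModality);
  url.searchParams.set("output_modality", outputModality);
  return url.toString();
}

// Fetch transcription-capable models (audio in, text out).
// The response body is returned unparsed beyond JSON, since its shape is an unknown here.
async function listTranscriptionModels(apiKey) {
  const res = await fetch(modelSearchUrl("audio", "text"), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`Model search failed: ${res.status}`);
  return res.json();
}
```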

Form fields

Field	Type	Description
`model`	string	Required. Model ID (e.g. `elevenlabs/scribe_v2`, `sarvam/saaras:v2`).
`file`	file	Audio file to transcribe. Required unless `source_url` or `cloud_storage_url` is provided.
`source_url`	string	Public URL to an audio file. Alternative to uploading a file.
`cloud_storage_url`	string	Cloud storage URL (S3, GCS). Alternative to uploading a file.
`language_code`	string	BCP-47 language code (e.g. `en-US`). Leave blank for auto-detection.
`diarize`	boolean	Identify and label different speakers.
`num_speakers`	integer	Expected number of speakers (1–32). Helps diarization accuracy.
`timestamps_granularity`	string	`word` or `character` — granularity of returned timestamps.
`tag_audio_events`	boolean	Detect non-speech events like laughter, applause, etc.
`diarization_threshold`	float	Speaker change sensitivity (0.1–0.4).
`additional_formats`	string	JSON-encoded list of extra output formats (ElevenLabs-specific).
`file_format`	string	Explicit audio format hint if the file extension is ambiguous.
`temperature`	float	Sampling temperature (0–2).
`seed`	integer	Reproducible seed (0–2147483647).
`use_multi_channel`	boolean	Process each audio channel independently.
`webhook`	boolean	Return the result via webhook instead of inline.
`webhook_id`	string	ID of a pre-configured ElevenLabs webhook.
`webhook_metadata`	string	Custom metadata string to include with the webhook callback.
`keyterms`	string[]	Domain-specific terms to boost recognition accuracy.
`entity_detection`	string	Enable named entity detection.
`entity_redaction`	string	Redact specific entity types from the transcript.
`entity_redaction_mode`	string	How redacted entities are shown: `replace` or `remove`.
`no_verbatim`	boolean	Clean up filler words and disfluencies.
`detect_speaker_roles`	boolean	Classify speakers by role (e.g. interviewer vs. interviewee).
`enable_logging`	boolean	Default `true`. Pass `false` to opt out of ElevenLabs request logging.
`with_timestamps`	boolean	Sarvam-specific: include word-level timestamps.
`debug_mode`	boolean	Sarvam-specific: return extra debug information.

Response

{ "text": "The transcribed text goes here." }

Examples

curl (file upload)

$ curl -X POST https://api.meshapi.ai/v1/audio/transcriptions \
>   -H "Authorization: Bearer rsk_..." \
>   -F "model=elevenlabs/scribe_v2" \
>   -F "file=@recording.mp3" \
>   -F "language_code=en-US" \
>   -F "diarize=true"
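The same request can be sketched in Node.js (18+, which ships global fetch, FormData, and Blob), here using source_url instead of a file upload. The helper names buildTranscriptionForm and transcribe are hypothetical, the example.com URL is a placeholder, and the sketch assumes boolean fields are sent as the strings "true" and "false", as in the curl example above.

```javascript
// Hypothetical helper: build the multipart body for POST /v1/audio/transcriptions.
// At least one audio source (file, sourceUrl, or cloudStorageUrl) must be given.
function buildTranscriptionForm({ model, file, fileName, sourceUrl, cloudStorageUrl, ...options }) {
  if (!model) throw new Error("model is required");
  if (file === undefined && sourceUrl === undefined && cloudStorageUrl === undefined) {
    throw new Error("provide file, sourceUrl, or cloudStorageUrl");
  }
  const form = new FormData();
  form.append("model", model);
  if (file !== undefined) form.append("file", new Blob([file]), fileName ?? "audio");
  if (sourceUrl !== undefined) form.append("source_url", sourceUrl);
  if (cloudStorageUrl !== undefined) form.append("cloud_storage_url", cloudStorageUrl);
  // Remaining options (language_code, diarize, ...) are passed through as strings.
  for (const [key, value] of Object.entries(options)) form.append(key, String(value));
  return form;
}

// Send the request and return the transcript text from the documented response.
async function transcribe(apiKey, params) {
  const res = await fetch("https://api.meshapi.ai/v1/audio/transcriptions", {
    method: "POST",
    // No Content-Type header: fetch sets the multipart boundary itself.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(params),
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status} ${await res.text()}`);
  return (await res.json()).text;
}
```

A call might look like transcribe(key, { model: "elevenlabs/scribe_v2", sourceUrl: "https://example.com/recording.mp3", diarize: true }).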

Transcribe and Translate

POST /v1/audio/transcriptions/translate

Transcribe audio and translate the result directly to English in a single step. Uses Sarvam models by default.

This endpoint uses multipart/form-data.

Form fields

Field	Type	Default	Description
`model`	string	`sarvam/saaras:v2`	Model to use. Must support translation.
`file`	file	—	Audio file. Required.
`prompt`	string	—	Optional context prompt to guide translation.

Response

{ "text": "The English translation of the transcribed audio." }

Example

$ curl -X POST https://api.meshapi.ai/v1/audio/transcriptions/translate \
>   -H "Authorization: Bearer rsk_..." \
>   -F "model=sarvam/saaras:v2" \
>   -F "file=@hindi_recording.wav"

If the selected model doesn’t support translation, the API returns a 422 error. Check GET /v1/models to confirm a model’s capabilities.

WebSocket Real-Time STT

WS /v1/audio/transcriptions/realtime

Transcribe audio in real time. Send raw audio chunks as they are captured (e.g. from a microphone) and receive partial and final transcripts as they are produced.

The wire protocol is selected automatically from the model you pass. Standard streaming models (e.g. openai/whisper-large-v3) use an OpenAI-compatible frame protocol. ElevenLabs models (elevenlabs/scribe_v2_realtime) use ElevenLabs’ Scribe v2 realtime frames — see the ElevenLabs models subsection below.

Authentication

Pass your Mesh API key in one of these ways:

Sec-WebSocket-Protocol: Bearer rsk_... header
?api_key=rsk_... query parameter
?token=rsk_... query parameter

Query parameters

Parameter	Default	Description
`model`	—	Required. Model ID (e.g. `openai/whisper-large-v3`, `elevenlabs/scribe_v2_realtime`).
`audio_format`	`pcm_16000`	Format of the audio chunks you will send.
`language_code`	—	BCP-47 language code. Omit for auto-detection.
`commit_strategy`	`manual`	`manual` — you commit explicitly. `auto` — VAD-driven commits.
`include_timestamps`	`false`	Include word-level timestamps in transcripts.
`include_language_detection`	`false`	Include detected language in each transcript event.
`keyterms`	—	Domain-specific terms to boost recognition.
`no_verbatim`	`false`	Remove filler words and disfluencies.
`vad_silence_threshold_secs`	—	Silence duration before VAD triggers a commit (0.3–3s).
`vad_threshold`	—	VAD speech probability threshold (0.1–0.9).
`min_speech_duration_ms`	—	Minimum speech segment length (50–2000 ms).
`min_silence_duration_ms`	—	Minimum silence to trigger a split (50–2000 ms).
`enable_logging`	`true`	Pass `false` to opt out of request logging (ElevenLabs models).

Supported audio formats: pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000
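As a rough illustration of how the query parameters above combine, the following Node.js sketch assembles the connection URL and rejects an unsupported audio_format before connecting. buildRealtimeSttUrl is a hypothetical helper, not part of any SDK. Note that URLSearchParams percent-encodes the slash in the model ID; the sketch assumes the server decodes query values in the usual way.

```javascript
const SUPPORTED_FORMATS = [
  "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000",
  "pcm_44100", "pcm_48000", "ulaw_8000",
];

// Hypothetical helper: build the realtime STT WebSocket URL.
// `rest` may carry any other documented parameter, e.g. commit_strategy.
function buildRealtimeSttUrl(apiKey, { model, audio_format = "pcm_16000", ...rest }) {
  if (!model) throw new Error("model is required");
  if (!SUPPORTED_FORMATS.includes(audio_format)) {
    throw new Error(`unsupported audio_format: ${audio_format}`);
  }
  const url = new URL("wss://api.meshapi.ai/v1/audio/transcriptions/realtime");
  url.searchParams.set("model", model);
  url.searchParams.set("audio_format", audio_format);
  for (const [key, value] of Object.entries(rest)) {
    url.searchParams.set(key, String(value));
  }
  // The key could instead be sent via the Sec-WebSocket-Protocol header.
  url.searchParams.set("api_key", apiKey);
  return url.toString();
}
```

For example, buildRealtimeSttUrl(key, { model: "elevenlabs/scribe_v2_realtime", commit_strategy: "auto", vad_silence_threshold_secs: 1.5 }) yields a URL that can be passed straight to new WebSocket(...).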

Standard streaming models

Standard streaming models — e.g. openai/whisper-large-v3 — use an OpenAI-compatible frame protocol.

Client → server (JSON frames)

Frame type	Payload	Description
`input_audio_buffer.append`	`{ "audio": "<base64>" }`	Append a chunk of audio. `audio` is base64-encoded PCM `s16le`, 16 kHz, mono.
`input_audio_buffer.commit`	`{}`	Commit the buffered audio and request a final transcript.

{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM s16le 16 kHz mono>"
}

Server → client (JSON frames)

Frame type	Payload	Description
`conversation.item.input_audio_transcription.delta`	`{ "delta": "..." }`	Interim transcript chunk — text may still change.
`conversation.item.input_audio_transcription.completed`	`{ "transcript": "..." }`	Final transcript for the committed audio segment.

Example

const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/transcriptions/realtime" +
    "?model=openai/whisper-large-v3&api_key=rsk_..."
);

ws.onopen = () => {
  console.log("Connected. Start sending audio chunks.");
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.type === "conversation.item.input_audio_transcription.delta") {
    process.stdout.write("\r" + frame.delta);
  } else if (
    frame.type === "conversation.item.input_audio_transcription.completed"
  ) {
    console.log("\nFinal:", frame.transcript);
  }
};

// Send a PCM audio chunk (s16le, 16 kHz, mono; from a microphone, for example)
function sendAudioChunk(pcmBuffer) {
  const b64 = Buffer.from(pcmBuffer).toString("base64");
  ws.send(
    JSON.stringify({
      type: "input_audio_buffer.append",
      audio: b64,
    })
  );
}

// Commit the buffered audio to get a final transcript
function commit() {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
}
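When the audio is already in memory (a decoded file rather than a live microphone), the append and commit frames can be generated in one pass. The sketch below is a minimal Node.js illustration; pcmToFrames is a hypothetical helper, and the 100 ms chunk size is an arbitrary choice, not a documented requirement.

```javascript
// Split raw PCM (s16le, 16 kHz, mono) into JSON frames:
// one input_audio_buffer.append per chunk, then a single commit.
function* pcmToFrames(pcm, chunkMs = 100) {
  const bytesPerMs = (16000 * 2) / 1000; // 16 kHz x 2 bytes per sample = 32 bytes/ms
  const chunkBytes = chunkMs * bytesPerMs;
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    const chunk = pcm.subarray(offset, offset + chunkBytes);
    yield JSON.stringify({
      type: "input_audio_buffer.append",
      audio: Buffer.from(chunk).toString("base64"),
    });
  }
  yield JSON.stringify({ type: "input_audio_buffer.commit" });
}
```

With an open socket, the frames can be sent with for (const frame of pcmToFrames(pcm)) ws.send(frame);.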

ElevenLabs models

ElevenLabs models (elevenlabs/scribe_v2_realtime) use ElevenLabs’ Scribe v2 realtime frames.

Client → server (JSON frames)

Send input_audio_chunk frames with base64-encoded audio. This is the only message type the server forwards upstream — any other message type is silently dropped.

{
  "message_type": "input_audio_chunk",
  "audio_base_64": "<base64-encoded PCM audio>",
  "sample_rate": 16000,
  "commit": false
}

Set "commit": true to commit the buffered audio explicitly when using commit_strategy=manual. With commit_strategy=auto, commits are driven by VAD instead.

Server → client (JSON frames)

Frame type	Description
`session_started`	Sent once after connection is established.
`partial_transcript`	In-progress transcript — text may still change.
`committed_transcript`	Final transcript for the committed audio segment.
`committed_transcript_with_timestamps`	Final transcript including word-level timing data.
`error`	Sent before closing on auth or upstream errors.

const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/transcriptions/realtime" +
    "?model=elevenlabs/scribe_v2_realtime&audio_format=pcm_16000&commit_strategy=manual&api_key=rsk_..."
);

ws.onopen = () => {
  console.log("Connected. Start sending audio chunks.");
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.message_type === "partial_transcript") {
    process.stdout.write("\r" + frame.text);
  } else if (frame.message_type === "committed_transcript") {
    console.log("\nFinal:", frame.text);
  }
};

// Send a PCM audio chunk (from a microphone, for example)
function sendAudioChunk(pcmBuffer) {
  const b64 = Buffer.from(pcmBuffer).toString("base64");
  ws.send(
    JSON.stringify({
      message_type: "input_audio_chunk",
      audio_base_64: b64,
      sample_rate: 16000,
    })
  );
}

// Commit the buffered audio to get a final transcript
function commit() {
  ws.send(
    JSON.stringify({
      message_type: "input_audio_chunk",
      audio_base_64: "",
      sample_rate: 16000,
      commit: true,
    })
  );
}
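To keep a running transcript rather than printing each frame, the server frames can be folded into a small state object. This is a sketch only: createTranscriptState and applyFrame are hypothetical names, the text field is taken from the example above, and the payloads of the error and committed_transcript_with_timestamps frames are not documented here, so the former is surfaced raw and the latter is ignored.

```javascript
// Hypothetical state holder: committed segments plus the current partial text.
function createTranscriptState() {
  return { committed: [], partial: "" };
}

// Fold one raw server frame (a JSON string) into the state.
function applyFrame(state, rawFrame) {
  const frame = JSON.parse(rawFrame);
  switch (frame.message_type) {
    case "partial_transcript":
      state.partial = frame.text; // may still change, so overwrite
      break;
    case "committed_transcript":
      state.committed.push(frame.text); // final for this segment
      state.partial = "";
      break;
    case "error":
      // Error payload fields are an unknown here; include the raw frame.
      throw new Error(`Realtime STT error: ${rawFrame}`);
    default:
      break; // session_started and any frame type not handled above
  }
  return state;
}
```

Inside ws.onmessage this would be called as applyFrame(state, data), with state.committed.join(" ") giving the text committed so far.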

Voice Management

List voices

GET /v1/audio/voices

Returns a unified voice catalog spanning every TTS model brand. The list can be large, so filter by brand or model to narrow it to the voices you can actually use with the model you plan to call.

Query parameter	Description
`brand`	Filter by model brand — e.g. `elevenlabs`, `hexgrad`, `canopylabs`. This is the primary way to scope the catalog.
`model`	Filter by a specific model id (e.g. `hexgrad/kokoro-82m`) — returns only the voices usable with that model.
`search`	Substring match on voice id / name.

Each voice is returned with voice_id, name, brand, provider, model, language, gender, and preview_url. The response also carries count and a truncated flag (true when the list was capped before the full set was fetched).

Pass a voice’s voice_id as the voice field on POST /v1/audio/speech (or the {voice_id} path segment on the streaming endpoint) — and use the same model the voice is listed under. Filtering by brand/model keeps you from picking a voice that isn’t valid for the model you call.

# Voices for one brand, narrowed by a name search
$ curl "https://api.meshapi.ai/v1/audio/voices?brand=elevenlabs&search=rachel" \
>   -H "Authorization: Bearer rsk_..."

# Voices usable with a specific model
$ curl "https://api.meshapi.ai/v1/audio/voices?model=hexgrad/kokoro-82m" \
>   -H "Authorization: Bearer rsk_..."
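A client-side lookup over the catalog might be sketched as follows in Node.js. voicesUrl and findVoice are hypothetical helpers, and the sketch assumes the voice list sits under a voices key in the response, which this page does not confirm; only the per-voice fields and the count and truncated flags are documented.

```javascript
// Hypothetical helper: build the catalog URL from the documented filters.
function voicesUrl({ brand, model, search } = {}) {
  const url = new URL("https://api.meshapi.ai/v1/audio/voices");
  for (const [key, value] of Object.entries({ brand, model, search })) {
    if (value !== undefined) url.searchParams.set(key, value);
  }
  return url.toString();
}

// Pick the first voice listed under `model` whose name contains `name`.
// ASSUMPTION: the response exposes the list as `catalog.voices`.
function findVoice(catalog, model, name) {
  if (catalog.truncated) {
    console.warn("Voice list was truncated; narrow the brand/model filter.");
  }
  const needle = name.toLowerCase();
  return catalog.voices.find(
    (voice) => voice.model === model && voice.name.toLowerCase().includes(needle)
  );
}
```

Matching on model as well as name follows the advice above to use a voice only with the model it is listed under.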

Get a single voice

GET /v1/audio/voices/{voice_id}

Fetch metadata for a specific voice by its ElevenLabs voice ID.

$ curl https://api.meshapi.ai/v1/audio/voices/JBFqnCBsd6RMkjVDRZzb \
>   -H "Authorization: Bearer rsk_..."

Error handling

All endpoints use standard HTTP status codes. Common cases:

Code	Meaning
`401`	Missing or invalid API key.
`402`	Insufficient account balance.
`422`	Validation error — check the response body for details.
`429`	Rate limit exceeded.

On error, WebSocket sessions send a JSON error frame and then close the connection with code 1000.
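For the REST status codes in the table above, one possible client policy is to retry only on 429 and fail fast on everything else. The sketch below assumes that policy; classifyStatus and requestWithRetry are hypothetical names, and the attempt count and backoff delays are illustrative defaults rather than documented limits.

```javascript
// Map the documented status codes to a coarse client action.
function classifyStatus(status) {
  if (status >= 200 && status < 300) return "ok";
  if (status === 429) return "retry"; // rate limited
  if (status === 401) return "check_api_key";
  if (status === 402) return "top_up_balance";
  if (status === 422) return "fix_request";
  return "fail";
}

// Call doRequest (any function returning a fetch-style response) and
// retry with exponential backoff while the server answers 429.
async function requestWithRetry(doRequest, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 1; ; attempt++) {
    const res = await doRequest();
    const action = classifyStatus(res.status);
    if (action === "ok") return res;
    if (action !== "retry" || attempt >= maxAttempts) {
      throw new Error(`Request failed with ${res.status} (${action})`);
    }
    // Wait 500 ms, 1 s, 2 s, ... with the default base delay.
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
  }
}
```

Usage would wrap an existing call, for example requestWithRetry(() => fetch(url, { headers })).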