Speech-to-Text

Audio Generation

Mesh API provides a full suite of audio endpoints — convert text to speech, transcribe audio files, stream TTS/STT in real time, and browse available voices — all through a single API key.

All endpoints share the same base URL: https://api.meshapi.ai/v1/audio

Auth: Authorization: Bearer rsk_<your-key> on all REST requests. WebSocket endpoints accept the key via Sec-WebSocket-Protocol: Bearer <rsk_...> or ?api_key=<rsk_...>.


Speech-to-Text

POST /v1/audio/transcriptions

Transcribe an audio file. The provider is resolved automatically from the model name. You can supply audio as a file upload, a public URL, or a cloud storage URL.

This endpoint uses multipart/form-data — not JSON.

Form fields

| Field | Type | Description |
| --- | --- | --- |
| model | string | Required. Model ID (e.g. elevenlabs/scribe_v2, sarvam/saaras:v2). |
| file | file | Audio file to transcribe. Required unless source_url or cloud_storage_url is provided. |
| source_url | string | Public URL to an audio file. Alternative to uploading a file. |
| cloud_storage_url | string | Cloud storage URL (S3, GCS). Alternative to uploading a file. |
| language_code | string | BCP-47 language code (e.g. en-US). Leave blank for auto-detection. |
| diarize | boolean | Identify and label different speakers. |
| num_speakers | integer | Expected number of speakers (1–32). Helps diarization accuracy. |
| timestamps_granularity | string | word or character: granularity of returned timestamps. |
| tag_audio_events | boolean | Detect non-speech events like laughter, applause, etc. |
| diarization_threshold | float | Speaker change sensitivity (0.1–0.4). |
| additional_formats | string | JSON-encoded list of extra output formats (ElevenLabs-specific). |
| file_format | string | Explicit audio format hint if the file extension is ambiguous. |
| temperature | float | Sampling temperature (0–2). |
| seed | integer | Reproducible seed (0–2147483647). |
| use_multi_channel | boolean | Process each audio channel independently. |
| webhook | boolean | Return the result via webhook instead of inline. |
| webhook_id | string | ID of a pre-configured ElevenLabs webhook. |
| webhook_metadata | string | Custom metadata string to include with the webhook callback. |
| keyterms | string[] | Domain-specific terms to boost recognition accuracy. |
| entity_detection | string | Enable named entity detection. |
| entity_redaction | string | Redact specific entity types from the transcript. |
| entity_redaction_mode | string | How redacted entities are shown: replace or remove. |
| no_verbatim | boolean | Clean up filler words and disfluencies. |
| detect_speaker_roles | boolean | Classify speakers by role (e.g. interviewer vs. interviewee). |
| enable_logging | boolean | Default true. Pass false to opt out of ElevenLabs request logging. |
| with_timestamps | boolean | Sarvam-specific: include word-level timestamps. |
| debug_mode | boolean | Sarvam-specific: return extra debug information. |

Response

{ "text": "The transcribed text goes here." }
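
The same request can be made from JavaScript. The following is a minimal sketch, assuming a runtime with global fetch, FormData, and Blob (for example Node 18+ or a browser). The helper names buildTranscriptionForm and transcribe are hypothetical, and treating the three audio sources as mutually exclusive is an assumption rather than documented behavior. Array-valued fields such as keyterms are left out because their multipart encoding is not specified here.

```javascript
const TRANSCRIBE_URL = "https://api.meshapi.ai/v1/audio/transcriptions";

// Build the multipart body. Field names come from the form-field table;
// this sketch assumes exactly one audio source is supplied.
function buildTranscriptionForm({
  model,
  audio, // raw bytes (Uint8Array or Buffer)
  filename = "audio.mp3",
  sourceUrl,
  cloudStorageUrl,
  options = {}, // scalar fields such as language_code or diarize
}) {
  if (!model) throw new Error("model is required");
  const sources = [audio, sourceUrl, cloudStorageUrl].filter((s) => s !== undefined);
  if (sources.length !== 1) {
    throw new Error("provide exactly one of audio, sourceUrl, cloudStorageUrl");
  }
  const form = new FormData();
  form.append("model", model);
  if (audio !== undefined) form.append("file", new Blob([audio]), filename);
  if (sourceUrl !== undefined) form.append("source_url", sourceUrl);
  if (cloudStorageUrl !== undefined) form.append("cloud_storage_url", cloudStorageUrl);
  // Assumption: scalar options are sent in their string form, as curl -F does.
  for (const [name, value] of Object.entries(options)) {
    form.append(name, String(value));
  }
  return form;
}

// Send the request and return the "text" field of the JSON response.
async function transcribe(apiKey, request) {
  const response = await fetch(TRANSCRIBE_URL, {
    method: "POST",
    // Content-Type is left unset so fetch can add the multipart boundary.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(request),
  });
  if (!response.ok) {
    throw new Error(`transcription failed: HTTP ${response.status} ${await response.text()}`);
  }
  return (await response.json()).text;
}
```

A call might look like `await transcribe("rsk_...", { model: "elevenlabs/scribe_v2", sourceUrl: "https://example.com/call.mp3" })`, where the URL is a placeholder.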

Examples

curl -X POST https://api.meshapi.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer rsk_..." \
  -F "model=elevenlabs/scribe_v2" \
  -F "file=@recording.mp3" \
  -F "language_code=en-US" \
  -F "diarize=true"

Transcribe and Translate

POST /v1/audio/transcriptions/translate

Transcribe audio and translate the result directly to English in a single step. Uses Sarvam models by default.

This endpoint uses multipart/form-data.

Form fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | sarvam/saaras:v2 | Model to use. Must support translation. |
| file | file | | Audio file. Required. |
| prompt | string | | Optional context prompt to guide translation. |

Response

{ "text": "The English translation of the transcribed audio." }

Example

curl -X POST https://api.meshapi.ai/v1/audio/transcriptions/translate \
  -H "Authorization: Bearer rsk_..." \
  -F "model=sarvam/saaras:v2" \
  -F "file=@hindi_recording.wav"

If the selected model doesn’t support translation, the API returns a 422 error. Check GET /v1/models to confirm a model’s capabilities.


WebSocket Real-Time STT

WS /v1/audio/transcriptions/realtime

Transcribe audio in real time. Send raw audio chunks as they are captured (e.g. from a microphone) and receive partial and final transcripts as they are produced.

This proxies ElevenLabs’ Scribe v2 Realtime endpoint.

Authentication

Pass your Mesh API key in one of these ways:

  • Sec-WebSocket-Protocol: Bearer rsk_... header
  • ?api_key=rsk_... query parameter
  • ?token=rsk_... query parameter

Query parameters

| Parameter | Default | Description |
| --- | --- | --- |
| model | | Required. Model ID (e.g. scribe_v1). |
| audio_format | pcm_16000 | Format of the audio chunks you will send. |
| language_code | | BCP-47 language code. Omit for auto-detection. |
| commit_strategy | manual | manual: you commit explicitly. auto: VAD-driven commits. |
| include_timestamps | false | Include word-level timestamps in transcripts. |
| include_language_detection | false | Include detected language in each transcript event. |
| keyterms | | Domain-specific terms to boost recognition. |
| no_verbatim | false | Remove filler words and disfluencies. |
| vad_silence_threshold_secs | | Silence duration before VAD triggers a commit (0.3–3 s). |
| vad_threshold | | VAD speech probability threshold (0.1–0.9). |
| min_speech_duration_ms | | Minimum speech segment length (50–2000 ms). |
| min_silence_duration_ms | | Minimum silence to trigger a split (50–2000 ms). |
| enable_logging | true | Pass false to opt out of ElevenLabs logging. |

Supported audio formats: pcm_8000, pcm_16000, pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000
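
Because every option travels in the query string, it can help to validate values locally before opening the socket. The sketch below is one way to do that: the helper name buildRealtimeUrl is hypothetical, the ranges and format list are copied from the table and list above, and passing the key as api_key is just one of the documented options.

```javascript
// Values copied from the parameter table and the supported-format list.
const SUPPORTED_FORMATS = [
  "pcm_8000", "pcm_16000", "pcm_22050", "pcm_24000",
  "pcm_44100", "pcm_48000", "ulaw_8000",
];
const RANGES = {
  vad_silence_threshold_secs: [0.3, 3],
  vad_threshold: [0.1, 0.9],
  min_speech_duration_ms: [50, 2000],
  min_silence_duration_ms: [50, 2000],
};

// Hypothetical helper: checks parameters client-side and returns the wss:// URL.
// The encoding of keyterms is not specified in this document, so pass it
// pre-encoded as a string if you use it.
function buildRealtimeUrl(apiKey, params) {
  if (!params.model) throw new Error("model is required");
  const format = params.audio_format ?? "pcm_16000";
  if (!SUPPORTED_FORMATS.includes(format)) {
    throw new Error(`unsupported audio_format: ${format}`);
  }
  for (const [name, [min, max]] of Object.entries(RANGES)) {
    const value = params[name];
    if (value !== undefined && (value < min || value > max)) {
      throw new RangeError(`${name} must be between ${min} and ${max}`);
    }
  }
  const url = new URL("wss://api.meshapi.ai/v1/audio/transcriptions/realtime");
  for (const [name, value] of Object.entries({ ...params, audio_format: format })) {
    if (value !== undefined) url.searchParams.set(name, String(value));
  }
  url.searchParams.set("api_key", apiKey);
  return url.toString();
}
```

For example, `new WebSocket(buildRealtimeUrl("rsk_...", { model: "scribe_v1", commit_strategy: "auto", vad_threshold: 0.5 }))` would open a session with VAD-driven commits.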

Message protocol

Client → server (JSON frames)

Send input_audio_chunk frames with base64-encoded audio. This is the only message type the server forwards upstream — any other message type is silently dropped.

{
  "message_type": "input_audio_chunk",
  "audio_base_64": "<base64-encoded PCM audio>",
  "sample_rate": 16000,
  "commit": false
}

Set "commit": true to commit the buffered audio and request a final transcript when using commit_strategy=manual.
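
When audio is already in memory (a recorded buffer rather than a live microphone), it can be split into frames up front. The sketch below assumes that pcm_16000 means 16-bit mono PCM, so one second is 16000 × 2 = 32000 bytes and a 100 ms chunk is 3200 bytes. That sample layout, the 100 ms chunk size, and the helper name toAudioFrames are assumptions, not documented values.

```javascript
// Assumption: PCM input is 16-bit mono, i.e. 2 bytes per sample.
const BYTES_PER_SAMPLE = 2;

// Split raw PCM bytes into input_audio_chunk frames of about `chunkMs` each.
// `pcm` must hold raw bytes (Buffer or Uint8Array); an Int16Array would be
// truncated element by element by Buffer.from.
// With commitLast = true, only the final frame carries "commit": true.
function toAudioFrames(pcm, sampleRate = 16000, chunkMs = 100, commitLast = true) {
  const bytes = Buffer.from(pcm);
  const chunkBytes = Math.floor((sampleRate * chunkMs) / 1000) * BYTES_PER_SAMPLE;
  const frames = [];
  for (let offset = 0; offset < bytes.length; offset += chunkBytes) {
    const isLast = offset + chunkBytes >= bytes.length;
    frames.push(
      JSON.stringify({
        message_type: "input_audio_chunk",
        audio_base_64: bytes.subarray(offset, offset + chunkBytes).toString("base64"),
        sample_rate: sampleRate,
        commit: commitLast && isLast,
      })
    );
  }
  return frames;
}
```

Each returned string can be passed straight to `ws.send(frame)` on an open connection.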

Server → client (JSON frames)

| Frame type | Description |
| --- | --- |
| session_started | Sent once after connection is established. |
| partial_transcript | In-progress transcript; text may still change. |
| committed_transcript | Final transcript for the committed audio segment. |
| committed_transcript_with_timestamps | Final transcript including word-level timing data. |
| error | Sent before closing on auth or upstream errors. |
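
One way to consume these frames is a small reducer that keeps committed segments separate from the still-changing partial text. This is a sketch under assumptions: frames are taken to carry message_type and text fields, and the state shape and function names are hypothetical. This document does not say whether committed_transcript_with_timestamps accompanies or replaces committed_transcript, so the sketch stores it separately rather than appending its text.

```javascript
// Hypothetical client-side state: committed segments plus the current partial.
function createTranscriptState() {
  return { started: false, committed: [], partial: "", lastTimed: null, error: null };
}

// Apply one server frame (a JSON string) and return the updated state.
function applyFrame(state, rawFrame) {
  const frame = JSON.parse(rawFrame);
  switch (frame.message_type) {
    case "session_started":
      return { ...state, started: true };
    case "partial_transcript":
      // Partials replace each other; they are never appended.
      return { ...state, partial: frame.text ?? "" };
    case "committed_transcript":
      return {
        ...state,
        committed: [...state.committed, frame.text ?? ""],
        partial: "",
      };
    case "committed_transcript_with_timestamps":
      // Kept aside to avoid double counting if both committed frames arrive.
      return { ...state, lastTimed: frame };
    case "error":
      return { ...state, error: frame };
    default:
      // Unknown frame types are ignored.
      return state;
  }
}
```

Wired into a socket, this becomes `ws.onmessage = ({ data }) => { state = applyFrame(state, data); };`, and `state.committed.join(" ")` gives the finalized text so far.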

Example

const ws = new WebSocket(
  "wss://api.meshapi.ai/v1/audio/transcriptions/realtime" +
    "?model=scribe_v1&audio_format=pcm_16000&commit_strategy=manual&api_key=rsk_..."
);

ws.onopen = () => {
  console.log("Connected. Start sending audio chunks.");
};

ws.onmessage = ({ data }) => {
  const frame = JSON.parse(data);
  if (frame.message_type === "partial_transcript") {
    process.stdout.write("\r" + frame.text);
  } else if (frame.message_type === "committed_transcript") {
    console.log("\nFinal:", frame.text);
  }
};

// Send a PCM audio chunk (from a microphone, for example)
function sendAudioChunk(pcmBuffer) {
  const b64 = Buffer.from(pcmBuffer).toString("base64");
  ws.send(
    JSON.stringify({
      message_type: "input_audio_chunk",
      audio_base_64: b64,
      sample_rate: 16000,
    })
  );
}

// Commit the buffered audio to get a final transcript
function commit() {
  ws.send(
    JSON.stringify({
      message_type: "input_audio_chunk",
      audio_base_64: "",
      sample_rate: 16000,
      commit: true,
    })
  );
}

Voice Management

List voices

GET /v1/audio/voices

Browse voices available on your ElevenLabs account. Supports pagination, full-text search, and filtering.

| Query parameter | Description |
| --- | --- |
| search | Full-text search across voice name and description. |
| voice_type | Filter by type: premade, cloned, generated, or professional. |
| category | Filter by category. |
| voice_ids | Comma-separated list of specific voice IDs to fetch. |
| page_size | Number of results per page (1–100). |
| next_page_token | Token from the previous response to fetch the next page. |
| sort | Field to sort by. |
| sort_direction | asc or desc. |
| include_total_count | Include the total number of matching voices in the response. |
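
Fetching every matching voice means following next_page_token until it runs out. The response body is not documented in this section, so the sketch below assumes each page looks like { voices: [...], next_page_token: "..." } and that a missing token marks the last page. The helper names and the injectable fetchJson parameter are hypothetical.

```javascript
const VOICES_URL = "https://api.meshapi.ai/v1/audio/voices";

// Build the URL for one page; `filters` uses the query parameter names above.
function voicesPageUrl(filters = {}, nextPageToken) {
  const url = new URL(VOICES_URL);
  for (const [name, value] of Object.entries(filters)) {
    if (value !== undefined) url.searchParams.set(name, String(value));
  }
  if (nextPageToken) url.searchParams.set("next_page_token", nextPageToken);
  return url.toString();
}

// Default transport: a GET with the bearer key, returning parsed JSON.
async function defaultFetchJson(url, apiKey) {
  const response = await fetch(url, { headers: { Authorization: `Bearer ${apiKey}` } });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

// Assumption: pages are shaped { voices, next_page_token } and the token is
// absent on the last page. `fetchJson` can be swapped out for testing.
async function listAllVoices(apiKey, filters = {}, fetchJson = defaultFetchJson) {
  const voices = [];
  let token;
  do {
    const page = await fetchJson(voicesPageUrl(filters, token), apiKey);
    voices.push(...(page.voices ?? []));
    token = page.next_page_token;
  } while (token);
  return voices;
}
```

A call such as `await listAllVoices("rsk_...", { voice_type: "premade", page_size: 50 })` would then walk all pages in order.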

curl "https://api.meshapi.ai/v1/audio/voices?search=rachel&voice_type=premade&page_size=10" \
  -H "Authorization: Bearer rsk_..."

Get a single voice

GET /v1/audio/voices/{voice_id}

Fetch metadata for a specific voice by its ElevenLabs voice ID.

curl https://api.meshapi.ai/v1/audio/voices/JBFqnCBsd6RMkjVDRZzb \
  -H "Authorization: Bearer rsk_..."

Error handling

All endpoints use standard HTTP status codes. Common cases:

| Code | Meaning |
| --- | --- |
| 401 | Missing or invalid API key. |
| 402 | Insufficient account balance. |
| 422 | Validation error. Check the response body for details. |
| 429 | Rate limit exceeded. |

WebSocket sessions send a JSON error frame and then close the connection with code 1000.
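
For the REST endpoints, the status codes above can be turned into a simple retry decision. The sketch below is illustrative only: the meanings are copied from the table, but the retry policy (retry on 429 and on unlisted 5xx responses, with exponential backoff) and all helper names are assumptions, not documented behavior.

```javascript
// Meanings copied from the status table; the retry flags are an assumption.
const STATUS_INFO = {
  401: { meaning: "Missing or invalid API key.", retry: false },
  402: { meaning: "Insufficient account balance.", retry: false },
  422: { meaning: "Validation error.", retry: false },
  429: { meaning: "Rate limit exceeded.", retry: true },
};

function classifyStatus(status) {
  if (status >= 200 && status < 300) return { ok: true, retry: false, meaning: "OK" };
  const known = STATUS_INFO[status];
  if (known) return { ok: false, ...known };
  // Assumption: unlisted 5xx responses are treated as transient.
  return { ok: false, retry: status >= 500, meaning: `Unexpected HTTP ${status}` };
}

// Exponential backoff delay in ms for a 0-based attempt number, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// `sendRequest` must build a fresh request on every call, since a multipart
// body cannot be reused after it has been sent.
async function withRetry(sendRequest, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    const response = await sendRequest();
    const verdict = classifyStatus(response.status);
    if (verdict.ok) return response;
    if (!verdict.retry || attempt + 1 >= maxAttempts) {
      throw new Error(`HTTP ${response.status}: ${verdict.meaning}`);
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
}
```

Usage would look like `await withRetry(() => fetch(url, { headers: { Authorization: "Bearer rsk_..." } }))`.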