Compare

Use POST /v1/chat/compare when you want to run the same conversation across multiple models and inspect the results side by side.

How it works

  1. Fan-out: All requested models are called concurrently. The total wall-clock time is roughly that of the slowest model, not the sum of all models.
  2. Error isolation: If a single model fails or times out (hard timeout of 120s), the others continue unaffected. Partial results are returned with a partial: true flag.
  3. Synthesis (default): After all models respond, a separate comparison LLM analyzes the responses and produces a structured evaluation covering accuracy, completeness, clarity, and a recommendation.
  4. Skip synthesis (optional): Set skip_comparison: true to skip the synthesis step and receive only the raw model outputs. This is useful for parallel streaming UIs that perform their own comparison.
  5. Rate limiting and billing: The entire comparison counts as a single request against your rate limits (RPM/RPD). Billing, however, records each model call plus the comparison call as separate usage events (N+1 events for N models).
  6. Streaming: Two streaming modes are available by setting stream: true. With synthesis enabled, fan-out is non-streaming, but the final comparison text is streamed token-by-token. If skip_comparison: true is set, each fan-out model streams its tokens in real-time concurrently, tagged by model name.
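
The fan-out and error-isolation behavior in steps 1 and 2 can be illustrated with a small client-side sketch. It is not the service's implementation: `fan_out` and the `callers` mapping are hypothetical names, and the callables stand in for real model calls. It only shows how concurrent calls, a shared time budget, and a `partial` flag fit together.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def fan_out(callers, prompt, timeout_s=120.0):
    """Call every model concurrently and isolate failures per model.

    `callers` maps a model name to a function that takes the prompt.
    These callables are hypothetical stand-ins for real model calls.
    """
    results = {}
    start = time.monotonic()
    # Note: leaving the `with` block waits for worker threads to finish,
    # so a real client would also need per-call network timeouts.
    with ThreadPoolExecutor(max_workers=max(1, len(callers))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in callers.items()}
        for name, future in futures.items():
            # All calls share one time budget measured from the start.
            remaining = max(0.0, timeout_s - (time.monotonic() - start))
            try:
                results[name] = {"output": future.result(timeout=remaining), "error": None}
            except FutureTimeout:
                results[name] = {"output": None, "error": "timeout"}
            except Exception as exc:
                # One failing model does not affect the others.
                results[name] = {"output": None, "error": str(exc)}
    partial = any(r["error"] is not None for r in results.values())
    return {"results": results, "partial": partial}
```

Because the calls run in parallel, the elapsed time tracks the slowest caller rather than the sum, and a single failure only marks the overall result as partial.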

Basic request

curl https://api.meshapi.ai/v1/chat/compare \
  -H "Authorization: Bearer <YOUR_RSK_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "models": [
      "openai/gpt-4o-mini",
      "anthropic/claude-3.5-haiku"
    ],
    "messages": [
      { "role": "user", "content": "Explain vector search in two sentences." }
    ]
  }'

Request fields

| Field | Type | Notes |
| --- | --- | --- |
| models | string[] | Models to compare. |
| messages | chat message[] | Conversation sent to each model. |
| comparison_model | string | Optional model used for synthesized comparison output. |
| comparison_instructions | string | Optional comparison rubric or guidance. |
| skip_comparison | boolean | Return per-model outputs without synthesized comparison text. |
| stream | boolean | Optional streaming mode. |
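
As a rough illustration of how these fields combine, the sketch below builds a request with Python's standard library only. `build_compare_request` is a hypothetical helper, not part of any SDK. Leaving unset optional fields out of the body and rejecting an empty model list are choices made here, not documented server rules.

```python
import json
import urllib.request

COMPARE_URL = "https://api.meshapi.ai/v1/chat/compare"


def build_compare_request(api_key, models, messages, *, comparison_model=None,
                          comparison_instructions=None, skip_comparison=False,
                          stream=False):
    """Build (but do not send) a POST request for the compare endpoint."""
    if not models:
        # Client-side sanity check only; not a documented server rule.
        raise ValueError("models must contain at least one model name")
    payload = {"models": list(models), "messages": list(messages)}
    # Optional fields are included only when the caller sets them.
    if comparison_model is not None:
        payload["comparison_model"] = comparison_model
    if comparison_instructions is not None:
        payload["comparison_instructions"] = comparison_instructions
    if skip_comparison:
        payload["skip_comparison"] = True
    if stream:
        payload["stream"] = True
    return urllib.request.Request(
        COMPARE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to send it. Nothing is sent by the helper itself.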

Response shape

The response includes:

  • the compared model list
  • one result per model
  • optional synthesized comparison text
  • latency and request metadata

Streaming (SSE)

Set "stream": true to receive a text/event-stream with typed events. There are two streaming modes:

Mode 1: With comparison (skip_comparison: false, default)

Fan-out models are non-streaming (full response collected per model), then the comparison LLM streams token-by-token.

| Event | When | Payload |
| --- | --- | --- |
| meta | Immediately after auth | {"comparison_id", "models", "comparison_model", "skip_comparison": false} |
| model_chunk | As each fan-out model finishes | {"model", "delta", "latency_ms", "error", "error_code", "usage"} |
| model_done | All fan-out results collected | {"results": [...]} |
| comparison_chunk | During comparison LLM streaming | {"delta": "<token>", "finish_reason": null or "stop"} |
| done | All complete | {"comparison_id", "total_latency_ms", "partial", "comparison_model", "comparison_fallback_used"} |
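
A minimal sketch of consuming this mode follows. It assumes standard SSE framing: an `event:` line carrying the event name, a `data:` line carrying a JSON payload, and a blank line ending each event. The page does not spell out the wire format, so treat that as an assumption. `iter_sse_events` and `collect_comparison` are hypothetical helper names.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, payload) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event or "message", json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith(":"):
            continue  # SSE comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Payloads are JSON, so surrounding whitespace is safe to strip.
            data.append(line[len("data:"):].strip())


def collect_comparison(events):
    """Fold Mode 1 events into fan-out results plus the streamed comparison text."""
    state = {"results": None, "comparison": "", "done": None}
    for name, payload in events:
        if name == "model_done":
            state["results"] = payload["results"]
        elif name == "comparison_chunk":
            state["comparison"] += payload.get("delta") or ""
        elif name == "done":
            state["done"] = payload
    return state
```

In a UI, the `comparison_chunk` branch is where each delta would be appended to the display as it arrives, rather than accumulated into a string.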

Mode 2: Skip comparison (skip_comparison: true)

Each fan-out model streams tokens in real time concurrently, tagged by model name. No comparison LLM is called.

| Event | When | Payload |
| --- | --- | --- |
| meta | Immediately after auth | {"comparison_id", "models", "comparison_model": null, "skip_comparison": true} |
| model_chunk | Each token from any model | {"model": "...", "delta": "Hello", "finish_reason": null} |
| model_stream_done | One model's stream ends | {"model": "...", "finish_reason": "stop", "usage": {...}, "error": null or "..."} |
| done | All models finished | {"comparison_id", "total_latency_ms", "partial", "skip_comparison": true} |
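
Since tokens from different models arrive interleaved in this mode, a client has to route each one by its `model` field. The sketch below shows one way to do that over already-parsed (event name, payload) pairs. `demux_model_streams` is a hypothetical name, and the payload keys are the ones listed in the table above.

```python
from collections import defaultdict


def demux_model_streams(events):
    """Group Mode 2 events by model name.

    Returns (text_by_model, finish_info_by_model, done_payload).
    """
    text = defaultdict(str)
    finished = {}
    summary = None
    for name, payload in events:
        if name == "model_chunk":
            # Tokens from different models interleave; route by model name.
            text[payload["model"]] += payload.get("delta") or ""
        elif name == "model_stream_done":
            finished[payload["model"]] = {
                "finish_reason": payload.get("finish_reason"),
                "usage": payload.get("usage"),
                "error": payload.get("error"),
            }
        elif name == "done":
            summary = payload
    return dict(text), finished, summary
```

A side-by-side UI would typically update one panel per model inside the `model_chunk` branch instead of buffering the full text.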

SDK coverage

  • Node: client.compare.create(...)
  • Python: client.compare.create(...)
  • Go: client.Compare.Create(...)
  • Java: client.compare().create(...)

Streaming compare is also available in the SDKs through their compare stream methods when you want incremental events instead of a single final JSON response.

When to use compare

  • Evaluate multiple candidate models for a task
  • Compare cost/quality trade-offs before choosing a default
  • Build internal prompt evaluations with a stable request shape