What Is an OpenAI-Compatible API? A Developer's Guide

An OpenAI-compatible API speaks the /v1/chat/completions contract. Learn what the spec guarantees, what it doesn't, and how to switch providers by changing base_url.

Last updated: June 2026

Key takeaways

  • An OpenAI-compatible API exposes the same POST /v1/chat/completions contract OpenAI defined: a JSON body with model and messages, an Authorization: Bearer <key> header, and Server-Sent Events when stream: true.
  • The main practical win is portability: point any OpenAI SDK at a new base_url, swap the API key and the model string, and the rest of your code is unchanged.
  • The wire format is standardized; capabilities are not. Tool/function calling, vision input, JSON mode, and streaming are per-model and per-provider features that vary widely.
  • Self-hosted servers (vLLM, SGLang, TGI, Ollama) and gateways (Speka, OpenRouter, Together AI) all implement subsets of the spec — always check the feature matrix for the specific model you call.
  • Speka is a managed OpenAI-compatible gateway at https://speka.me/v1 serving 16 frontier models from 7 labs through one key and one contract.

What is an OpenAI-compatible API?

An OpenAI-compatible API is any HTTP service that implements the request and response shapes OpenAI published for its Chat Completions endpoint, so existing OpenAI client code works against it without modification. Concretely, the server accepts POST /v1/chat/completions with a JSON body containing a model identifier and a messages array, authenticates via an Authorization: Bearer <token> header, and returns a JSON object containing a choices array — or, when stream: true is set, a Server-Sent Events stream of chat.completion.chunk deltas terminated by data: [DONE]. Because the contract is fixed, you can move a workload from OpenAI to another vendor or to a model you host yourself by changing the base_url, the key, and the model string. Nothing in the spec is OpenAI-proprietary at the protocol level.

Where did the "spec" come from, and is it a real standard?

There is no formal standards body behind it. The "spec" is OpenAI's own Chat Completions API reference plus the de facto behavior of the OpenAI Python SDK. Other vendors and inference servers chose to mirror that surface because it was already the integration target for thousands of libraries and tools. The result is a convention, not a ratified standard — which is exactly why coverage is uneven once you move past the core fields. Treat "OpenAI-compatible" as a claim to verify, not a guarantee.

What does the contract actually look like?

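Three pieces define a request: the URL (a POST to /v1/chat/completions under the provider's base URL), the auth header, and the JSON body (at minimum a model identifier and a messages array).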

Authentication uses the bearer-token scheme from RFC 6750: an Authorization: Bearer <key> header. On Speka, keys carry the sk-speka-live- prefix.
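As a rough illustration of how those three pieces fit together, the sketch below assembles a request with Python's standard library without sending it. The helper name build_chat_request is made up for this example, and the key is the same elided placeholder used elsewhere in this post.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, stream=False):
    """Assemble (but do not send) a Chat Completions request.

    build_chat_request is a hypothetical helper for illustration only.
    """
    body = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        # Piece 1: the URL, i.e. the provider's base URL plus the fixed path.
        url=base_url.rstrip("/") + "/chat/completions",
        # Piece 3: the JSON body, encoded as UTF-8 bytes.
        data=json.dumps(body).encode("utf-8"),
        # Piece 2: the bearer-token auth header (plus the content type).
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "https://speka.me/v1",
    "sk-speka-live-...",  # elided placeholder, not a usable key
    "meta/llama-3.3-70b-instruct",
    [{"role": "user", "content": "ping"}],
)
print(req.get_method(), req.full_url)
```

Passing the resulting object to urllib.request.urlopen would actually send it; that step is left out here so the sketch runs without a key or network access.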

Here is a complete request with curl against Speka's endpoint using a real model id:

curl https://speka.me/v1/chat/completions \
  -H "Authorization: Bearer sk-speka-live-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {"role": "system", "content": "You are a terse assistant."},
      {"role": "user", "content": "Name three OpenAI-compatible inference servers."}
    ],
    "stream": false
  }'

The response is a JSON object whose choices[0].message.content holds the model's reply, alongside a usage object reporting prompt_tokens, completion_tokens, and total_tokens.
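To make that shape concrete, here is a minimal sketch that pulls the reply and the token counts out of a parsed response. The sample object is hand-written to match the fields described above, not captured from a real server, and extract_reply is a hypothetical helper.

```python
import json

# Hand-written sample in the documented response shape; the id and the
# token counts are illustrative values, not real server output.
sample = json.loads("""
{
  "id": "chatcmpl-example",
  "object": "chat.completion",
  "model": "meta/llama-3.3-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "vLLM, SGLang, TGI."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33}
}
""")


def extract_reply(response):
    """Return (text, usage) from a parsed chat.completion object."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    # content can be null (for example when the model answers with tool
    # calls), so fall back to an empty string.
    text = choices[0]["message"].get("content") or ""
    return text, response.get("usage", {})


text, usage = extract_reply(sample)
print(text)
print(usage["total_tokens"])
```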

How do I switch providers by changing base_url and model?

This is the entire point of the contract. The OpenAI Python SDK reads two values you can override: the base URL and the API key. Set them, change the model string, and your existing call sites keep working.

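from openai import OpenAI

# Point the stock OpenAI client at a different provider.
client = OpenAI(
    base_url="https://speka.me/v1",
    api_key="sk-speka-live-...",
)

# stream=True returns an iterator of chat.completion.chunk objects.
resp = client.chat.completions.create(
    model="deepseek-ai/deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain SSE in two sentences."}],
    stream=True,
)

for chunk in resp:
    # Some servers emit chunks with an empty choices list (for example a
    # trailing usage-only chunk), so guard before indexing.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)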

The same base_url swap works for framework integrations built on the OpenAI client — LangChain's ChatOpenAI, LlamaIndex, and the OpenAI nodes in n8n all accept a custom base URL and key. Browse Speka's model catalog for valid model strings such as meta/llama-3.3-70b-instruct, deepseek-ai/deepseek-v4-flash, and openai/gpt-oss-120b.
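One way to keep the swap to a single setting is a small provider table that call sites read from. The sketch below is an assumption about how you might organize this, not something the SDK or Speka prescribes. The Speka URL and model id come from this post; the environment-variable names, the local entry (a server assumed to be listening on localhost port 8000), and its model name are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    base_url: str
    api_key_env: str  # name of the environment variable holding the key
    model: str


# Hypothetical registry. Only the Speka URL and model id are taken from
# the article; everything else here is made up for illustration.
PROVIDERS = {
    "speka": Provider(
        "https://speka.me/v1", "SPEKA_API_KEY", "deepseek-ai/deepseek-v4-flash"
    ),
    "local": Provider(
        "http://localhost:8000/v1", "LOCAL_API_KEY", "my-local-model"
    ),
}


def client_settings(name, env=None):
    """Resolve the three values that change between providers."""
    env = os.environ if env is None else env
    provider = PROVIDERS[name]
    return {
        "base_url": provider.base_url,
        "api_key": env.get(provider.api_key_env, ""),
        "model": provider.model,
    }


settings = client_settings("speka", env={"SPEKA_API_KEY": "sk-speka-live-..."})
print(settings["base_url"], settings["model"])

# With the openai package installed, the call site would then look like:
#   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
#   client.chat.completions.create(model=settings["model"], messages=[...])
```

Keeping the key in the environment rather than in the table means the registry can live in version control without leaking credentials.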

How does streaming work over the API?

When you send stream: true, the server responds with Content-Type: text/event-stream and emits incremental Server-Sent Events. Each event is a data: line containing a JSON chat.completion.chunk; the new text lives in choices[0].delta.content. The stream ends with a literal data: [DONE] sentinel. SDKs hide this framing behind an iterator, as in the Python example above, but if you parse the HTTP stream yourself, you handle the data:-prefixed lines and the [DONE] terminator directly.
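If you do parse the stream yourself, the framing described above comes down to a few lines. The following is a minimal sketch that assumes the HTTP body has already been split into decoded text lines; the sample lines are hand-written in the chunk shape, and iter_text_deltas is a hypothetical name.

```python
import json


def iter_text_deltas(lines):
    """Yield text fragments from an iterable of decoded SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # literal sentinel that ends the stream
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some servers send chunks without choices
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample stream; a real one arrives over HTTP with
# Content-Type: text/event-stream.
sample_stream = [
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_text_deltas(sample_stream)))  # prints "Hello"
```

The first chunk carries only the role and no content, which is why the parser checks for text before yielding.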

What does the spec guarantee — and what does it not?

The standardized core is small and reliable: the endpoint path, the bearer-auth header, the model + messages request shape, the choices/usage response shape, and SSE streaming framing. Beyond that, everything is optional and model-dependent.

Feature | Standardized wire format? | Reliable across all providers/models?
--- | --- | ---
Chat completions (messages) | Yes | Yes
Bearer-token auth | Yes (RFC 6750) | Yes
SSE streaming (stream: true) | Yes | Mostly; verify per server
Tool / function calling (tools) | Yes (request shape) | No; model and provider must both support it
JSON / structured output mode | Partly | No; varies by model
Vision (image inputs) | Yes (content parts) | No; only on vision models
Embeddings (/v1/embeddings) | Yes | No; separate endpoint, not always present
Image generation (/v1/images/generations) | Yes | No; image models only

Two endpoints often ship alongside a chat API without being part of the chat contract: /v1/embeddings and /v1/images/generations. A provider can be "OpenAI-compatible" for chat and still serve neither. Likewise, OpenAI's newer Responses API, Assistants/Threads, and the moderation endpoints are usually not mirrored by third parties; Together AI, for example, documents that these are unsupported. Always read the feature matrix for the exact model you intend to call.
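Because optional features fail in provider-specific ways, it can help to check a request against a capability table before sending it. The sketch below shows one possible shape for that check. The table, the model ids, and the function name are all hypothetical; real entries would have to be copied from the provider's own feature matrix.

```python
# Hypothetical capability table. Real values must come from the provider's
# per-model documentation; these model ids do not refer to real models.
CAPABILITIES = {
    "example/chat-only-model": {"tools": False, "vision": False, "json_mode": False},
    "example/tool-model": {"tools": True, "vision": False, "json_mode": True},
}


def unsupported_features(payload, table):
    """List optional features the payload uses that the model lacks."""
    caps = table.get(payload.get("model"), {})
    problems = []
    if payload.get("tools") and not caps.get("tools"):
        problems.append("tools")
    if payload.get("response_format") and not caps.get("json_mode"):
        problems.append("response_format")
    # Vision requests carry content as a list of parts, one of which has
    # type "image_url"; plain chat uses a string for content.
    uses_images = any(
        isinstance(message.get("content"), list)
        and any(part.get("type") == "image_url" for part in message["content"])
        for message in payload.get("messages", [])
    )
    if uses_images and not caps.get("vision"):
        problems.append("vision")
    return problems


payload = {
    "model": "example/chat-only-model",
    "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather"}}],
}
print(unsupported_features(payload, CAPABILITIES))  # prints ['tools']
```

An unknown model is treated as supporting nothing optional, which errs on the side of flagging a request rather than letting it fail at the server.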

Which servers and gateways implement the spec?

Two categories matter: inference servers you run yourself, and managed gateways that aggregate many models behind one key.

Self-hosted servers such as vLLM, SGLang, TGI, and Ollama expose an OpenAI-compatible endpoint in front of weights you operate.

Managed gateways give you many models through one OpenAI-compatible key:

Gateway | OpenAI-compatible | Model count (approx., as of June 2026) | Native tool calling | Image generation
--- | --- | --- | --- | ---
Speka | Yes, drop-in SDK swap | 16 (curated, 7 labs) | Yes | Yes (FLUX.1)
OpenRouter | Yes, documented drop-in | 300+ (sources cite up to 500+) | Yes, where the underlying model supports it | Yes
Together AI | Yes (no Responses/Assistants/Batch) | 200+ | Yes, model-dependent | Yes (FLUX, others)

Competitor figures are dated to June 2026 and reflect each vendor's own documentation; model counts in particular move quickly, so verify before you rely on them.

Why does a curated gateway like Speka exist?

A 300-to-500-model marketplace optimizes for breadth. Speka optimizes for a small, verifiable catalog of real models you can name in production code. The 16 models come from DeepSeek, NVIDIA, Meta, Mistral AI, Moonshot AI, OpenAI, and Black Forest Labs: for instance meta/llama-3.3-70b-instruct, openai/gpt-oss-120b, and the FLUX.1 [dev] image model from Black Forest Labs. Reasoning and chat models span DeepSeek V4 Flash, Nemotron Super 49B (via NVIDIA NIM), Llama 4 Maverick, Kimi K2.6, and Mistral Large 3. Speka supports native tool/function calling, JSON mode, streaming, embeddings, and image generation across the catalog where the model allows it. Full per-model details live on the models page, and the docs cover endpoints and parameters.

What does it cost to run on Speka?

Pricing is usage-based per token, and every plan includes a monthly allowance with overage billed at standard per-token rates — no overage penalties. Representative rates (per 1M tokens, in/out): DeepSeek V4 Flash $0.27 / $1.10, Llama 3.3 70B Instruct $0.20 / $0.20, GPT-OSS 120B $0.15 / $0.60, Mistral Large 3 $0.90 / $2.70, and Kimi K2.6 $0.50 / $2.00. Image models run $0.04/image (FLUX.1 [dev]) and $0.02/image (FLUX.1 [schnell]); embeddings start at $0.01/1M. The Free plan is $0/mo with $1 of usage included, no credit card, 10 rpm, and 1 key — enough to validate the base-URL swap before committing. See pricing for Starter, Pro, and Scale tiers.
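To see what those rates mean for a single request, multiply the token counts from the usage object by the per-million prices. The sketch below does that arithmetic with the representative rates quoted above; the token counts are invented for the example, and request_cost is a hypothetical helper.

```python
from decimal import Decimal

# (input, output) price in USD per 1M tokens, copied from the
# representative rates quoted in this post.
RATES = {
    "DeepSeek V4 Flash": (Decimal("0.27"), Decimal("1.10")),
    "Llama 3.3 70B Instruct": (Decimal("0.20"), Decimal("0.20")),
    "GPT-OSS 120B": (Decimal("0.15"), Decimal("0.60")),
}
MILLION = Decimal(1_000_000)


def request_cost(model_name, prompt_tokens, completion_tokens):
    """Cost in USD for one request, given its usage token counts."""
    rate_in, rate_out = RATES[model_name]
    return (rate_in * prompt_tokens + rate_out * completion_tokens) / MILLION


# Illustrative usage figures: 120,000 prompt tokens, 30,000 completion tokens.
# 120,000 * 0.27 / 1M = 0.0324 and 30,000 * 1.10 / 1M = 0.0330.
cost = request_cost("DeepSeek V4 Flash", 120_000, 30_000)
print(f"${cost:.4f}")  # prints $0.0654
```

Decimal is used instead of float so the cents add up exactly when many requests are summed.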

Frequently asked questions

Is an OpenAI-compatible API the same as using OpenAI's models?

No. "OpenAI-compatible" describes the request and response format — the /v1/chat/completions contract — not the underlying model. A compatible endpoint can serve Llama, DeepSeek, Mistral, or any other model while still accepting OpenAI-shaped requests. Speka, for example, is OpenAI-compatible but hosts 16 open and frontier models from seven labs, not OpenAI's hosted GPT family.

Can I reuse the OpenAI Python SDK with a different provider?

Yes. The OpenAI Python SDK exposes base_url and api_key parameters on the client. Set base_url to the provider's endpoint (for Speka, https://speka.me/v1), supply that provider's key, and change the model string to one the provider hosts. Existing client.chat.completions.create(...) call sites continue to work without further changes for standard chat requests.

Does OpenAI-compatible guarantee tool calling and vision work?

No. The contract standardizes the shape of the tools parameter and image content parts, but whether a given request succeeds depends on the specific model and provider. Tool calling only works when the underlying model was trained for it and the server implements the feature; vision inputs require a vision-capable model. Always check the provider's per-model capability matrix before relying on these.
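For reference, the standardized part looks roughly like this: a tools entry is a function description with a JSON Schema for its parameters, and a model that decides to call it answers with tool_calls whose arguments arrive as a JSON string. The sketch below builds one tool definition and reads both kinds of reply. The get_weather function and the two sample responses are hypothetical, written by hand in the documented shape.

```python
import json

# A hypothetical tool definition in the standardized request shape.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def read_tool_calls(response):
    """Return [(name, arguments_dict), ...] from a parsed chat.completion."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # arguments is a JSON-encoded string, not a nested object.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written samples: one reply that calls the tool, one that does not.
with_call = {"choices": [{"message": {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_example",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Oslo\"}"},
    }],
}, "finish_reason": "tool_calls"}]}
plain = {"choices": [{"message": {"role": "assistant", "content": "Hi."},
                      "finish_reason": "stop"}]}

print(read_tool_calls(with_call))  # prints [('get_weather', {'city': 'Oslo'})]
print(read_tool_calls(plain))      # prints []
```

Code that always handles the empty case keeps working when a model or provider silently ignores the tools parameter and replies with plain text.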

What is the difference between a self-hosted server and a managed gateway?

A self-hosted server like vLLM, SGLang, TGI, or Ollama runs model weights on infrastructure you operate and exposes one OpenAI-compatible endpoint per deployment. A managed gateway like Speka, OpenRouter, or Together AI runs the infrastructure for you and fronts many models behind a single key and base URL, handling scaling, billing, and uptime so you only change the model string.

How does streaming differ from a normal response?

A normal response returns one complete JSON object after the model finishes. With stream: true, the server returns a Server-Sent Events stream: incremental chat.completion.chunk events, each carrying a token delta in choices[0].delta.content, ending with data: [DONE]. Streaming lowers time-to-first-token for interactive UIs; concatenating the deltas yields the same message content that a non-streamed response would carry for the same generation.

Are embeddings and image generation part of the chat contract?

No. Embeddings (/v1/embeddings) and image generation (/v1/images/generations) are separate endpoints. A provider can be OpenAI-compatible for chat completions while offering neither. Speka serves both — NV-EmbedQA E5 v5 and NV-Embed v1 for embeddings, FLUX.1 [dev] and [schnell] for images — but you should confirm endpoint support with any provider before integrating.

Try it on Speka

If you already speak the OpenAI Chat Completions contract, you are one base-URL change away from running on Speka. Point your client at https://speka.me/v1, drop in an sk-speka-live- key, pick a model from the catalog, and ship. The Free plan requires no credit card and includes $1 of usage so you can verify the swap end-to-end before scaling up — start at /signup and read the docs for the full endpoint reference.
