LLM API Gateway: What It Is and When You Need One

An LLM API gateway gives you one endpoint, one key, and one bill across providers. Learn what it does, when you need one, and how Speka compares.

Last updated: June 2026

Key takeaways

  • An LLM API gateway puts one endpoint, one API key, and one bill in front of many model providers, so you can route, fall back, and track spend without integrating each vendor separately.
  • The category includes hosted gateways and marketplaces such as OpenRouter and Together AI, as well as Speka, which exposes 16 frontier models from 7 labs behind a single OpenAI-compatible API at https://speka.me/v1.
  • You probably want a gateway when you use two or more models, need failover, or want unified cost accounting; you probably don't if you run a single self-hosted model with no fallback requirement.
  • Because Speka is OpenAI-compatible, migrating is a two-line change: set base_url to https://speka.me/v1 and use an sk-speka-live-... key. The rest of your OpenAI SDK code is unchanged.
  • Speka pricing is usage-based per token. Every plan includes a monthly allowance, and usage beyond it bills at the standard per-token rates with no penalty multiplier. See /pricing.

What is an LLM API gateway?

An LLM API gateway is a service that sits between your application and one or more model providers, exposing a single HTTP endpoint and a single API key. You send a standard chat-completions request; the gateway routes it to the requested model at whichever provider serves it, and handles authentication, optional fallback to a backup model, response streaming, and unified usage and cost tracking. Instead of integrating DeepSeek, Meta, Mistral, OpenAI, and others one by one, each with its own SDK, key, rate limit, and invoice, you call the gateway and it normalizes the differences. Most gateways speak the OpenAI Chat Completions schema, so existing client code works with a base-URL swap. The practical payoff is fewer integrations, centralized billing, and the ability to switch or combine models without redeploying client code.

What does an LLM API gateway actually do?

A gateway typically provides five things:

  1. A unified API surface. One schema (usually OpenAI's Chat Completions) for every model, so model: "meta/llama-3.3-70b-instruct" and model: "deepseek-ai/deepseek-v4-flash" are called the same way.
  2. One credential. A single bearer key — passed per RFC 6750 as Authorization: Bearer ... — instead of N provider keys to store and rotate.
  3. Routing and fallback. Pick a model per request; optionally fail over to a backup if the primary errors or is rate-limited.
  4. Streaming. Token-by-token responses over Server-Sent Events (stream: true), so UIs can render incrementally.
  5. Spend tracking. Per-key, per-model usage and cost in one dashboard and one invoice.
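As a rough illustration of items 1 and 2, the sketch below builds the same chat-completions request for two different model ids using only the standard library. The helper name build_chat_request is hypothetical, the key is a placeholder, and the /chat/completions path follows the OpenAI convention; nothing is sent over the network.

```python
import json
import urllib.request

BASE_URL = "https://speka.me/v1"   # gateway endpoint named in this post
API_KEY = "sk-speka-live-..."      # placeholder, not a real key


def build_chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",  # one bearer credential for every model
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Two models from different labs, same schema and same credential.
req_a = build_chat_request("meta/llama-3.3-70b-instruct", "Hello")
req_b = build_chat_request("deepseek-ai/deepseek-v4-flash", "Hello")
```

Only the model string differs between the two requests, which is the point of a unified API surface.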

Many gateways add native tool/function calling, JSON mode, embeddings, and image generation on top. Speka supports all four.
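For a sense of what tool calling and JSON mode look like on the wire, here is a minimal sketch of two request bodies using the standard OpenAI parameters (tools, tool_choice, response_format). The get_weather tool is hypothetical, and whether a particular model honors each parameter is something to confirm on its model page.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

tool_call_request = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

json_mode_request = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
        # JSON mode works best when the prompt itself asks for JSON.
        {"role": "system", "content": "Reply with a JSON object with keys 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
    "response_format": {"type": "json_object"},
}

# Both bodies serialize to plain JSON, so they work with any HTTP client.
print(json.dumps(tool_call_request, indent=2))
```

With the OpenAI Python SDK, either dict can be passed as keyword arguments, for example client.chat.completions.create(**json_mode_request).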

When do you need an LLM API gateway?

Use this as a decision checklist. You likely need a gateway if any of these are true:

  • You call two or more models (e.g. a cheap model for classification, a stronger one for reasoning) and don't want two integrations.
  • You want failover — automatically route to a backup model when the primary is down or throttled.
  • You need consolidated billing and per-key spend limits across teams, environments, or customers.
  • You're evaluating models and want to swap by changing a string, not rewriting client code.
  • You need multiple modalities (chat, vision, embeddings, image generation) behind one auth scheme.

You likely don't need one if:

  • You run a single self-hosted model with vLLM, SGLang, Hugging Face TGI, or Ollama and have no second provider.
  • Compliance requires inference to stay entirely on your own hardware, with no third-party hop.
  • Your usage is trivial and static — one model, one key, predictable volume — and an extra dependency isn't worth it.

For local-only setups, the OpenAI-compatible servers above already give you a single schema without a gateway. A gateway earns its place once you cross provider boundaries.
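The checklist above can be written down as a small predicate. This is only a sketch of the reasoning in this section: the Workload fields and the gateway_recommended name are made up for illustration, and real decisions will weigh more than six booleans.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a deployment, mirroring the checklist."""
    models_used: int = 1
    needs_failover: bool = False
    needs_consolidated_billing: bool = False
    evaluating_models: bool = False
    needs_multiple_modalities: bool = False
    must_stay_on_own_hardware: bool = False


def gateway_recommended(w: Workload) -> bool:
    # A hard on-premises requirement rules out any third-party hop.
    if w.must_stay_on_own_hardware:
        return False
    # Otherwise, any single "yes" on the checklist is enough.
    return (
        w.models_used >= 2
        or w.needs_failover
        or w.needs_consolidated_billing
        or w.evaluating_models
        or w.needs_multiple_modalities
    )


single_local_model = Workload()
cheap_plus_strong = Workload(models_used=2)
regulated = Workload(models_used=3, needs_failover=True, must_stay_on_own_hardware=True)
```

Here gateway_recommended returns False for the single local model and for the regulated workload, and True for the two-model setup.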

Self-managed vs gateway vs hosted unified API

These three approaches overlap, so it helps to be precise about what each optimizes for.

| Dimension | Self-managed per-provider | Hosted unified API / gateway (this category) | Self-hosted single endpoint |
|---|---|---|---|
| Integrations to maintain | One per provider | One | One |
| Keys to manage | One per provider | One | One (yours) |
| Routing & fallback across providers | Build it yourself | Built-in | N/A (single backend) |
| Unified billing | No (N invoices) | Yes (one invoice) | You pay infra directly |
| Model breadth | Whatever you wire up | Curated to very large catalogs | Whatever you host |
| Infra/ops burden | Medium to high | Low | High (GPUs, scaling) |
| Data path | Direct to each provider | Through the gateway | Stays on your hardware |

"Hosted unified API" and "gateway" are effectively the same category here. In concrete terms, it includes OpenRouter, Together AI, and Speka, with different trade-offs:

  • OpenRouter is a marketplace-style gateway. As of June 2026 its own docs cite 300+ models across many providers (third-party sources cite more), with smart routing, fallbacks, unified billing, a usage-analytics dashboard, a free-model catalog (with rate limits), and image generation. If raw breadth and provider choice are the priority, that's its strength.
  • Together AI advertises 200+ open-source and partner models behind an OpenAI-compatible API, with tool calling on most chat models, image generation, and usage/spend dashboards. Note (per its docs) that some OpenAI surfaces — Responses API, Assistants/Threads — are not supported. Its free signup credit amount varies by source/promo, so verify on signup.
  • Speka is a curated gateway: 16 frontier models from 7 labs (DeepSeek, NVIDIA, Meta, Mistral AI, Moonshot AI, OpenAI, Black Forest Labs), OpenAI-compatible, with tool calling, JSON mode, streaming, embeddings, and image generation. Fewer models, but every one is a known frontier model with published per-token pricing. See the full list at /models.

The honest summary: choose breadth-first marketplaces when you want to shop hundreds of models; choose a curated gateway like Speka when you want a vetted shortlist with transparent pricing and no surprises.

How do I call an LLM gateway? (Speka worked example)

Speka works as a drop-in backend for the OpenAI Python SDK: change two values, base_url and api_key.

from openai import OpenAI

client = OpenAI(
    base_url="https://speka.me/v1",
    api_key="sk-speka-live-...",  # your Speka key
)

resp = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",   # see /models for ids
    messages=[{"role": "user", "content": "Explain an API gateway in one sentence."}],
    stream=True,
)
for chunk in resp:
    if not chunk.choices:  # some servers emit chunks with no choices (e.g. usage-only)
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
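
If you are not using the SDK, a streamed response arrives as Server-Sent Events. The sketch below parses lines in the format the OpenAI API uses (one `data: {json}` line per chunk, ending with `data: [DONE]`); that wire format is an assumption for any particular gateway, and the sample lines are hand-written for illustration rather than captured output.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))  # prints: Hello, world
```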

A comparable non-streaming request with curl, here using a different model and prompt:

curl https://speka.me/v1/chat/completions \
  -H "Authorization: Bearer sk-speka-live-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Give me three failover strategies."}]
  }'

To switch models, change one string — for example to openai/gpt-oss-120b for code or meta/llama-4-maverick-17b-128e-instruct for chat. Framework integrations work the same way: LangChain's ChatOpenAI, LlamaIndex, and n8n all accept a custom base URL. Full reference lives at /docs.
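One way to keep that "change one string" property in your own code is to route by task through a lookup table. This is a sketch, not a Speka feature: MODEL_FOR_TASK, DEFAULT_MODEL, and completion_kwargs are hypothetical names, and the task labels simply reuse the examples above.

```python
# Task-to-model routing table; edit a value here to switch models everywhere.
MODEL_FOR_TASK = {
    "code": "openai/gpt-oss-120b",
    "chat": "meta/llama-4-maverick-17b-128e-instruct",
}
DEFAULT_MODEL = "meta/llama-3.3-70b-instruct"


def completion_kwargs(task: str, prompt: str) -> dict:
    """Return keyword arguments for client.chat.completions.create."""
    return {
        "model": MODEL_FOR_TASK.get(task, DEFAULT_MODEL),  # unknown tasks use the default
        "messages": [{"role": "user", "content": prompt}],
    }


code_call = completion_kwargs("code", "Write a binary search in Go.")
other_call = completion_kwargs("summarize", "Summarize this changelog.")
```

A call then looks like client.chat.completions.create(**code_call), and swapping the model behind a task is a one-line change to the table.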

What models and prices does Speka offer?

A sample of the catalog, with per-1M-token input/output pricing. Embedding models are priced on input tokens only; image models are priced per image.

| Model | ID | Category | Price (in / out per 1M) |
|---|---|---|---|
| Llama 3.3 70B Instruct | meta/llama-3.3-70b-instruct | chat | $0.20 / $0.20 |
| DeepSeek V4 Flash | deepseek-ai/deepseek-v4-flash | reasoning | $0.27 / $1.10 |
| Mistral Large 3 | mistralai/mistral-large-3-675b-instruct-2512 | chat | $0.90 / $2.70 |
| Kimi K2.6 | moonshotai/kimi-k2.6 | chat | $0.50 / $2.00 |
| GPT-OSS 120B | openai/gpt-oss-120b | code | $0.15 / $0.60 |
| Llama 3.1 8B | meta/llama-3.1-8b-instruct | chat | $0.05 / $0.05 |
| NV-EmbedQA E5 v5 | nvidia/nv-embedqa-e5-v5 | embeddings | $0.01 / 1M |
| FLUX.1 [dev] | (image) | image | $0.04 / image |

Model weights and provider details are public: Llama 3.3 70B, DeepSeek, GPT-OSS 120B, FLUX.1-dev, plus reasoning models from NVIDIA NIM. The complete 16-model catalog with context windows is at /models.
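To turn the per-1M-token prices in the table above into a per-request or per-job estimate, multiply token counts by the input and output rates separately. The sketch below does that for a few chat models; the estimate_cost helper is hypothetical and the token counts are made-up inputs.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "meta/llama-3.3-70b-instruct": (0.20, 0.20),
    "deepseek-ai/deepseek-v4-flash": (0.27, 1.10),
    "openai/gpt-oss-120b": (0.15, 0.60),
    "meta/llama-3.1-8b-instruct": (0.05, 0.05),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a given number of input and output tokens."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Example: 2M input tokens and 500k output tokens on DeepSeek V4 Flash.
cost = estimate_cost("deepseek-ai/deepseek-v4-flash", 2_000_000, 500_000)
print(f"${cost:.2f}")  # prints: $1.09
```

Output tokens dominate on models with asymmetric pricing, so the input/output split of your workload matters as much as the total.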

How is a gateway billed?

Speka bills usage-based per token. Every plan includes a monthly usage allowance; when you exceed it, overage bills at the standard per-token rates above with no penalty multiplier.

| Plan | Price | Usage included | Rate limit | Keys |
|---|---|---|---|---|
| Free | $0/mo | $1 (no card) | 10 rpm | 1 |
| Starter | $19/mo | $25 | 60 rpm | 5 |
| Pro | $99/mo | $150 | 300 rpm | 25 |
| Scale | $399/mo | $750 | 1200 rpm | 200 |

Pro targets 99.9% uptime; Scale adds SSO and audit logging on request. Full details: /pricing.
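A worked example makes the allowance model concrete. The sketch below assumes the monthly total is the plan price plus any usage above the included allowance, billed dollar for dollar, which is my reading of the description above. The Free plan is left out because this post does not say how overage works without a card on file, and the helper names are hypothetical.

```python
# name: (monthly price in USD, included usage in USD), from the plan table above.
PLANS = {
    "Starter": (19, 25),
    "Pro": (99, 150),
    "Scale": (399, 750),
}


def monthly_bill(plan: str, usage_usd: float) -> float:
    """Plan price plus overage, with overage billed at standard rates (no multiplier)."""
    price, included = PLANS[plan]
    overage = max(0.0, usage_usd - included)
    return price + overage


def cheapest_plan(usage_usd: float) -> str:
    """Paid plan with the lowest total bill for a given month of usage."""
    return min(PLANS, key=lambda name: monthly_bill(name, usage_usd))


print(monthly_bill("Starter", 40))  # 19 + (40 - 25) = 34.0
print(cheapest_plan(200))           # Pro: 99 + 50 = 149 beats Starter at 194
```

Under this reading, Starter and Pro cost the same at $105 of monthly usage, and Pro and Scale at $450. The comparison ignores rate limits and key counts, which may force a higher plan before price does.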

Frequently asked questions

Is an LLM API gateway OpenAI-compatible?

Most are, and Speka is fully OpenAI-compatible. You point the official OpenAI SDK (or any client speaking the Chat Completions schema) at https://speka.me/v1 with an sk-speka-live-... key, and your existing chat.completions.create calls work unchanged. Model ids are namespaced (for example meta/llama-3.3-70b-instruct), and streaming, tool calling, and JSON mode all use the standard OpenAI parameters.

What's the difference between a gateway and a hosted unified API?

In practice, none — they're the same category described from different angles. "Gateway" emphasizes the routing/fallback/auth layer in front of providers; "hosted unified API" emphasizes that you call one managed endpoint. OpenRouter, Together AI, and Speka all fit both labels. A self-hosted OpenAI-compatible server (vLLM, SGLang, TGI, Ollama) gives you a single schema but only for the models you run yourself.

Do I still need a gateway if I self-host with vLLM or Ollama?

Not necessarily. If you serve one model on your own hardware and never need a second provider or cross-provider failover, an OpenAI-compatible server like vLLM, SGLang, TGI, or Ollama already gives you one endpoint. A gateway becomes worthwhile once you call multiple models, want automatic fallback, or need consolidated billing and per-key spend tracking across providers you don't host.

How does failover work in a gateway?

Failover routes a request to a backup model when the primary returns an error or is rate-limited, so a single provider outage doesn't break your application. You define a primary model (say deepseek-ai/deepseek-v4-flash) and a fallback (say meta/llama-3.3-70b-instruct). Because the gateway normalizes every model to the same request and response schema, your client code handles both responses identically without branching.
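This post does not describe how fallback is configured on the gateway side, so here is a client-side sketch of the same idea: try models in order and return the first success. complete_with_fallback and fake_call are hypothetical names, and fake_call stands in for a real SDK call so the example runs without network access.

```python
from typing import Callable, Optional, Sequence, Tuple


def complete_with_fallback(
    call_model: Callable[[str, str], str],
    models: Sequence[str],
    prompt: str,
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, text) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # in real code, catch only rate-limit and server errors
            last_error = exc
    raise RuntimeError("all models failed") from last_error


def fake_call(model: str, prompt: str) -> str:
    # Stand-in for a real client call; simulates the primary being throttled.
    if model == "deepseek-ai/deepseek-v4-flash":
        raise TimeoutError("simulated rate limit on primary")
    return f"[{model}] ok"


used, text = complete_with_fallback(
    fake_call,
    ["deepseek-ai/deepseek-v4-flash", "meta/llama-3.3-70b-instruct"],
    "ping",
)
print(used)  # prints: meta/llama-3.3-70b-instruct
```

Because both models share one response schema, the caller only needs the returned text; the model name is returned as well so you can log how often the fallback fires.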

How much does an LLM API gateway cost?

Gateways are typically usage-based per token, often with a monthly allowance. On Speka, the Free plan includes $1 of usage with no credit card; Starter is $19/mo with $25 included; Pro is $99/mo with $150 included; Scale is $399/mo with $750 included. Overage bills at the standard per-token rates with no penalty multiplier. Per-model input/output pricing is listed on /pricing and each model page.

Can a gateway do embeddings and image generation too?

Yes, when the catalog includes those models. Speka serves embedding models (NV-EmbedQA E5 v5 at $0.01/1M, NV-Embed v1 at $0.016/1M) and image models (FLUX.1 [dev] at $0.04/image, FLUX.1 [schnell] at $0.02/image) through the same key and base URL as chat. That lets you build retrieval, ranking, and text-to-image workflows without adding separate providers or credentials.
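As a small illustration of the retrieval and ranking step, the sketch below orders documents by cosine similarity to a query. It assumes you have already fetched one vector per text from an embeddings endpoint; the two-dimensional vectors are toy values rather than real model output, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]  # toy embeddings
print(rank(query, docs))  # prints: [1, 2, 0]
```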

Try it on Speka

If you call more than one model — or expect to — a gateway removes the per-provider integration tax and gives you one bill to reason about. Speka is a curated, OpenAI-compatible gateway: 16 frontier models, transparent per-token pricing, and a two-line migration from the OpenAI SDK. Start on the Free plan with no credit card at /signup, browse the catalog at /models, and read the integration guide at /docs. More engineering write-ups are on the /blog.
