OpenAI-Compatible API: Ollama, vLLM, SGLang & TGI

Set up an OpenAI-compatible /v1/chat/completions endpoint with Ollama, vLLM, SGLang or TGI — exact commands, curl tests, a spec-coverage table, and when to switch to a hosted gateway.

Speka Engineering

Jun 11, 2026 · 9 min read

OpenAI-Compatible API Setup: Ollama, vLLM, SGLang & TGI

Last updated: June 2026

Key takeaways

All four servers — Ollama, vLLM, SGLang, and TGI — expose an OpenAI-compatible /v1/chat/completions endpoint, so you can point the official OpenAI SDK at them by changing only base_url and api_key.
vLLM and SGLang cover the largest share of the OpenAI Chat Completions spec (tools/function calling, JSON/structured outputs, streaming, logprobs); TGI's Messages API is close behind; Ollama is the simplest to run but the thinnest on advanced params.
The biggest hidden costs of self-hosting are GPU provisioning, batching/throughput tuning, and keeping multiple model servers patched — not the one-line launch command.
A managed gateway like Speka gives you one OpenAI-compatible base URL (https://speka.me/v1) across 16 frontier models with no GPUs to run.
You can start on Speka's free plan with $1 of usage included and no credit card, then keep the exact same client code in production.

What does "OpenAI-compatible" actually mean?

An "OpenAI-compatible" inference server implements the same HTTP contract as the OpenAI Chat Completions API: a POST /v1/chat/completions endpoint that accepts a JSON body with model, messages, and parameters like temperature, max_tokens, stream, and tools, and returns a response in OpenAI's choices[].message shape. Because the wire format matches, the OpenAI Python SDK and tools built on it (LangChain, LlamaIndex, n8n) work against your own server by overriding base_url and api_key. Compatibility is rarely 100%: the chat endpoint is well-covered everywhere, but newer surfaces (the Responses API, Assistants, Batch, Files) usually are not.
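As a rough illustration of that contract, the sketch below builds the request with only the Python standard library and reads the reply from choices[0].message.content. The helper names (build_chat_request, extract_text) and the sample reply are made up for this example, and the base URL is whichever server you run.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **params):
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_text(response_json):
    """Return the assistant text from an OpenAI-shaped response dict."""
    return response_json["choices"][0]["message"]["content"]


# Hypothetical reply, trimmed to the fields this sketch reads.
sample_reply = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi"},
            "finish_reason": "stop",
        }
    ]
}

req = build_chat_request(
    "http://localhost:8000/v1",
    "sk-local-dev",
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Say hi in one word."}],
    temperature=0.2,
)
```

To send it, pass req to urllib.request.urlopen and feed the parsed JSON body to extract_text. Switching servers only changes the first three arguments.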

Below are the exact launch commands and a curl test for each server, followed by a spec-coverage table and guidance on when to stop self-hosting.

How do I run an OpenAI-compatible endpoint with Ollama?

Ollama is the lowest-friction option for local development. After installing it, the daemon already exposes /v1 on port 11434; you just need to pull a model.

# Start the server (skip this if Ollama is already running as a service)
ollama serve &

# Pull a model
ollama pull llama3.1:8b

Test the OpenAI-compatible endpoint with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hi in one word."}]
  }'

Ollama's OpenAI compatibility layer supports chat, completions, embeddings, streaming, and basic tool calling. The API key is ignored locally (pass any string), which is convenient for dev but means you must add your own auth before exposing it. It is built for single-node, low-concurrency use — great for laptops and prototypes, not for serving production traffic at scale.
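Because Ollama ignores the key, whatever sits in front of it has to check it. Here is a minimal sketch of the check a proxy or middleware could perform on the Authorization header. The function name and the single shared key are assumptions for illustration, not anything Ollama provides.

```python
import hmac


def is_authorized(authorization_header, expected_key):
    """Check an 'Authorization: Bearer <key>' header value against one expected key."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())
```

A real deployment would more likely do this in the reverse proxy itself and keep one key per client, but the comparison logic is the same.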

How do I run an OpenAI-compatible endpoint with vLLM?

vLLM is the throughput-oriented choice. Its server is OpenAI-compatible out of the box and uses continuous batching and paged attention to maximize GPU utilization.

pip install vllm

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --api-key sk-local-dev

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-local-dev" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Return JSON: {\"ok\": true}"}],
    "response_format": {"type": "json_object"}
  }'

vLLM supports tool/function calling (enabled with --enable-auto-tool-choice plus a per-model --tool-call-parser), guided/structured decoding, logprobs, and SSE streaming. It is the server most likely to match a given OpenAI parameter, at the cost of a real GPU and some launch-flag tuning.
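For tool calling, the request carries a tools array and the reply carries tool_calls whose arguments are a JSON-encoded string. The sketch below shows both sides in the OpenAI shape. The get_weather tool and the sample reply are hypothetical, and what a given model actually emits depends on the server and its parser.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",
}


def parse_tool_calls(response_json):
    """Return [(function_name, arguments_dict), ...] from an OpenAI-shaped reply."""
    message = response_json["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written reply for illustration; a real server's output may differ.
sample_reply = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_1",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"city\": \"Oslo\"}",
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ]
}
```

Calling parse_tool_calls(sample_reply) gives one (name, arguments) pair, and a plain text reply with no tool_calls gives an empty list.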

How do I run an OpenAI-compatible endpoint with SGLang?

SGLang targets high-throughput serving with aggressive prefix caching (RadixAttention) and is strong for agentic and structured-output workloads. Its server speaks the OpenAI API; see the SGLang docs for the full flag set.

pip install "sglang[all]"

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 30000

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "List 3 primes."}],
    "stream": true
  }'

SGLang covers chat completions, function calling, JSON/grammar-constrained outputs, and streaming, and tends to shine when many requests share long common prefixes (system prompts, few-shot examples, RAG context).
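With "stream": true, the reply arrives as Server-Sent Events: a series of data: lines, each holding a JSON chunk with a delta, ending with data: [DONE]. The parser below is a minimal sketch of how a client might reassemble the text. The function name and the sample lines are hand-written for illustration.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (one str per line)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written stream for illustration.
sample_stream = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "2, 3"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": ", 5"}}]}',
    "",
    "data: [DONE]",
]

streamed = "".join(iter_stream_text(sample_stream))
```

The OpenAI SDK does this parsing for you when you pass stream=True. The sketch is mainly useful when you read the HTTP response yourself or debug a proxy.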

How do I run an OpenAI-compatible endpoint with TGI?

Hugging Face's Text Generation Inference (TGI) ships a Messages API that mirrors OpenAI Chat Completions, and it is easiest to run via its Docker image.

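# The container does not see a host Hugging Face login, so pass a token for the gated Llama weights
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct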

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Define idempotent in one line."}],
    "max_tokens": 64
  }'

TGI supports chat via the Messages API, streaming, tool calling, and guided generation. Note the model field is typically "tgi" since each container serves one model; routing happens at the deployment level, not via the request body.

Spec coverage: which server supports what?

Coverage moves fast across releases, so verify against current docs before you commit. As of June 2026:

Capability	Ollama	vLLM	SGLang	TGI
`/v1/chat/completions`	Yes	Yes	Yes	Yes (Messages API)
Streaming (SSE)	Yes	Yes	Yes	Yes
Tool / function calling	Basic	Yes	Yes	Yes
JSON / structured output	Partial	Yes	Yes	Yes
`logprobs`	Limited	Yes	Yes	Partial
Embeddings endpoint	Yes	Yes	Yes	Yes (separate)
Vision (image input)	Model-dependent	Yes	Yes	Model-dependent
Built-in API-key auth	No (ignored)	Yes	Optional	Reverse-proxy
Multi-model on one port	Manual	One per server	One per server	One per container
Primary strength	Local simplicity	Throughput	Prefix caching	HF ecosystem

The recurring theme: a single server instance serves a single model on a single endpoint. Hosting several models means running several processes, sizing GPUs for each, and putting a router and auth layer in front. Server-Sent Events handling and Bearer token auth (RFC 6750) are also yours to operate.
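The router in front can start as little more than a lookup from the request's model field to an upstream URL. This is a sketch under that assumption. ROUTES and resolve_upstream are hypothetical names, and the ports are the ones used in the launch commands earlier in this article.

```python
# Hypothetical routing table: one upstream per model, because each server
# instance serves a single model.
ROUTES = {
    "llama3.1:8b": "http://localhost:11434/v1",                      # Ollama
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8000/v1",  # vLLM
    "tgi": "http://localhost:8080/v1",                               # TGI
}


def resolve_upstream(request_body):
    """Map a chat request body to the upstream chat-completions URL."""
    model = request_body.get("model")
    try:
        return ROUTES[model] + "/chat/completions"
    except KeyError:
        raise ValueError(f"no upstream configured for model {model!r}") from None
```

Health checks, retries, and per-model GPU sizing all sit on top of this lookup, which is where most of the operational work goes.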

When should I stop self-hosting and use a managed gateway?

Self-host when you need full control of weights, on-prem data residency, custom kernels, or fine-tuned checkpoints — and you have the GPU budget and ops time to keep it running. Switch to a hosted, OpenAI-compatible gateway when you want predictable per-token cost, no GPU provisioning, and one endpoint across many models.

Speka is that gateway. It exposes the same OpenAI contract at https://speka.me/v1, so the OpenAI SDK is a drop-in: change base_url and the key, leave the rest. One key reaches 16 frontier models from 7 labs — DeepSeek, NVIDIA, Meta, Mistral AI, Moonshot AI, OpenAI, and Black Forest Labs — with native tool calling, JSON mode, streaming, embeddings, and image generation.

from openai import OpenAI

client = OpenAI(
    base_url="https://speka.me/v1",
    api_key="sk-speka-live-...",
)

resp = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in 2 sentences."}],
)
print(resp.choices[0].message.content)

Swap the model id to run different workloads on the same client: deepseek-ai/deepseek-v4-flash for reasoning at $0.27/$1.10 per 1M tokens with a 128K context, openai/gpt-oss-120b for code at $0.15/$0.60, or black-forest-labs/flux-1-dev for image generation at $0.04/image. The full catalog and per-token rates live on the models and pricing pages.
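To turn those rates into a per-request or per-month figure, multiply token counts by the per-million price. The sketch below assumes the first quoted figure is the input rate and the second is the output rate, which the text above does not state explicitly. RATES and token_cost are illustrative names.

```python
from decimal import Decimal

# (input, output) USD per 1M tokens, read from the rates quoted above.
# Treating the first figure as input and the second as output is an assumption.
RATES = {
    "deepseek-ai/deepseek-v4-flash": (Decimal("0.27"), Decimal("1.10")),
    "openai/gpt-oss-120b": (Decimal("0.15"), Decimal("0.60")),
}

MILLION = Decimal(1_000_000)


def token_cost(model, input_tokens, output_tokens):
    """Return the USD cost of a workload at the model's per-1M-token rates."""
    rate_in, rate_out = RATES[model]
    return (rate_in * input_tokens + rate_out * output_tokens) / MILLION


# 2M input tokens and 500K output tokens on the DeepSeek model.
cost = token_cost("deepseek-ai/deepseek-v4-flash", 2_000_000, 500_000)
```

Under that assumption the example workload comes to $1.09 (0.54 for input plus 0.55 for output). Decimal keeps the arithmetic exact where floats would introduce rounding noise.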

The same base_url switch works with the LangChain ChatOpenAI integration, LlamaIndex, and n8n — anything that already speaks the OpenAI API.

How does a hosted gateway compare to other aggregators?

Speka is a focused catalog: 16 vetted, real models with white-labeled namespaced ids and flat usage-based pricing (every plan includes a monthly allowance; overage bills at standard per-token rates with no penalties). For comparison, as of June 2026, OpenRouter advertises 300+ models across many providers with built-in analytics and a free-model catalog, and Together AI advertises 200+ open and partner models on its own OpenAI-compatible API. Broader marketplaces gain catalog size at the cost of more variability in per-model behavior; a smaller curated set gives up breadth for predictable coverage and pricing. Pick based on whether you need maximum model selection or a stable, known set.

Frequently asked questions

Can I use the OpenAI Python SDK with Ollama, vLLM, SGLang, or TGI?

Yes. All four implement the OpenAI Chat Completions wire format, so the official OpenAI Python SDK works after you set base_url to the server's /v1 URL and pass an api_key (Ollama ignores it locally; vLLM and others can enforce it). Advanced parameters like tools or response_format depend on the specific server and model supporting them.
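One way to keep that switch in a single place is a small table of client settings per backend. The sketch below uses the ports from this article's launch commands. BACKENDS and client_kwargs are hypothetical names, and the key strings are placeholders, included because the SDK expects some key value even when the server ignores it.

```python
# Base URLs use the ports from this article's launch commands.
# The key values are placeholders; Ollama ignores its key locally.
BACKENDS = {
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
    "vllm": {"base_url": "http://localhost:8000/v1", "api_key": "sk-local-dev"},
    "sglang": {"base_url": "http://localhost:30000/v1", "api_key": "none"},
    "tgi": {"base_url": "http://localhost:8080/v1", "api_key": "none"},
}


def client_kwargs(backend):
    """Return a copy of the keyword arguments to pass to OpenAI(...)."""
    return dict(BACKENDS[backend])
```

With the openai package installed, OpenAI(**client_kwargs("vllm")) gives a client for the vLLM server, and swapping the backend name is the only change needed to target another one.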

Which self-hosted server has the best OpenAI spec coverage?

As of June 2026, vLLM and SGLang generally cover the largest share of the Chat Completions spec (tool calling, structured/JSON outputs, logprobs, and streaming), making them strong production choices. TGI's Messages API is close behind and integrates tightly with the Hugging Face ecosystem. Ollama covers the core endpoints well but is thinner on advanced parameters. Always verify against current release docs.

Do these servers support tool and function calling?

Yes, with caveats. vLLM, SGLang, and TGI support OpenAI-style tools/tool_choice, though vLLM needs --enable-auto-tool-choice plus a per-model --tool-call-parser, and behavior is model-dependent. Ollama supports basic tool calling. On Speka, native tool/function calling is available across supported chat models through the same OpenAI request shape, with no parser flags to configure.

How is Speka different from running my own model server?

Speka removes the GPU and ops layer. Instead of provisioning hardware, tuning batching, and running one process per model, you call one OpenAI-compatible base URL (https://speka.me/v1) that fronts 16 models. You keep your existing OpenAI SDK code, pay per token with a monthly allowance, and avoid patching and scaling several inference servers yourself.

What does it cost to start on Speka?

The free plan is $0/month, includes $1 of usage, requires no credit card, and allows 10 requests per minute with one API key. Paid tiers add headroom: Starter ($19/mo, $25 included, 60 rpm), Pro ($99/mo, $150 included, 300 rpm, 99.9% uptime target), and Scale ($399/mo, $750 included, 1200 rpm). All bill overage at standard per-token rates with no penalties.
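For a rough monthly estimate on the paid tiers, one reading of that pricing is the plan fee plus any usage beyond the included allowance. The sketch below encodes that reading. It is an assumption about how the bill is computed, not Speka's published formula, and PLANS and monthly_bill are illustrative names.

```python
from decimal import Decimal

# (monthly fee, included usage) in USD, from the plan list above.
PLANS = {
    "starter": (Decimal("19"), Decimal("25")),
    "pro": (Decimal("99"), Decimal("150")),
    "scale": (Decimal("399"), Decimal("750")),
}


def monthly_bill(plan, usage_usd):
    """Fee plus usage beyond the included allowance (assumed billing model)."""
    fee, included = PLANS[plan]
    overage = max(Decimal("0"), Decimal(str(usage_usd)) - included)
    return fee + overage
```

Under this model, $20 of usage on Starter costs $19, and $40 of usage costs $34 (the $19 fee plus $15 of overage).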

Which models can I call through Speka's OpenAI-compatible API?

Sixteen models from seven labs, including Llama 3.3 70B Instruct, Llama 4 Maverick, Mistral Large 3, Kimi K2.6, the DeepSeek V4 Flash and NVIDIA Nemotron reasoning models, GPT-OSS 120B for code, Llama 3.2 Vision, NV-EmbedQA embeddings, and FLUX.1 for images. Each has its own page under /models with ids and rates.

Try it on Speka

If you want OpenAI-compatible inference without running GPUs, create a free Speka account — $1 of usage, no credit card — point your OpenAI client at https://speka.me/v1, and call any of the 16 models with the code you already have. When you outgrow the free tier, the same key scales through the paid plans on the pricing page. More guides are on the Speka blog.