
Providers

Caboose supports 15+ LLM providers out of the box. All providers use raw HTTP with server-sent events (SSE) for streaming — no SDK dependencies. This keeps the binary small and behavior consistent across providers.
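Since the transport is plain SSE, the stream can be consumed with nothing more than a line parser. A minimal sketch, assuming OpenAI-style `data:` frames and a `[DONE]` sentinel (the payload shape varies by provider):

```python
import json

def parse_sse(lines):
    """Yield decoded JSON payloads from server-sent event lines.

    SSE frames look like `data: {...}`; OpenAI-style APIs end the
    stream with a literal `data: [DONE]` sentinel.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, event: fields, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# Example: extract streamed text deltas from OpenAI-style chunks
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in parse_sse(stream)
)
# text == "Hello"
```

Parsing at the line level like this is what makes one code path work across every provider that speaks SSE.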

| Provider | Notes |
| --- | --- |
| Anthropic | Primary provider. Extended thinking support. Prompt caching on system prompts and tools. |
| OpenAI | Full tool use. Reasoning models (o1, o3, o4) supported. |
| Gemini | Fast inference, large context. Gemini 2.5 models support thinking. |
| OpenRouter | Aggregator with access to hundreds of models across providers. |
| xAI | Grok models. |
| Together AI | Open-source models at scale. |
| Fireworks AI | Fast inference for open-source models. |
| Cerebras | Ultra-fast inference on Cerebras hardware. |
| SambaNova | High-throughput inference. |
| Perplexity | Search-augmented models. |
| Cohere | Command models with strong RAG capabilities. |
| Qwen | Alibaba’s DashScope models. |
| DeepSeek | Cost-effective for bulk tasks. Reasoning content supported. |
| Groq | Ultra-low latency. |
| Mistral | EU-hosted option. |

Each provider entry in the model catalog includes the context window size, tool support status, and pricing information.

Caboose auto-discovers local inference servers on startup by probing common ports. If a server is found, it appears in the model picker immediately.

| Provider | Default Port | Notes |
| --- | --- | --- |
| Ollama | 11434 | ollama serve — auto-discovered |
| LM Studio | 1234 | Enable local server in LM Studio settings |
| llama.cpp | 8080 | llama-server --port 8080 |
| Custom | any | Any OpenAI-compatible server |
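The discovery step amounts to a quick TCP probe of the ports above. A sketch of the idea (function names and timeout are illustrative, not Caboose's actual internals):

```python
import socket

# Ports conventionally used by local inference servers (assumed mapping)
COMMON_PORTS = {"ollama": 11434, "lmstudio": 1234, "llama.cpp": 8080}

def port_open(host, port, timeout=0.2):
    """True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def discover_local_servers(host="127.0.0.1"):
    """Names of local inference servers that appear to be running."""
    return [name for name, port in COMMON_PORTS.items()
            if port_open(host, port)]
```

A real implementation would follow a successful probe with a request to /v1/models to confirm the listener is actually an inference server, as described below.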

Ollama, LM Studio, llama.cpp, and a custom server option are pinned at the top of the model picker. Select one to open the connect dialog, enter the address, and Caboose fetches available models from /v1/models. The session remembers manually connected servers.

You can also configure local providers directly in your config:

[[local_providers]]
name = "ollama"
base_url = "http://localhost:11434"
default_model = "llama3.2"

[[local_providers]]
name = "lmstudio"
base_url = "http://localhost:1234"

Switch the active model at any time during a session:

  • Type /model and select from the list
  • Press Ctrl+M to open the model picker directly

The model picker groups providers under collapsible headers. The active provider is expanded by default; others are collapsed. Press Enter on a header to expand or collapse it. OpenRouter and any configured third-party providers are always shown regardless of which provider is active.

The change takes effect on the next message. Your selection is saved to config automatically.

Cloud providers require an API key. Use /connect to enter one:

/connect anthropic
/connect openai

Keys are stored locally and never transmitted anywhere except the provider’s own API endpoint. You can also set keys via environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.).

When Caboose starts, it resolves which provider to use in this order:

  1. CLI flag — --provider anthropic or --model claude-sonnet-4
  2. Per-provider config — set in your project or global configuration
  3. Global default — the default_provider field in your config
  4. Fallback — "anthropic" if nothing else is configured
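This precedence is a simple first-match walk. A minimal sketch (argument names are illustrative, not Caboose's actual internals):

```python
def resolve_provider(cli_flag=None, project_cfg=None, global_default=None):
    """Return the first configured provider, mirroring the order above."""
    for candidate in (cli_flag, project_cfg, global_default):
        if candidate:
            return candidate
    return "anthropic"  # documented fallback
```

For example, `resolve_provider(global_default="openai")` yields "openai", while calling it with no configuration at all falls back to "anthropic".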

Models that support extended thinking can be configured with /reasoning or toggled with Ctrl+T. Four levels are available: Off, Low, Medium, High. Provider-native mapping:

  • Anthropic — budget_tokens parameter
  • OpenAI — reasoning_effort (o1/o3/o4 models)
  • Gemini — thinkingConfig (gemini-2.5 models)
  • OpenRouter / DeepSeek — reasoning_content field in streaming responses
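The mapping above can be sketched as a small translation function. The token budgets here are illustrative assumptions — the actual values Caboose uses are not documented in this section:

```python
# Illustrative token budgets per level (assumed, not Caboose's real values)
THINKING_BUDGETS = {"low": 4096, "medium": 16384, "high": 32768}

def reasoning_params(provider, level):
    """Translate a UI reasoning level into provider-native request fields."""
    if level == "off":
        return {}
    if provider == "anthropic":
        return {"thinking": {"type": "enabled",
                             "budget_tokens": THINKING_BUDGETS[level]}}
    if provider == "openai":
        # o1/o3/o4 models accept "low" | "medium" | "high" directly
        return {"reasoning_effort": level}
    if provider == "gemini":
        return {"thinkingConfig": {"thinkingBudget": THINKING_BUDGETS[level]}}
    return {}
```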

Thinking blocks stream in real time and are rendered as collapsible sections in chat. The thinking toggle is hidden for models that don’t support it.

All providers are wrapped in automatic exponential backoff on transient failures (rate limits, network errors, 5xx responses). Retry attempts are logged to the session.
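The retry policy amounts to a standard backoff-with-jitter loop. A hypothetical sketch — attempt counts and delays are illustrative, not Caboose's actual limits:

```python
import random
import time

def with_backoff(call, retries=5, base=0.5, transient=(Exception,)):
    """Retry `call` with exponential backoff plus jitter on transient errors."""
    for attempt in range(retries):
        try:
            return call()
        except transient as exc:
            if attempt == retries - 1:
                raise  # out of attempts; surface the error
            delay = base * (2 ** attempt) + random.uniform(0, base)
            time.sleep(delay)
```

The jitter term spreads out retries from concurrent requests so they don't hammer a rate-limited endpoint in lockstep.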

Caboose uses provider-native prompt caching where available:

  • Anthropic — cache_control breakpoints on system prompts and tool definitions
  • OpenAI — automatic prefix caching
  • Other providers — enabled automatically where the API supports it

Cache hit/miss statistics appear in the token usage display after each response.
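The Anthropic cache_control breakpoints above can be sketched as a transform over the static prefix of a request body (the function name is illustrative; the field shapes follow the Anthropic Messages API):

```python
def with_cache_breakpoints(system_prompt, tools):
    """Attach Anthropic-style cache_control markers to the static prefix
    (system prompt and tool definitions) of a request body."""
    body = {
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "tools": list(tools),
    }
    if body["tools"]:
        # Marking only the last tool caches the whole tool-definition prefix.
        body["tools"][-1] = {**body["tools"][-1],
                             "cache_control": {"type": "ephemeral"}}
    return body
```

Because caching is prefix-based, keeping the system prompt and tool definitions byte-stable across turns is what makes subsequent requests hit the cache.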