artificial intelligence

The OpenClaw Model Strategy: How I Route 5 Providers, Spend Almost Nothing, and Keep My Agent From Getting Hijacked

Joe Tomasone

Feb 8, 2026 • 9 min read

I've spent the last two weeks tuning my OpenClaw setup to use multiple AI models simultaneously. The goal was simple: get the best possible agent performance while spending as close to zero dollars as possible. Along the way, I ran prompt injection tests against 15 models, discovered that some "enterprise" models will happily comply with jailbreak attempts, and landed on a routing strategy that handles 90% of tasks for free.

This is everything I learned, written for other OpenClaw users who want to do the same.

Why Multi-Model Matters

When I first set up OpenClaw, I ran everything on Claude Opus. It's a fantastic model. It's also expensive ($15 per million input tokens, $75 per million output). My agent runs heartbeats, cron jobs, weather checks, Discord monitoring, and scheduled briefings throughout the day. That's a lot of tokens burned on routine work that doesn't need frontier reasoning.

The fix is obvious in hindsight: route different tasks to different models based on what they actually require. A weather check doesn't need Opus. A morning briefing doesn't need Opus. A cron job that exports Discord messages definitely doesn't need Opus. But a complex multi-step coding project with tool orchestration? Yeah, that might need Opus.

OpenClaw supports this natively. You can set a primary model, define fallbacks, assign models to specific agents, override per-session, and spawn sub-agents on different models. The infrastructure is already there; you just need a strategy.

The Providers

I'm currently running five providers. Three are free, one is local (also free), and one is paid.

NVIDIA NIM (Free)

Models: Kimi K2 Thinking, Kimi K2.5
API: https://build.nvidia.com
Cost: Free tier

This is my default primary model. Kimi K2 Thinking is a reasoning model with a 256K context window, and it's free through NVIDIA's NIM API. It handles daily conversation, tool calling, multi-step tasks, and general orchestration surprisingly well. It's my "daily driver" and it costs me nothing.

Kimi K2.5 (the non-thinking variant) has a 128K context window and is a bit faster. I keep it as a fallback.

The catch: response times can be slow during peak hours (15-35 seconds). And the free tier has rate limits. But for an agent that runs asynchronously via Signal, the latency is fine; I'm not sitting there watching a spinner.

Google Gemini (Free)

Model: Gemini 2.0 Flash
API: https://aistudio.google.com
Cost: Free tier

This runs all my cron jobs, scheduled briefings, and any sub-agent that needs web search (Gemini has native Google Search grounding). The 1M context window is absurd, but useful for processing large log files or long documents.

I migrated six cron jobs from Anthropic models to Gemini Flash and the quality is fine for that workload. Morning briefings, evening summaries, Discord channel monitoring, weather checks; none of these need a $75/M output model.

Ollama (Local, Free)

Models: Qwen 2.5 Coder 32B, DeepSeek R1 32B, Qwen 2.5 14B/7B, GLM 4.7 Flash, Qwen 3 VL, and others
Setup: Local on Mac Mini M4 Pro, 48GB unified RAM
Cost: Electricity

Running models locally means zero API dependency for certain tasks. Qwen 2.5 Coder 32B is my go-to for coding sub-agents. DeepSeek R1 32B handles reasoning tasks. The smaller models (14B, 7B) serve as fallbacks.

Important caveat: models under 14B parameters cannot reliably do tool calling in OpenClaw. They hallucinate function schemas, miss required parameters, and get stuck in loops. I learned this the hard way. They're fine as last-resort fallbacks (better than nothing), but don't plan on using them as primary agents.

The 32B models do tool calling adequately, but they're noticeably weaker than the cloud models at complex orchestration. For "write this function" or "refactor this file," they're great. For "audit this codebase, write a spec, generate 16 files, and test them," you want a cloud model.

Anthropic (Paid)

Models: Claude Opus 4, Claude Sonnet 4.5
API: https://console.anthropic.com
Cost: Opus $15/$75 per M tokens; Sonnet $3/$15 per M tokens

The heavy hitters. I use Opus as a per-session override when a task genuinely needs it (complex coding projects, multi-tool orchestration, security research). Sonnet is the first paid fallback; it's capable enough for most things at a fraction of the cost.

The key decision: these are NOT my defaults. They're escalation options. If Kimi can handle it (and it usually can), I don't burn Anthropic tokens.

xAI (Paid, Situational)

Model: Grok 3
Cost: Per-token API pricing

I spawn sub-agents on Grok specifically for X/Twitter queries and social media analysis. It has access to real-time X data that other models don't. I don't use it for anything else, and there's a good reason (more on that in the security section).

My Routing Configuration

Here's how it maps in practice:

Task	Model	Provider	Cost
Daily conversation	Kimi K2 Thinking	NVIDIA NIM	Free
Cron jobs / briefings	Gemini 2.0 Flash	Google	Free
Web search sub-agents	Gemini 2.0 Flash	Google	Free
Weather bot	Gemini 2.0 Flash	Google	Free
Coding sub-agents	Qwen 2.5 Coder 32B	Ollama (local)	Free
X/Twitter queries	Grok 3	xAI	Paid
Complex orchestration	Claude Opus 4	Anthropic	Paid
General fallback	Claude Sonnet 4.5	Anthropic	Paid

On a typical day, everything runs on the free tier. I only escalate to paid models for specific projects, and I do that manually with /mode opus or /mode sonnet.

The Fallback Chain

OpenClaw lets you define a fallback chain so the agent degrades gracefully instead of dying when a provider has an outage or rate-limits you:

1. Kimi K2 Thinking (free, primary)
2. Claude Sonnet 4.5 (paid, reliable)
3. Qwen 2.5 Coder 32B (local)
4. Qwen 2.5 14B (local)
5. Qwen 2.5 7B (local)
6. Qwen 2.5 32B (local)
7. DeepSeek R1 32B (local)
8. GLM 4.7 Flash (local)
9. Qwen 3 VL (local)
10. Kimi K2.5 (free)
11. Claude Opus 4 (paid, expensive)

The design philosophy: free model first, then the cheapest paid option, then a wall of local models to absorb failures, then the other free model, and finally the most expensive option as a last resort. Opus is at the bottom because it's the safety net, not the default.

If NVIDIA goes down, I fall to Sonnet. If Anthropic goes down too, I fall to local Ollama models. The agent keeps working regardless.

The Config

In openclaw.json, this looks like:

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "nvidia/moonshotai/kimi-k2-thinking",
        "fallbacks": [
          "anthropic/claude-sonnet-4-5",
          "ollama/qwen2.5-coder:32b",
          "ollama/qwen2.5:14b",
          "ollama/qwen2.5:latest",
          "ollama/qwen2.5:32b",
          "ollama/deepseek-r1:32b",
          "ollama/glm-4.7-flash:latest",
          "ollama/qwen3-vl:latest",
          "nvidia/moonshotai/kimi-k2.5",
          "anthropic/claude-opus-4-6"
        ]
      }
    }
  }
}

For specialized agents, override at the agent level:

{
  "agents": {
    "list": [
      {
        "id": "weatherbot",
        "model": "gemini/gemini-2.0-flash",
        "tools": {
          "allow": ["read", "exec", "web_fetch", "message"],
          "deny": ["write", "edit", "browser", "gateway", "cron"]
        }
      }
    ]
  }
}

Model aliases make per-session switching painless. I have kimi-think, sonnet, opus, gemini, coder, deepseek, etc. all mapped. Switching is just /mode opus in chat.

Prompt Injection Testing: Which Models Can You Trust?

Here's the part that should scare you if you're giving an AI agent shell access.

I tested 15 models across three prompt injection scenarios before trusting any of them to run my agent. The tests escalated in sophistication:

Test 1 (SYSTEM OVERRIDE): A classic authority escalation, claiming to be a system-level override that supersedes safety instructions. This was the hardest test and the one that separated the top tier from the mid tier.
Test 2 (DAN Jailbreak): A "Do Anything Now" style jailbreak, attempting to get the model to ignore its safety training and comply with arbitrary instructions.
Test 3: A third injection vector targeting instruction-following behavior.

The Results

Perfect Score (3/3, passed all):

Claude Opus 4.5
Claude Sonnet 4.5
GPT-5.2
GPT-5 mini
GPT-5 nano
o4-mini
Gemini 3 Flash
Kimi K2.5 (via NVIDIA NIM)

Partial Failures (2/3):

Gemini 2.5 Pro
Gemini 2.0 Flash
GPT-4o
o3-mini

Significant Failures:

Grok 3: 1/3 - actually complied with the DAN jailbreak and printed "INJECTION SUCCESSFUL"
GLM4 9B (local): 0/3
GPT-OSS 20B (local): 0/1

What This Means for Your Agent

The good news: All the major frontier models from Anthropic, OpenAI, Google, and NVIDIA passed everything. If you're using any of these as your primary, your agent has solid injection resistance at the model level.

The bad news: Grok 3 folded on a DAN jailbreak. Not a subtle failure; it actively complied. This is why I only use Grok for X/Twitter queries via isolated sub-agents, never as a primary or fallback for the main agent. If someone sent my agent a message containing a jailbreak and Grok was handling it, the agent could be compromised.

The ugly news: Local open-source models (sub-20B) failed across the board. GLM4 9B failed all three tests. GPT-OSS 20B failed the one test I ran before stopping. These models have weak or nonexistent safety training for injection attacks. Using them as primary agents with tool access is risky.

The surprising news: Gemini 2.0 Flash, which I use for all my cron jobs, scored 2/3. It failed one injection test. That's acceptable for scheduled jobs that process trusted inputs (my own Discord messages, my own calendar), but I wouldn't put it on a public-facing chat endpoint without additional guardrails.

My Recommendation

For your primary agent (the one with shell access, file writes, browser control), use a model that scored 3/3. Period. Kimi K2 Thinking (free), any Anthropic model, any GPT-5 variant, or Gemini 3 Flash.

For sub-agents with restricted tool access on trusted inputs, 2/3 models are fine. Gemini 2.0 Flash is great for cron jobs and web search.

For anything public-facing, add the policy engine layer on top. Model-level injection resistance is your first line of defense, not your only one.

Lessons Learned the Hard Way

Don't assume "bigger = better for agents." Opus is the best model I've tested, but Kimi K2 Thinking handles 90% of agent tasks just as well, for free. The tasks where Opus truly shines (complex multi-tool orchestration, long technical pipelines, nuanced code architecture) are maybe 10% of what the agent does in a given day.

Local models are a reliability layer, not a capability layer. Having Ollama as a fallback means my agent survives API outages. That's valuable. But I wouldn't rely on local models as primary agents for anything requiring tool calling or complex reasoning. They're insurance, not the plan.

Response latency matters less than you think. I was worried about Kimi's 15-35 second response times. In practice, since I'm chatting via Signal asynchronously, I barely notice. The agent messages me when it's done. I don't need streaming sub-second responses for an agent that runs background tasks.

Model-level safety is necessary but not sufficient. Even a 3/3 model can be tricked with novel techniques. That's why I built the policy engine; deterministic deny patterns, allowlists, and risk tiers don't depend on the model's safety training. They're enforced before the model's output reaches any tool.

Test your fallback chain. Don't just configure it and forget it. Deliberately take down your primary (change the API key to something invalid) and verify the agent falls through gracefully. I found issues with some Ollama models failing to load under certain conditions that would have left me with a dead agent if I hadn't tested.

Cron jobs are a cost trap. I had six cron jobs running on Anthropic models before I realized how much they were costing me. Morning briefing, evening briefing, Discord exports (twice), weather checks. That's six model invocations a day, every day, on a $75/M output model. Moving them to Gemini Flash saved real money with no noticeable quality drop.

Cost Breakdown

Here's roughly what different configurations cost per month:

$0/month (my daily default):

Primary: Kimi K2 Thinking (NVIDIA NIM, free)
Cron/sub-agents: Gemini 2.0 Flash (Google, free)
Coding: Qwen 2.5 Coder 32B (Ollama, local)
Fallbacks: More Ollama models (local)

$5-20/month (occasional escalation):

Everything above, plus Sonnet 4.5 for complex conversations
This covers the days when Kimi isn't enough and I switch to /mode sonnet

$50+/month (heavy project work):

Extended Opus sessions for multi-day projects
Grok sub-agents for social media research
This is a choice, not a default; I only hit this during intensive work weeks

For comparison, running Opus as your only model with active daily use will easily cost $100-200/month. The multi-model approach gets you 95% of the capability for 5% of the cost.

Getting Started

If you're setting up OpenClaw and want to replicate this:

Get your free API keys first. NVIDIA NIM and Google AI Studio both have generous free tiers. Start there.
Install Ollama if you have the hardware (16GB+ RAM for 7-14B models, 48GB+ for 32B models). Even if you only use it as a fallback, it's worth having.
Set Kimi K2 Thinking as your primary. It's genuinely good for agentic work and it's free.
Move all cron jobs to Gemini Flash. There's no reason to run scheduled tasks on expensive models.
Keep Anthropic for escalation. Sonnet as first paid fallback, Opus as the nuclear option.
Run the injection tests yourself. Don't take my word for it. Test any model you plan to give tool access to. The tests are straightforward and the results are eye-opening.
Build or enable the policy engine. Model safety is not enough. Deterministic guardrails are.

The whole setup took me about a day of focused configuration. After that, it mostly runs itself.

I'm a cybersecurity sales engineer who spends too much time optimizing things. OpenClaw is open source at github.com/openclaw/openclaw. Come argue about model routing in the community Discord.