My OpenClaw Experience: An AI Agent That Actually Does Stuff
I've been running OpenClaw - first on an Ubuntu VM on my local ESXi system, and then on a Mac Mini M4 Pro - for about two weeks now. In that time, my AI agent (named Clawd) has reverse-engineered hotel lock cards, built a security plugin for itself, set up severe weather alerts for me, collaborated on a coding project with ChatGPT, filed a PR to an open-source hardware project, and turned a manual workflow into a fully automated one. I didn't expect any of that when I started.
Here's what happened and why I think this thing is special.
What OpenClaw Actually Is
Most "AI assistants" are glorified chat windows. You type, they respond, you copy-paste the output somewhere useful. OpenClaw is different. It runs locally on your machine and gives a language model real tools: filesystem access, a shell, a web browser, APIs, scheduled jobs, messaging platforms. It doesn't just talk about doing things. It does them.
The agent connects to Signal, Telegram, and Discord, so I can message it from my phone like I would a coworker. It has a memory system (daily notes, long-term memory files, git-tracked workspace) so it maintains context across conversations. When the context window fills up and resets, the agent reads its own notes to pick up where it left off.
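To make that concrete, here's a minimal Python sketch of the pattern. The folder names and file layout are my own assumptions for illustration, not OpenClaw's actual schema: after a reset, the agent just re-reads its long-term memory file plus the last few daily notes.

```python
# Hypothetical sketch of "read your own notes after a reset".
# The workspace path, folder names, and file layout are assumptions,
# not OpenClaw's real memory schema.
from pathlib import Path

WORKSPACE = Path("~/clawd-workspace").expanduser()  # assumed location

def rebuild_context() -> str:
    """Concatenate long-term memory plus the most recent daily notes."""
    parts = []

    long_term = WORKSPACE / "MEMORY.md"   # assumed long-term memory file
    if long_term.exists():
        parts.append(long_term.read_text())

    notes_dir = WORKSPACE / "notes"       # assumed daily-notes folder
    if notes_dir.exists():
        # Daily notes named YYYY-MM-DD.md sort lexicographically by date.
        for note in sorted(notes_dir.glob("*.md"))[-3:]:
            parts.append(f"## {note.stem}\n{note.read_text()}")

    return "\n\n".join(parts)  # injected into the fresh context window
```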
It's not perfect. I've caught it with amnesia a few times after a context reset. But the architecture is solid, and when it works, it feels less like using a tool and more like working with someone.
Running Multiple Models (For Free, Mostly)
Here's the thing nobody tells you about AI agents: you don't need to run everything on the most expensive model. Different tasks need different capabilities, and most of what an agent does on a daily basis doesn't require frontier reasoning.
My setup uses four providers simultaneously:
- NVIDIA Kimi K2 Thinking handles daily conversation. It's free through NVIDIA's API, has a 256K context window, and it's a reasoning model. It handles 90% of what I throw at it.
- Gemini 2.0 Flash runs all my scheduled jobs and web search tasks. Also free, with a massive 1M context window.
- Ollama runs open-source models locally on my Mac Mini (M4 Pro, 48GB RAM). Qwen 2.5 Coder for programming tasks, DeepSeek R1 for reasoning. Zero cost.
- Claude Opus 4 is the big gun. Expensive, but I only switch to it when a task genuinely needs it.
The fallback chain is set up so free and local models absorb failures before anything hits a paid API. The most expensive model is last, not first. On a typical day I spend between zero and five dollars on API calls. The expensive days are a conscious choice for specific projects, not a default.
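The routing idea itself is simple. Here's a rough Python sketch of a cheapest-first fallback chain - the provider names, the `complete` callables, and the call shape are placeholders, not OpenClaw's actual config or API:

```python
# Conceptual sketch of a cheapest-first fallback chain. Provider names and
# the provider callables are illustrative placeholders.
from typing import Callable

ProviderCall = Callable[[str], str]

def make_chain(providers: list[tuple[str, ProviderCall]]) -> Callable[[str], str]:
    """Try each provider in order; only fall through to the next on failure."""
    def run(prompt: str) -> str:
        last_error: Exception | None = None
        for name, complete in providers:
            try:
                return complete(prompt)
            except Exception as err:   # timeouts, rate limits, malformed output...
                last_error = err
                print(f"{name} failed ({err}); falling back")
        raise RuntimeError("all providers failed") from last_error
    return run

# Cheapest first, most expensive last (the callables themselves are omitted here).
# chain = make_chain([
#     ("kimi-k2-thinking (free)", call_nvidia),
#     ("gemini-2.0-flash (free)", call_gemini),
#     ("qwen2.5-coder via ollama (local)", call_ollama),
#     ("claude-opus-4 (paid)", call_anthropic),
# ])
```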
You can run a genuinely useful AI agent for $0/month. That surprised me.
The Hotel Lock Project
I collect old hotel room keys for security research. I have about 36 of them from work travel between 2019 and 2022, all dumped with a Proxmark3 (an RFID research tool). The dumps had been sitting in a folder for months, unanalyzed.
I pointed Clawd at them and said, basically, "figure out what's on these cards."
Over the course of a day, working together, the agent:
- Sorted 36 card dumps into hotel lock brands by analyzing the key structures
- Implemented a published key derivation algorithm from DEFCON 32 security research
- Decoded the internal record format (timestamps, room access logs, check-in/checkout markers) - see the sketch after this list
- Cross-referenced my Google Calendar and TripIt travel history to confirm the decoded dates matched my actual hotel stays
- Built a complete Python tool suite for analyzing and working with these cards
- Wrote up the findings as a research document
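To show what that decode-and-correlate step looks like, here's a hedged Python sketch. The byte layout is invented for illustration (the real record formats are vendor-specific and come from the DEFCON 32 research); the point is the pattern of decoding a stay and matching it against calendar data.

```python
# Illustrative only: the byte layout below is invented for the example.
# Real hotel lock records use vendor-specific encodings; what matters here
# is the decode-then-correlate pattern against calendar/TripIt data.
from datetime import date, timedelta

def decode_stay(record: bytes) -> tuple[date, date]:
    """Assume a toy format: [years_since_2000, month, day, nights]."""
    check_in = date(2000 + record[0], record[1], record[2])
    check_out = check_in + timedelta(days=record[3])
    return check_in, check_out

def matches_calendar(stay: tuple[date, date], trips: list[dict]) -> dict | None:
    """Find a calendar entry that overlaps the decoded stay."""
    check_in, check_out = stay
    for trip in trips:
        if trip["start"] <= check_out and trip["end"] >= check_in:
            return trip
    return None

trips = [{"hotel": "Comfort Inn, Custer SD",
          "start": date(2021, 7, 28), "end": date(2021, 7, 30)}]
stay = decode_stay(bytes([21, 7, 28, 2]))
print(matches_calendar(stay, trips))  # -> the Custer, SD entry
```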
One card was a different chip type that breaks every standard attack tool in the Proxmark firmware. Three different tools, three different error messages, all unhelpful. The official recovery script crashed on my Mac. So Clawd researched the chip, found a working manual approach buried across six different GitHub issues, and we recovered all 32 encryption keys in about 15 minutes.
I wrote that process up as a step-by-step guide. There was nothing like it in the community; people were just hitting the same errors and giving up. Clawd submitted it as a pull request to the Proxmark3 project's documentation.
None of this was one prompt and a miracle output. It was iterative: try something, hit a wall, research, adjust, try again. But the agent handled the tedious parts (parsing binary data, correlating timestamps, writing tools, managing git) while I focused on the interesting decisions. That division of labor is the real value.
Three AIs Building a Security Plugin
Here's where it got weird (in a good way).
Giving an AI agent access to your computer means you need guardrails (even when it's a dedicated system). A bad model response or a prompt injection could run destructive commands. It can be dangerous if you don't know what you're doing - or if someone else does. So I decided to build a policy engine plugin that governs every tool call the agent makes: allowlists, deny patterns, risk tiers, escalation tracking.
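To give a flavor of what a policy engine like that does, here's a minimal Python sketch of the decision logic. The real plugin is TypeScript and has its own rule schema; the tool names, tiers, and patterns below are illustrative assumptions only.

```python
# Minimal sketch of an allowlist / deny-pattern / risk-tier check.
# Rule names, tiers, and patterns are illustrative, not the plugin's schema.
import re
from dataclasses import dataclass

DENY_PATTERNS = [r"rm\s+-rf\s+/\s*$", r"mkfs\."]          # destructive commands
ALLOWLIST = {"read_file", "write_file", "git", "send_message"}
HIGH_RISK = {"shell", "browser"}                           # allowed, but escalated and logged

@dataclass
class Decision:
    allow: bool
    reason: str
    escalate: bool = False

def evaluate(tool: str, command: str = "") -> Decision:
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            return Decision(False, f"deny pattern matched: {pattern}")
    if tool in ALLOWLIST:
        return Decision(True, "allowlisted")
    if tool in HIGH_RISK:
        return Decision(True, "high-risk tool, logged for audit", escalate=True)
    return Decision(False, "tool not covered by policy")

print(evaluate("git", "git push origin main"))           # allowlisted
print(evaluate("shell", "rm -rf /"))                     # blocked by deny pattern
print(evaluate("browser", "open https://example.com"))   # allowed but escalated for audit
```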
Instead of building it the normal way, I used three different AI systems in coordinated roles:
Clawd (OpenClaw, running Opus) was the project manager. It audited the OpenClaw source code, wrote a detailed build specification with types and interfaces, then ran all the testing. It had full tool access: filesystem, shell, git, browser.
Claude Code CLI was the programmer. It received the spec and generated the entire plugin (16 files, ~750 lines of TypeScript) in about 20 minutes. No tool access; just spec in, code out.
ChatGPT 5.2 was the architectural reviewer. And here's the fun part: Clawd talked to ChatGPT by controlling a Chrome tab through OpenClaw's browser relay extension. It pasted text into ChatGPT's input field, clicked send, read the response, and continued the conversation. Two AIs having an architecture discussion while a third one ran the tests.
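OpenClaw's relay extension has its own mechanics, but the general pattern is plain browser automation. Here's a rough Playwright sketch of the paste-send-read loop - the selectors, debugging port, and fixed wait are placeholders, not what the relay actually uses:

```python
# Generic browser-automation sketch of the paste/send/read loop, written with
# Playwright. Selectors, the CDP port, and the fixed wait are placeholders;
# OpenClaw's relay extension works differently under the hood.
from playwright.sync_api import sync_playwright

def ask_in_browser(prompt: str) -> str:
    with sync_playwright() as p:
        # Attach to a Chrome already running with --remote-debugging-port=9222.
        browser = p.chromium.connect_over_cdp("http://localhost:9222")
        page = browser.contexts[0].pages[0]       # assumes the chat tab is the first open tab
        page.fill("textarea", prompt)             # placeholder selector for the input box
        page.click("button[type=submit]")         # placeholder selector for the send button
        page.wait_for_timeout(15_000)             # crude wait; real code should poll for completion
        return page.inner_text("div.response:last-of-type")  # placeholder selector for the reply
```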
ChatGPT reviewed the design, recommended a testing priority, compared our approach to an existing community effort, and helped plan the submission strategy. It brought an outside perspective that neither of the other two had.
Automating a Manual Workflow
I run a small side business that (amongst other things) creates ID cards for cosplayers. I have several Photoshop templates for these. When an order comes in, I have to print out the order, load up the template, and manually insert the photo and details for the card, along with generating a QR code. It's tedious work. I wondered if Clawd could find a better way. He did.

I had a large order of 26 cards for one cosplay group. Clawd analyzed my Photoshop template and learned how to create the cards himself. He then pulled the orders straight from the e-commerce site I use and generated all 26 cards, after a few iterations to get all the elements in the right places. (He's got no actual eyes, you see...) He then assembled them into one large print job. I just submitted it to my printer, boxed the cards up, and shipped them.

I've since had him automating the rest of the templates and checking the e-commerce site for new orders periodically, so the print job will be ready for me - likely before I even know the order is there.
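I don't know line-for-line how Clawd scripted it, but the core of that kind of automation is easy to sketch in Python with Pillow and the qrcode library. Everything here - coordinates, field names, the verification URL, the fetch_orders helper - is made up for illustration, and it assumes the Photoshop template has been exported as a flat PNG.

```python
# Illustrative sketch of compositing an ID card from a template image.
# Coordinates, the order dict, and the verify_url are invented for the example;
# it assumes the PSD template has been exported as a flat PNG.
import io

import qrcode
from PIL import Image, ImageDraw, ImageFont

def render_card(order: dict, template_path: str, out_path: str) -> None:
    card = Image.open(template_path).convert("RGBA")
    draw = ImageDraw.Draw(card)
    font = ImageFont.load_default()

    # Member photo, pasted into an assumed slot on the template.
    photo = Image.open(order["photo_path"]).resize((300, 400))
    card.paste(photo, (40, 120))

    # Name and character fields at assumed positions.
    draw.text((380, 140), order["name"], font=font, fill="black")
    draw.text((380, 200), order["character"], font=font, fill="black")

    # QR code pointing at an assumed verification URL.
    buf = io.BytesIO()
    qrcode.make(order["verify_url"]).save(buf)
    buf.seek(0)
    qr = Image.open(buf).resize((180, 180))
    card.paste(qr, (380, 280))

    card.save(out_path)

# for order in fetch_orders():   # hypothetical helper that reads the shop's order API
#     render_card(order, "template.png", f"cards/{order['order_id']}.png")
```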
The Bugs That Only an Agent Could Find
The most interesting part wasn't the build. It was the three bugs that surfaced during live testing, because the policy engine was governing Clawd's own tool calls while Clawd was testing it.
First: the safety mode blocked the agent's messaging tool. Clawd couldn't tell me it was stuck. Completely bricked, silently.
Second: a retry counter designed to catch runaway agents accidentally blocked all tools after a threshold, including the ones that had just been exempted. Different bug, same result: bricked.
Third: the deny patterns (which block dangerous commands like rm -rf /) were matching against file content, not just commands. When Clawd tried to write notes about security concepts, the notes contained the same strings the policy was designed to block. The agent couldn't even write the fix for this bug because the fix contained the blocked strings. We had to encode the fix as base64 and decode it on write.
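The shape of the fix is easy to sketch (this is an illustration of the idea, not the plugin's actual code): deny patterns screen only the command string, and content that merely mentions dangerous commands travels base64-encoded so it can never match the patterns literally.

```python
# Sketch of the *shape* of the fix, not the plugin's actual implementation:
# deny patterns see only executable commands, and file content that talks
# about dangerous commands is shipped base64-encoded.
import base64
import re

DENY_PATTERNS = [re.compile(r"rm\s+-rf\s+/\s*$")]

def command_allowed(command: str) -> bool:
    """Executable commands are screened against deny patterns."""
    return not any(p.search(command) for p in DENY_PATTERNS)

def write_encoded(path: str, encoded_content: str) -> None:
    """File content arrives base64-encoded and is never pattern-checked: it's data, not a command."""
    with open(path, "w") as fh:
        fh.write(base64.b64decode(encoded_content).decode())

note = "Reminder: never run rm -rf / outside a sandbox."
write_encoded("/tmp/security-notes.md", base64.b64encode(note.encode()).decode())
```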
None of these would show up in unit tests. They only surfaced because a real agent was using the plugin to govern its own behavior in real time. That's the kind of testing you get when your test harness is also your user.
All three bugs are fixed, 73 tests pass, and the plugin is running in production. Every decision is logged for audit.
What Makes OpenClaw Different
It actually does things. Not "here's a code snippet you could try." It writes the file, runs the test, commits to git, and tells you the result. When I asked it to file a pull request, it cloned the repo, created a branch, wrote the commit message, pushed, and opened the PR. I reviewed it on GitHub afterward.
It gets smarter with context. The memory system means it knows what we worked on yesterday. It knows my calendar, my travel history, my project preferences. When it decoded a timestamp from a hotel card and said "that's your Custer, South Dakota trip," it wasn't guessing. It had already checked my calendar.
It delegates. OpenClaw can spawn sub-agents for parallel work, each with its own model and context. Web research goes to one model; coding goes to another; the main agent coordinates. It's more like managing a small team than using a single tool.
It's extensible. The plugin system, the skill marketplace, the cron scheduling, the channel bindings. You can shape it into whatever you need. Mine runs a weather alert bot, monitors Discord servers, and does security research. Someone else's might manage a homelab or run a business workflow.
It's honest about cost. With the multi-model routing, you control exactly where your money goes. Free models for routine work. Paid models for hard problems. Local models for privacy-sensitive tasks. You're not locked into one provider's pricing.
The Rough Edges
I'd be dishonest if I didn't mention them.
Context compaction (when the conversation history gets too long and resets) can cause "amnesia". The agent has recovery procedures, but they don't always work perfectly. I've had to re-explain things after a reset.
Small local models (under 14 billion parameters) can't reliably do tool calling. They hallucinate function schemas and get stuck in loops. They work as last-resort fallbacks, not as daily drivers.
The browser relay for cross-AI communication was functional but janky. Clipboard paste events and JavaScript click handlers. It worked, but it's not exactly production-grade.
And setting all of this up took real effort. The config file, the model routing, the memory system, the channel integrations. OpenClaw is powerful, but it's not plug-and-play. You need to invest time upfront.
Was It Worth It?
Absolutely.
In two days, my agent reverse-engineered proprietary hotel lock cards, built a security plugin that found bugs no unit test would catch, collaborated across three AI platforms, and contributed to an open-source project. It manages my scheduled briefings, monitors my Discord servers, and checks my calendar: an evening preview of tomorrow and a morning rundown of the day ahead, each combined with a news and weather update. It costs me almost nothing on a normal day. And the kicker? I'm still only scratching the surface.
The moment that sold me: watching Clawd decode a timestamp from a 5-year-old hotel key card, cross-reference it against my Google Calendar, and tell me "that's your Comfort Inn in Custer, South Dakota, Room 315, July 28, 2021. You checked in late because you were at Mount Rushmore."
It was right.
I'm a sales engineer in cybersecurity by day and wrestle AI by night. OpenClaw is open source at github.com/openclaw/openclaw. The community Discord is at discord.com/invite/clawd.