The industry is talking in circles here. All you need is "composability".
UNIX solved this with files and pipes for data, and processes for compute.
AI agents are solving this with sub-agents for data, and "code execution" for compute.
The UNIX approach is both technically correct and elegant, and what I strongly favor too.
The agent + MCP approach is getting there. But not every harness has sub-agents, or their invocation is non-deterministic, which is where "MCP context bloat" happens.
What's interesting to me: to those of us who came up thinking the Unix Way, it was obvious that the CLI is a great fit for LLM tool use, given its composability, its usage discoverability, and the gobs of documentation in posts and man pages that are heavily represented in LLM training corpora. Yet acknowledging this seems to be only a recent trend (and perhaps the next hype wave).
Also interesting: while the big vendors are following this trend and now trying to take the lead in it, they still suggest things like "but use a JSON schema" (the linked article does a bit of the same). They acknowledge that incremental learning via `--help` is useful and can be token-conserving (with the caveat that if the model already "knows" the correct pattern, it wouldn't need to spend tokens learning it, so there is a potential trade-off). But they also suggest that LLMs would prefer to receive argument knowledge as JSON rather than in plain language, even though the entire point of an LLM is to understand and create plain language. That seemed dubious to me, and a part of me wondered whether that advice is nonsense motivated by a desire to sell more token use. I'm only partially kidding, and I'm still dubious of the efficacy.
* Here's a TL;DR for anyone who wants to skip the rest of this long message: I ran an LLM CLI eval in the form of a constructed CTF. Results and methodology are in the two links in this section of the README:
https://github.com/scottvr/jelp?tab=readme-ov-file#what-else
Anyhow... I had been experimenting with the idea of having `--help` output JSON when invoked by a machine, and came up with a simple module that exposes `--help` content as JSON, simply by adding a `--jelp` argument to any tool that already uses argparse.
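The mechanism is simple, since argparse already holds all the help content. A stripped-down sketch of the approach (illustrative only, not the published module; note it walks the private `_actions` list, which is stable in practice but not a public API):

```python
import argparse
import json
import sys

def jelp(parser: argparse.ArgumentParser) -> str:
    """Serialize an argparse parser's help content as JSON."""
    args = []
    for action in parser._actions:  # private attribute, but stable in practice
        args.append({
            "flags": action.option_strings,
            "dest": action.dest,
            "help": action.help,
            "required": action.required,
            "default": action.default,
            "choices": list(action.choices) if action.choices else None,
        })
    return json.dumps(
        {"prog": parser.prog, "description": parser.description, "arguments": args},
        indent=2,
        default=str,  # stringify non-serializable defaults (e.g. SUPPRESS sentinels)
    )

parser = argparse.ArgumentParser(prog="mytool", description="demo tool")
parser.add_argument("--count", type=int, default=1, help="how many times")
if "--jelp" in sys.argv:
    print(jelp(parser))
```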
In the process, I started testing to see whether all this extra machine-readable content actually improved performance, what it did to token use, etc. While I was building out the tests, trying to settle on legitimate and fair ways to reach valid conclusions, I learned of the OpenCLI schema draft, so I altered jelp's output to fit that schema and set about documenting the things I found lacking in the draft, meanwhile settling on including these arg-related items as metadata in the output.
I'll get to the point. I just finished cleaning the output up enough to put it in a public repo, because my intent is to share my findings with the OpenCLI folks, in hopes that they'll consider the gaps between their schema draft and what's commonly in use. But what started as a secondary thought in service of this little tool I called "jelp" is a benchmarking harness (and the first publishable results from it) that, to me, are quite interesting. I'd be happy if others found them interesting too and added to the existing test results with additional runs, models, ideas for the harness, criticism of the method's validity, etc.
The evaluation harness uses constructed CLI fixtures arranged as little CLI CTFs, where an LLM demonstrates its ability to use an unknown CLI by capturing a "flag" that it must discover through the usage help and a trail of learned arguments.
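In spirit, a fixture is a tiny tool whose flag is only reachable by chaining arguments the model can discover through the help text. A simplified sketch of the idea (hypothetical, not the actual fixture code; names are illustrative):

```python
import argparse

def run(argv):
    # Hypothetical CTF fixture: the flag is reachable only by chaining
    # arguments discovered through the usage help.
    parser = argparse.ArgumentParser(prog="vault", description="Open the vault.")
    parser.add_argument("--list-keys", action="store_true",
                        help="List available key names")
    parser.add_argument("--use-key", metavar="NAME",
                        help="Unlock with a key name from --list-keys")
    args = parser.parse_args(argv)
    if args.list_keys:
        return "keys: hunter2"              # the breadcrumb the model must notice
    if args.use_key == "hunter2":
        return "FLAG{read-the-help}"        # captured only via the learned trail
    return parser.format_usage().strip()

print(run(["--list-keys"]))
```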
My findings at first confirmed my intuitions, which was disappointing but unsurprising. When testing with GPT-4.1-mini, no manner of forcing it to receive info about the CLI via JSON was more effective than just letting it use the human-friendly plain-English output of `--help`, and in all cases the JSON versions burned more tokens. I was able to elicit better performance by some measurements from 5.1-mini, but again the trade-off was higher token burn.
I'll link straight to the part of the README that shows one table of results and contains links to the LLM CLI CTF part of the repo, as well as the generated report from the phase-1 runs. All the code to reproduce or run your own variation is there (as is the code for the jelp module, if there is any interest, though I expect the CLI CTF eval is more interesting to most.)
Experience level: very senior, programming for 25 years, have managed platform teams at Heroku and Segment.
Project type: new startup started Jan ‘26 at https://housecat.com. Pitch is “dev tools for non developers”
Team size: currently 2.
Stack: Go, vanilla HTML/CSS/JS, Postgres, SQLite, GCP and exe.dev.
Claude code and other coding harnesses fully replaced typing code in an IDE over the past year for me.
I’ve tried so many tools. Cursor, Claude and Codex, open source coding agents, Conductor, building my own CLIs and online dev environments. Tool churn is a challenge but it pays dividends to keep trying things as there have been major step functions in productivity and multi tasking. I value the HN community for helping me discover and cut through the space.
Multiple VMs available over SSH, each with an LLM pre-configured, has been the latest level-up.
Coding is still hard work: designing tests, steering agents, reviewing code, and splitting up PRs. I still use every bit of my experience every day and feel tired at the end of the day.
My non-programmer co-founder, more of a product manager and biz ops person, has challenges all the time. He generally can only write functional prototypes. We solve this by embracing the functional prototype and doing a lot of pair programming. It is much more productive than design docs or Figma wireframes.
In general the game changer is how much a couple of people can get done. We’re able to prototype ideas, build the real app, manage SOC2 infra, marketing and go to market better than ever thanks to the “willing interns” we have. I’ve done all this before and the AI helps with so much of the boilerplate and busywork.
I’m looking for beta testers and security researchers for the product, as well as a full time engineer if anyone is interested in seeing what a “greenfield” product, engineering culture and business looks like in 2026. Contact info in my profile.
Interesting premise for your product. Hope you find success!
From a dev perspective, I feel your website gives off more of an "OpenClaw you can trust" vibe than "dev tools for non developers". Is that right? Or am I misreading the idea?
The OpenClaw stuff is awesome but it’s too raw for a lot of professionals and small teams. We’re trying to bring more guardrails to the concept and more of a Ruby on Rails philosophy to how it works.
> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.
I now have a bunch of layers of text / markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or full out replacements to these on every repo or subproject.
Then we want to do things like update the "root" system prompt and have that applied everywhere.
There are analogies in git, CMS templating systems, software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.
Any other approaches to this problem? Or are Beagle and ASTs and CRDTs really onto something here?
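The simplest thing I can picture is plain layered resolution, closest layer wins; a toy sketch of what I mean (my own illustration, not anyone's actual design):

```python
# Toy sketch: resolve an agent's effective prompt set from ordered overlays,
# with closer layers (repo, subproject) shadowing the root system prompt.
layers = [
    {"system": "root system prompt v2"},        # updating this applies everywhere...
    {"AGENTS.md": "repo-wide agent notes"},
    {"system": "this repo overrides the root"}, # ...unless a closer layer shadows it
]

def resolve(layers):
    effective = {}
    for layer in layers:        # later (closer) layers override earlier ones
        effective.update(layer)
    return effective

print(resolve(layers)["system"])  # prints "this repo overrides the root"
```

But that's exactly the flat plain-text merging that starts to feel wrong once you want to split and join along finer axes, which is where the CRDT pitch comes in.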
Haven’t totally cracked the nut yet either, but the patterns I’ve had the best luck with are:
“Vibe” with vanilla HTML/CSS/JS. Surprisingly good at making the first version functional. No build step is great for iteration speed.
“Serious” with Go, server-side template rendering and handlers, with go-rod (a Chrome DevTools Protocol driver) testing components and taking screenshots. With a skill and some existing examples it crunches away and makes good, tested components. A single compiled language is great for correctness and maintenance.
Go and its long established conventions and tools continues to be a massive boon to my agentic coding.
We have `go run main.go` as the convention to boot every app’s dev environment, with support for multiple worktrees, central config management, a pre-migrated database and more. Makes it easy and fast to develop and test many versions of an app at once.
I’m looking for feedback, testing and possible security engineering contracts for the approach we are taking at Housecat.com.
The agent accesses everything through a centralized connections proxy. No direct API tokens or access.
This means we can apply additional policies and approval workflows and audit all access.
https://housecat.com/docs/v2/features/connection-hub
Some obvious ones: only grant read and draft permissions, and review and send drafts manually.
Some more clever ones: only allow sending 5 messages a day, or enforce soft-delete patterns. This prevents accidentally spamming everyone or deleting things.
Next up is giving the agent “wrapped”, down-scoped tokens if you do want to equip it with the ability to make direct API calls. These still go through the proxy, which enforces the policies too.
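A toy sketch of what one of those quota policies amounts to at the proxy (heavily simplified, not our actual implementation):

```python
from datetime import date

class SendQuota:
    """Toy proxy policy: allow at most N outbound sends per day."""
    def __init__(self, limit=5):
        self.limit = limit
        self.day = date.today()
        self.sent = 0

    def allow(self) -> bool:
        today = date.today()
        if today != self.day:      # reset the counter when the day rolls over
            self.day, self.sent = today, 0
        if self.sent >= self.limit:
            return False           # proxy rejects; nothing ever hits the real API
        self.sent += 1
        return True

q = SendQuota(limit=5)
print([q.allow() for _ in range(6)])  # the sixth call is refused
```

The point is that the check lives in the proxy, not the agent, so it holds no matter what the agent decides to try.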