The industry is talking in circles here. All you need is "composability".
UNIX solved this with files and pipes for data, and processes for compute.
AI agents are solving this with sub-agents for data, and "code execution" for compute.
The UNIX approach is both technically correct and elegant, and what I strongly favor too.
The agent + MCP approach is getting there. But not every harness has sub-agents, or their invocation is non-deterministic, which is where "MCP context bloat" happens.
What's interesting to me: to those of us who came up thinking the Unix Way, it was obvious that the CLI is a great fit for LLM tool use, given its composability, its usage discoverability, and the gobs of documentation in posts and man pages that are heavily represented in LLM training corpora. Yet acknowledging this seems to be only a recent trend (and perhaps the next hype wave).
Also interesting: while the big vendors are following this trend and now trying to take the lead in it, they still suggest things like "but use a JSON schema" (the linked article does a bit of the same). They acknowledge that incremental learning via `--help` is useful and can be token-conserving (with the caveat that if the model already "knows" the correct pattern, it wouldn't need to spend tokens learning it, so there is a potential trade-off). But they also suggest that LLMs would prefer to receive argument knowledge as JSON rather than in plain language, even though the entire point of an LLM is to understand and create plain language. That seemed dubious to me, and a part of me wondered whether that advice is nonsense motivated by a desire to sell more token use. I'm only partially kidding, and I'm still dubious of the efficacy.
* Here's a TL;DR for anyone who wants to skip the rest of this long message: I ran an LLM CLI eval in the form of a constructed CTF. Results and methodology are in the two links in this section of the README:
https://github.com/scottvr/jelp?tab=readme-ov-file#what-else
Anyhow... I had been experimenting with the idea of having `--help` output JSON when invoked by a machine, and came up with a simple module that exposes `--help` content as JSON, simply by adding a `--jelp` argument to any tool that already uses argparse.
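The mechanism is simple, since argparse already holds all the help content. A stripped-down sketch of the approach (illustrative only, not the published module; note it walks the private `_actions` list, which is stable in practice but not a public API):

```python
import argparse
import json
import sys

def jelp(parser: argparse.ArgumentParser) -> str:
    """Serialize an argparse parser's help content as JSON."""
    args = []
    for action in parser._actions:  # private attribute, but stable in practice
        args.append({
            "flags": action.option_strings,
            "dest": action.dest,
            "help": action.help,
            "required": action.required,
            "default": action.default,
            "choices": list(action.choices) if action.choices else None,
        })
    return json.dumps(
        {"prog": parser.prog, "description": parser.description, "arguments": args},
        indent=2,
        default=str,  # stringify non-serializable defaults (e.g. SUPPRESS sentinels)
    )

parser = argparse.ArgumentParser(prog="mytool", description="demo tool")
parser.add_argument("--count", type=int, default=1, help="how many times")
if "--jelp" in sys.argv:
    print(jelp(parser))
```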
In the process, I started testing to see whether all this extra machine-readable content actually improved performance, what it did to token use, etc. While I was building out the tests, trying to settle on legitimate and fair ways to reach valid conclusions, I learned of the OpenCLI schema draft, so I altered jelp's output to fit that schema and set about documenting the things I found lacking in the draft, meanwhile settling on including these arg-related items as metadata in the output.
I'll get to the point. I just finished cleaning the output up enough to put it in a public repo, because my intent is to share my findings with the OpenCLI folks, in hopes that they'll consider the gaps between their schema draft and what's commonly in use. But what started as a secondary thought in service of this little tool I called "jelp" is a benchmarking harness (and the first publishable results from it) that, to me, are quite interesting. I'd be happy if others found them interesting too and added to the existing test results with additional runs, models, ideas for the harness, criticism of the method's validity, etc.
The evaluation harness uses constructed CLI fixtures arranged as little CLI CTFs, where an LLM demonstrates its ability to use an unknown CLI by capturing a "flag" that it must discover through the usage help and a trail of learned arguments.
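In spirit, a fixture is a tiny tool whose flag is only reachable by chaining arguments the model can discover through the help text. A simplified sketch of the idea (hypothetical, not the actual fixture code; names are illustrative):

```python
import argparse

def run(argv):
    # Hypothetical CTF fixture: the flag is reachable only by chaining
    # arguments discovered through the usage help.
    parser = argparse.ArgumentParser(prog="vault", description="Open the vault.")
    parser.add_argument("--list-keys", action="store_true",
                        help="List available key names")
    parser.add_argument("--use-key", metavar="NAME",
                        help="Unlock with a key name from --list-keys")
    args = parser.parse_args(argv)
    if args.list_keys:
        return "keys: hunter2"              # the breadcrumb the model must notice
    if args.use_key == "hunter2":
        return "FLAG{read-the-help}"        # captured only via the learned trail
    return parser.format_usage().strip()

print(run(["--list-keys"]))
```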
My findings at first confirmed my intuitions, which was disappointing but unsurprising. When testing with GPT-4.1-mini, no manner of forcing it to receive info about the CLI via JSON was more effective than just letting it use the human-friendly plain-English output of `--help`, and in all cases the JSON versions burned more tokens. I was able to elicit better performance by some measurements from 5.1-mini, but again the trade-off was higher token burn.
I'll link straight to the part of the README that shows one table of results and contains links to the LLM CLI CTF part of the repo, as well as the generated report from the phase-1 runs. All the code to reproduce or run your own variation is there (as is the code for the jelp module, if there is any interest, though I expect the CLI CTF eval is more interesting to most.)
Experience level: very senior, programming for 25 years, have managed platform teams at Heroku and Segment.
Project type: new startup started Jan ‘26 at https://housecat.com. Pitch is “dev tools for non developers”
Team size: currently 2.
Stack: Go, vanilla HTML/CSS/JS, Postgres, SQLite, GCP and exe.dev.
Claude code and other coding harnesses fully replaced typing code in an IDE over the past year for me.
I’ve tried so many tools. Cursor, Claude and Codex, open source coding agents, Conductor, building my own CLIs and online dev environments. Tool churn is a challenge but it pays dividends to keep trying things as there have been major step functions in productivity and multi tasking. I value the HN community for helping me discover and cut through the space.
Multiple VMs available over SSH, each with an LLM pre-configured, has been the latest level-up.
Coding is still hard work: designing tests, steering agents, reviewing code, and splitting up PRs. I still use every bit of my experience every day and feel tired at the end of the day.
My non-programmer co-founder, more of a product manager and biz ops person, has challenges all the time. He generally can only write functional prototypes. We solve this by embracing the functional prototype and doing a lot of pair programming. It is much more productive than design docs or Figma wireframes.
In general the game changer is how much a couple of people can get done. We’re able to prototype ideas, build the real app, manage SOC2 infra, marketing and go to market better than ever thanks to the “willing interns” we have. I’ve done all this before and the AI helps with so much of the boilerplate and busywork.
I’m looking for beta testers and security researchers for the product, as well as a full time engineer if anyone is interested in seeing what a “greenfield” product, engineering culture and business looks like in 2026. Contact info in my profile.
Interesting premise for your product. Hope you find success!
From a dev perspective, I feel your website gives off more of an "OpenClaw you can trust" vibe than "dev tools for non developers". Is that right? Or am I misreading the idea?
The OpenClaw stuff is awesome but it’s too raw for a lot of professionals and small teams. We’re trying to bring more guardrails to the concept and more of a Ruby on Rails philosophy to how it works.
> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.
I now have a bunch of layers of text / markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or full out replacements to these on every repo or subproject.
Then we want to do things like update the "root" system prompt and have that applied everywhere.
There are analogies in git, CMS templating systems, software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.
Any other approaches to this problem? Or are Beagle and ASTs and CRDTs really onto something here?
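The simplest thing I can picture is plain layered resolution, closest layer wins; a toy sketch of what I mean (my own illustration, not anyone's actual design):

```python
# Toy sketch: resolve an agent's effective prompt set from ordered overlays,
# with closer layers (repo, subproject) shadowing the root system prompt.
layers = [
    {"system": "root system prompt v2"},        # updating this applies everywhere...
    {"AGENTS.md": "repo-wide agent notes"},
    {"system": "this repo overrides the root"}, # ...unless a closer layer shadows it
]

def resolve(layers):
    effective = {}
    for layer in layers:        # later (closer) layers override earlier ones
        effective.update(layer)
    return effective

print(resolve(layers)["system"])  # prints "this repo overrides the root"
```

But that's exactly the flat plain-text merging that starts to feel wrong once you want to split and join along finer axes, which is where the CRDT pitch comes in.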
Haven’t totally cracked the nut yet either, but the patterns I’ve had the best luck with are:
“Vibe” with vanilla HTML/CSS/JS. Surprisingly good at making the first version functional. No build step is great for iteration speed.
“Serious” with Go, server-side template rendering and handlers, with go-rod (a Chrome DevTools Protocol driver) testing components and taking screenshots. With a skill and some existing examples it crunches away and makes good, tested components. A single compiled language is great for correctness and maintenance.
Go and its long established conventions and tools continues to be a massive boon to my agentic coding.
We have `go run main.go` as the convention to boot every app’s dev environment, with support for multiple worktrees, central config management, a pre-migrated database and more. Makes it easy and fast to develop and test many versions of an app at once.
I’m looking for feedback, testing and possible security engineering contracts for the approach we are taking at Housecat.com.
The agent accesses everything through a centralized connections proxy. No direct API tokens or access.
This means we can apply additional policies and approval workflows and audit all access.
https://housecat.com/docs/v2/features/connection-hub
Some obvious ones: only grant read and draft permissions, and review and send drafts manually.
Some more clever ones: only allow sending 5 messages a day, or enforce soft-delete patterns. This prevents accidentally spamming everyone or deleting things.
Next up is giving the agent “wrapped”, down-scoped tokens if you do want to equip it with the ability to make direct API calls. These still go through the proxy, which enforces the policies too.
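A toy sketch of what one of those quota policies amounts to at the proxy (heavily simplified, not our actual implementation):

```python
from datetime import date

class SendQuota:
    """Toy proxy policy: allow at most N outbound sends per day."""
    def __init__(self, limit=5):
        self.limit = limit
        self.day = date.today()
        self.sent = 0

    def allow(self) -> bool:
        today = date.today()
        if today != self.day:      # reset the counter when the day rolls over
            self.day, self.sent = today, 0
        if self.sent >= self.limit:
            return False           # proxy rejects; nothing ever hits the real API
        self.sent += 1
        return True

q = SendQuota(limit=5)
print([q.allow() for _ in range(6)])  # the sixth call is refused
```

The point is that the check lives in the proxy, not the agent, so it holds no matter what the agent decides to try.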