Hacker News | kxbnb's comments

I built Axiomo because every AI code review tool I tried kept solving the wrong problem. They all want to be a second developer on your PR, catching lint issues and suggesting refactors. That's not what makes review hard.

What makes it hard is context. Who wrote this, have they worked in this area before, what are they actually trying to do, and where should I focus?

Axiomo generates a structured Signal for every PR that answers those questions. Contributor context, intent, risk drivers, CI evidence, focus files. No auto-generated code suggestions. No comment spam. Just the stuff that helps you actually review instead of skim and approve.
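
Roughly, the shape of a Signal (field names here are illustrative, not the exact schema):

    # Illustrative sketch of the context a Signal carries - not Axiomo's actual schema.
    signal = {
        "contributor": {"prior_prs_in_area": 2, "first_time_in_module": False},
        "intent": "migrate billing webhooks from polling to event-driven",
        "risk_drivers": ["touches auth middleware", "includes a schema migration"],
        "ci_evidence": {"tests_added": 4, "coverage_delta": "+1.2%"},
        "focus_files": ["app/api/webhooks/route.ts", "lib/billing/events.ts"],
    }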

Stack is Next.js on Vercel, Neon for Postgres, Resend for email, plus a GitHub integration. Would love feedback on the approach and on what your review workflow looks like.


The framing of craft vs. slop misses something important: most production software quality problems aren't about aesthetics or elegance, they're about correctness under real-world conditions.

I've been using AI coding tools heavily for the past year. They're genuinely useful for the "plumbing" - glue code, boilerplate, test scaffolding. But where they consistently fail is reasoning about system-level concerns: authorization boundaries, failure modes, state consistency across services.

The article mentions AI works best on "well-defined prompts for already often-solved problems." This is accurate. The challenge is that in production, the hard problems are rarely well-defined - they emerge from the interaction between your code and reality: rate limits you didn't anticipate, edge cases in user behavior, security assumptions that don't hold.

Craft isn't about writing beautiful code. It's about having developed judgment for which corners you can't cut - something that comes from having been burned by the consequences.


> Craft isn't about writing beautiful code. It's about having developed judgment for which corners you can't cut - something that comes from having been burned by the consequences.

That's why I'm of the opinion that for senior developers/architects, these coding agents are awesome tools.

For a junior developer? Unless they are of the curious type and develop the systems-level understanding on their own... I'd say there's a big chance the machine is going to replace their job.


Most people using LLMs don't have this craft... which raises the question: should they be using LLMs in the first place? Nope. But given that it's rammed down their throats by folks internally and externally, they will.


Cool project. The agent shorthand/jargon detection is a unique angle - I haven't seen other tools focus on that specifically.

Re: your question about observability pain points - the one I keep hitting is visibility at the external API boundary. Most agent observability (including OTEL-based traces) shows what the agent intended to send, but not necessarily what actually hit the wire when calling external services.

When an agent makes a tool call that hits Stripe, Shopify, or any third-party API, you want to see the actual HTTP request/response - not just the function call in your trace. Especially for debugging "works locally, fails in prod" scenarios or when the vendor says "your request was malformed."
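
To make that concrete, here's a minimal stand-in - not how toran.sh works internally, just httpx event hooks logging what actually goes over the wire (the URL is a placeholder):

    # Illustrative only: log the real outbound request/response, not just the tool call.
    import httpx

    def log_request(request: httpx.Request) -> None:
        print(">>>", request.method, request.url)
        print(">>> headers:", dict(request.headers))
        print(">>> body:", request.content[:500])

    def log_response(response: httpx.Response) -> None:
        response.read()  # body isn't read yet inside the hook
        print("<<<", response.status_code)
        print("<<< body:", response.text[:500])

    client = httpx.Client(event_hooks={"request": [log_request], "response": [log_response]})
    client.post("https://httpbin.org/post", json={"amount": 1000})  # placeholder for a vendor API call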

I built toran.sh for this - a transparent proxy that captures wire-level requests to external APIs. It complements tools like InsAIts: you get both the inter-agent communication view and the external boundary view.

What's your take on capturing outbound API calls vs focusing on agent-to-agent communication?


The "output validation not just input validation" point is underrated. Most guardrails focus on what goes into the model, but the real risk is what comes out and gets executed.

We're working on similar problems at keypost.ai - policy enforcement at the tool-calling boundary for MCP pipelines. Different angle (we're more focused on access control and rate limits than hallucination detection per se) but same philosophy of deterministic enforcement.

Question: how do you handle the tension between semantic enforcement and false positives? In our experience, too-strict semantic rules block legitimate use cases, but too-loose lets things through. Any patterns that worked for calibrating that?


Nice execution on the replay testing with semantic diff - that's a pain point that's hard to solve with just metrics.

One thing I've noticed building toran.sh (HTTP-level observability for agents): there's a gap between "what the agent decided to do" (your trace level) and "what actually went over the wire" (raw requests/responses). Especially with retries, timeouts, and provider failovers - the trace might show success but the HTTP layer tells a different story.

Do you capture the underlying HTTP calls, or is it primarily at the SDK/trace level? Asking because debugging often ends up needing both views.


Thanks, and great point. Right now, Lumina is mainly SDK/trace-level (what the app thinks happened: tokens, cost, latency, outputs), so you’re right that low-level HTTP details like retries/timeouts/failovers can be partially hidden. Capturing the raw HTTP layer alongside traces is on our roadmap because production debugging often needs both views. Also, your “see what your agent is actually doing” angle is spot-on. There’s a lot of opaque magic in agent frameworks. Curious how you’re doing it in toran.sh - via a proxy/intercept, or by wrapping the SDK HTTP client?


Your framing of the problem resonates - treating the LLM as untrusted is the right starting point. The CAR spec sounds similar to what we're building at keypost.ai.

On canonicalization: we found that intercepting at the tool/API boundary (rather than parsing free-form output) sidesteps most aliasing issues. The MCP protocol helps here - structured tool calls are easier to normalize than arbitrary text.

On stateful intent: this is harder. We're experimenting with session-scoped budgets (max N reads before requiring elevated approval) rather than trying to detect "bad sequences" semantically. Explicit resource limits beat heuristics.

On latency: sub-10ms is achievable for policy checks if you keep rules declarative and avoid LLM-in-the-loop validation. YAML policies with pattern matching scale well.
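
For concreteness, a stripped-down sketch of that pattern - not our actual engine, and the rule shapes and names are illustrative:

    # Illustrative sketch: declarative deny rules + a session read budget, no LLM in the loop.
    import fnmatch

    POLICY = {  # the kind of thing you'd load from YAML
        "deny_tools": ["db.drop_*", "fs.delete_*"],
        "max_reads_per_session": 50,
    }

    session_reads = 0

    def check(tool_name: str, args: dict) -> str:
        global session_reads
        tool = tool_name.strip().lower()  # light canonicalization of the structured tool call
        for pattern in POLICY["deny_tools"]:
            if fnmatch.fnmatch(tool, pattern):
                return "deny"
        if tool.startswith("fs.read") or tool.endswith(".read"):
            session_reads += 1
            if session_reads > POLICY["max_reads_per_session"]:
                return "needs_approval"  # budget exceeded -> escalate instead of guessing intent
        return "allow"

    print(check("db.drop_table", {"name": "users"}))     # deny
    print(check("fs.read_file", {"path": "/etc/motd"}))  # allow, counts toward the budget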

Curious about your CAR spec - are you treating it as a normalization layer before policy evaluation, or as the policy language itself?


Interesting approach to the instruction bloat problem. The composable skills idea makes sense - 500 tokens vs 10K is a real difference.

One thing I'd be curious about: how do you think about security when skills auto-provision based on stack detection? If a skill gets compromised upstream, the auto-sync could propagate it quickly.

We're working on policy enforcement for MCP at keypost.ai and thinking about similar trust questions - what should be allowed to load/execute vs what needs explicit approval.


Hi, thanks for reaching out. Yep, it's a big issue... not only for skills but for all dependencies. One of the options I see is governance... rely on a trusted listing where you and your experts curate/validate/assess and select the ones that match your quality standards. The current MCP already supports that, you just need to change the listing JSON file. Right now I use this listing https://github.com/dmgrok/agent_skills_directory which pulls the skills from some "trusted" repos: Anthropic, Vercel, GitHub.

How are you dealing with this topic at keypost.ai?


The insight about environment attacks vs. model attacks is critical. "The model functioned correctly, yet the overall agent system remained compromised because it trusted its tools' outputs."

This is why I've been focused on boundary visibility. Agents are opaque until they hit real tools - and if you can't see what's actually being sent/received at each boundary, you can't detect manipulation.

We built toran.sh to provide that inspection layer - read-only proxies that show the actual wire-level request/response. Doesn't prevent attacks, but makes them visible.

Curious what detection mechanisms you're recommending alongside the attack framework?


Nice work - the "deploy-friendly guardrails" framing resonates. Too many MCP tools assume local dev only.

To your question about what bites first: in our experience at keypost.ai, the order is usually:

1. *Auth* - OAuth token refresh edge cases, especially when agents run long tasks that span token expiry
2. *Rate limits* - not having them, then having them but too coarse (per-tool vs per-endpoint vs per-argument)
3. *Observability* - specifically, correlating agent intent with actual tool calls when debugging why something failed
4. *Sandboxing* - usually comes up after the first "oops" moment

One pattern we've found useful: separating "can this identity call this tool" (auth) from "should this specific call be allowed" (policy). They're often conflated but have different failure modes and different owners (security team vs product team).
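
In code terms, the split looks roughly like this (a sketch - the names, grants, and rules are made up for the example):

    # Illustrative sketch of keeping identity auth and per-call policy separate.
    def is_authorized(identity: str, tool: str) -> bool:
        # "Can this identity call this tool at all?" - typically owned by the security team.
        grants = {"ci-bot": {"repo.read", "repo.comment"}, "deploy-agent": {"deploy.run"}}
        return tool in grants.get(identity, set())

    def is_allowed(tool: str, args: dict) -> bool:
        # "Should this specific call go through?" - typically owned by the product team.
        if tool == "deploy.run" and args.get("env") == "prod" and not args.get("change_ticket"):
            return False
        return True

    def gate(identity: str, tool: str, args: dict) -> bool:
        return is_authorized(identity, tool) and is_allowed(tool, args)

    print(gate("deploy-agent", "deploy.run", {"env": "prod"}))                            # False: no ticket
    print(gate("deploy-agent", "deploy.run", {"env": "prod", "change_ticket": "CH-42"}))  # True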

Curious how you're handling policy in PolyMCP - is it config-driven or code-driven?


Sorry for the late response, and thanks a lot for the suggestion! Policy is now config-driven: allow/deny lists plus metadata constraints, budgets, and redaction. Auth is handled separately (OAuth2 with auto-refresh). Hooks are planned.

How do you do policy at keypost.ai?


The middleware proxy approach unop mentioned is the right pattern - you need an enforcement point the agent can't bypass.

At keypost.ai we're building exactly this for MCP pipelines. The proxy evaluates every tool call against deterministic policy before it reaches the actual tool. No LLM in the decision path, so no reasoning around the rules.

Re: chrisjj's point about "fix the program" - the challenge is that agents are non-deterministic by design. You can't unit test every possible action sequence. So you need runtime enforcement as a safety net, similar to how we use IAM even for well-tested code.

The human-in-the-loop suggestion works but doesn't scale. What we're seeing teams want is conditional human approval - only trigger review when the action crosses a risk threshold (first time deleting in prod, spend over $X, etc.), not for every call.
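
A rough sketch of what that conditional gate can look like (thresholds and names are illustrative, not a real policy):

    # Illustrative: only escalate to a human when the call crosses a risk threshold.
    SPEND_LIMIT_USD = 100.0
    approved_before: set[tuple[str, str]] = set()  # (tool, env) pairs a human has already approved

    def needs_human(tool: str, args: dict) -> bool:
        env = args.get("env", "dev")
        if args.get("spend_usd", 0) > SPEND_LIMIT_USD:
            return True
        if "delete" in tool and env == "prod" and (tool, env) not in approved_before:
            return True  # first destructive action in prod -> ask once
        return False

    print(needs_human("db.delete_rows", {"env": "prod"}))       # True, first time
    approved_before.add(("db.delete_rows", "prod"))
    print(needs_human("db.delete_rows", {"env": "prod"}))       # False, already approved once
    print(needs_human("billing.charge", {"spend_usd": 250.0}))  # True, over the spend limit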

The audit trail gap is real. Most teams log tool calls but not the policy decision. When something goes wrong, you want to know: was it allowed by policy, or did policy not cover this case?

