I mean at this point can we just conclude that there is a group of engineers who claim to have incredible success with it and a group that claims it is unreliable and cannot be trusted to do complex tasks?
I struggle to believe that a ton of seemingly intelligent software engineers are too dumb to figure out how to use Claude code to get reliable results, it seems much more likely to me that it can do well at isolated tasks or new projects but fails when pointed at large complex code bases because it just... is a token predictor lol.
But yeah spinning up a green fields project in an extensively solved area (ledgers) is going to be something an AI shines at.
It isn't like we don't use this stuff also, I ask Cursor to do things 20x a day and it does something I don't like 50% of the time. Even things like pasting an error message it struggles with. How do I reconcile my actual daily experience with hype messages I see online?
Right, I keep seeing people talking past each other in this same way. I don't doubt folks when they say they coded up some greenfield project 10x faster with Claude, it's clearly great at many of those tasks! But then so many of them claim that their experience should translate to every developer in every scenario, to the point of saying they must be using it wrong if they aren't having the same experience.
Many software devs work in teams on large projects where LLMs have a more nuanced value. I myself mostly work on a large project inside a large organization. Spitting out lines of code is practically never a bottleneck for me. Running a suite of agents to generate out a ton of code for my coworkers to review doesn't really solve a problem that I have. I still use Claude in other ways and find it useful, but I'm certainly not 10x more productive with it.
> But yeah spinning up a green fields project in an extensively solved area (ledgers) is going to be something an AI shines at.
I couldn't disagree with this more. It's impressive at building demos, but asking it to build the foundation for a long-term project has been disastrous in my experience.
When you have an established project and you're asking it to color between the lines it can do that well (most of the time), but when you give it a blank canvas and a lot of autonomy it will likely end up generating crap code at a staggering pace. It becomes a constant fight against entropy where every mess you don't clean up immediately gets picked up as "the way things should be done" the next time.
Before someone asks, this is my experience with both Claude Code (Sonnet/Opus 4.6) and Codex (GPT 5.4).
I suspect many people here have tried it, but they expected it to one-shot any prompt, and when it didn't, it confirmed what they wanted to be true and they responded with "hah, see?" and then washed their hands of it.
So it's not that they're too stupid. There are various motivations for this: clinging on to familiarity, resistance to what feels like yet another tool, anti-AI koolaid, earnestly underwhelmed but don't understand how much better it can be, reacting to what they perceive to be incessant cheerleading, etc.
It's kind of like anti-Javascript posts on HN 10+ years ago. These people weren't too stupid to understand how you could steelman Node.js, they just weren't curious enough to ask, and maybe it turned out they hadn't even used Javascript since "DHTML" was a term except to do $(".box").toggle().
Hypothetically, you have a simple slice out of bounds error because a function is getting an empty string so it does something like: `""[5]`.
Opus will add a bunch of length & nil checks to "fix" this, but the actual issue is the string should never be empty. The nil checks are just papering over a deeper issue, like you probably need a schema level check for minimum string length.
At that point do you just tell it like "no delete all that, the string should never be empty" and let it figure that out, or do I basically need to pseudo code "add a check for empty strings to this file on line 145", or do I just YOLO and know the issue is gone now so it is no longer my problem?
My bigger point is: how does an LLM know that this seemingly small problem is indicative of some larger failure? Let's say this string is a `user.username`, which means users can set their name to empty, which means an entire migration is probably necessary. All the AI is going to do is smoosh the error messages and kick the can.
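To make the distinction concrete, here is a rough Python sketch of the two kinds of fix. Everything in it is hypothetical and invented for illustration: the `Username` type, the `MIN_USERNAME_LENGTH` constant, and both `initial` functions. It assumes the real rule is "a username is never empty" and enforces that once, where the value enters the system, instead of guarding every index site.

```python
from dataclasses import dataclass

# Assumed minimum; the thread only says the string should never be empty.
MIN_USERNAME_LENGTH = 1


@dataclass(frozen=True)
class Username:
    """Hypothetical wrapper: a username that is non-empty once constructed."""

    value: str

    def __post_init__(self) -> None:
        # Reject empty or whitespace-only names at the boundary.
        if len(self.value.strip()) < MIN_USERNAME_LENGTH:
            raise ValueError("username must not be empty")


def initial_papered_over(username: str) -> str:
    """The 'kick the can' fix: guard the index and return a placeholder."""
    if not username:
        return "?"
    return username[0]


def initial(username: Username) -> str:
    """No guard needed here: the type already rules out the empty string."""
    return username.value[0]
```

The first function makes the crash go away but leaves empty usernames in the data. The second pushes the failure back to wherever the bad value was created, which is where a migration or schema constraint would have to be considered anyway.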
1. I'm working in Rust, so it's a very safe and low-defect language. I suspect that has a tremendous amount to do with my successes. "nulls" (Option<T>) and "errors" (Result<T,E>) must be handled, and the AST encodes a tremendous amount about the state, flow, and how to deal with things. I do not feel as comfortable with Claude Code's TypeScript and React outputs - they do work, but it can be much more imprecise. And I only trust it with greenfield Python, editing existing Python code has been sloppy. The Rust experience is downright magical.
2. I architecturally describe every change I want made. I don't leave it up to the LLM to guess. My prompts might be overkill, but they result in 70-80ish% correctness in one shot. (I haven't measured this, and I'm actually curious.) I'll paste in file paths, method names, struct definitions and ask Claude for concrete changes. I'll expand "plumb foo field through the query and API layers" into as much detail as necessary. My prompts can be several paragraphs in length.
3. I don't attempt an entire change set or PR with a single prompt. I work iteratively as I would naturally work, just at a higher level and with greater and broader scope. You get a sense of what granularity and scope Claude can be effective at after a while.
You can't one shot stuff. You have to work iteratively. A single PR might be multiple round trips of incremental change. It's like being a "film director" or "pair programmer" writing code. I have exacting specifications and directions.
The power is in how fast these changes can be made and how closely they map to your expectations. And also in how little it drains your energy and focus.
This also gives me a chance to code review at every change, which means by the time I review the final PR, I've read the change set multiple times.
I have encountered the exact same kind of frustration, and no amount of prompting seems to prevent it from "randomly" happening.
`the error is on line #145 fix it with XYZ and add a check that no string should ever be blank`
It's the randomness that is frustrating, and it drives me crazy that the fix would be quicker to type in manually. I fear that all the "rules" I add to claude.md are wasting my available tokens, so it won't have enough room to process my request.
Yup, this is why I firmly believe true productivity, as in the tool actually making you faster, is limited by the speed of review.
I think Claude makes me faster, but the struggle is always centered around retaining my own context and reviewing code fully. Reviewing code fully to make sure it’s correct and the way I want it, retaining my own context to speed up reviews and not get lost.
I firmly believe people who are seeing massive gains are simply ignoring x% of the lines of code. There’s an argument to be made for that being acceptable, but it’s a risk analysis problem currently. Not one I subscribe to.
Use planning+execution rather than one-shotting, it'll let you push back on stuff like this. I recommend brainstorming everything with https://github.com/obra/superpowers, at least to start with.
Then work on making sure the LLM has all the info it needs. In this example it sounds like perhaps your hypothetical data model would need to be better typed and/or documented.
But yeah as of today it won't pick up on smells as you do, at least not without extra skills/prompting. You'll find that comforting or annoying depending on where you stand...
Always start an implementation in Claude Code plan mode. It's much more comprehensive than going straight to impl. I never read their prompt for plan mode before, but it deep-dives the code, peripheral files, callsites, documentation, existing tests, etc.
You get a better solution but also a plan file that you can review. And, also important, have another agent review. I've found that Codex is really good at reviewing plans.
I have an AGENTS.md prompt that explains that plan file review involves ranking the top findings by severity, explaining the impact, and recommending a fix to each one. And finally recommend a simpler directional pivot if one exists for the plan.
So, start the plan in Claude Code, type "Review this plan: <path>" in Codex (or another Claude Code agent), and cycle the findings back into Claude Code to refine the plan. When the plan is updated, write "Plan updated" to the reviewer agent.
You should get much better results with this, and it's capable of much better arch-level changes rather than narrow topical solutions.
If that's still not working sufficiently for you, maybe you could use more support, like a type-system and more goals in AGENTS.md?
IMO, plan mode is pretty useless. For bug fixes and small improvements, I already know where to edit (and can do it quickly with vim-fu).
For new features, I spend a bit of time thinking, and I can usually break it down in smaller tasks that are easy to code and verify. No need to wrangle with Plan mode and a big markdown file.
I can usually get things one-shotted by that point if I bother with the agent.
My manager and I have been experimenting with it for some stuff, and our most recent attempt at using plan mode was a refactor to change a data structure and make some conversion code unnecessary, then delete it. The plan looked fine, but after it ran the data structure change was incomplete, most of the conversion code was still there, and it introduced several bugs by changing lines it shouldn't have touched at all. Also removed several "why" style comments and arbitrarily changed variable names to be less clear in code it otherwise didn't change.
This was the costliest one we had access to, chosen as an experiment - took $20 over almost a half hour to run.
We reviewed the plan manually, asked it a few questions to clarify parts, and manually tweaked other parts.
I didn't catch what it was, some web dashboard that showed the cost per prompt. We could see it going up as it ran. We were just using the plan our company provided.
Not the person you're replying to but yes, sometimes I do tell the agent to remove the cruft. Then I back up a few messages in the context and reword my request. Instead of just saying "fix this crash", or whatever, I say "this is crashing because the string is empty, however it shouldn't be empty, figure out why it's empty". And I might have it add some tests to ensure that whatever code is not returning/passing along empty strings.
“I struggle to believe that a ton of seemingly intelligent software engineers are too dumb to figure out how to use Claude code to get reliable results”
Seemingly is doing the heavy lifting here. If you read enough comment threads on HN, it will become obvious why they aren’t getting results.
> I struggle to believe that a ton of seemingly intelligent software engineers are too dumb to figure out how to use Claude code to get reliable results.
They're not dumb, but I'm not surprised they're struggling.
A developer's mindset has to change when adding AI into the mix, and many developers either can’t or won’t do that. Developers whose commits look something like "Fixed some bugs" probably aren’t going to take the time to write a decent prompt either.
Whenever there's a technology shift, there are always people who can't or won't adapt. And let's be honest, there are folks whose agenda (consciously or not) is to keep the status quo and "prove" that AI is a bad thing.
No wonder we're seeing wildly different stories about the effectiveness of coding agents.
Here's my 100-file custom scaffolding AI prompt that I've been working on for the last four months, and it can reliably one-shot most math olympiad problems and even a Rust to-do list.
I see two basic cases for the people who are claiming it is useless at this point.
One is that they tried AI-based coding a year or two ago, came to the IMHO completely correct at that time conclusion that it was nearly useless, and have not tried it since then to see that the situation has changed. To which the solution is, try it again. It changed a lot.
The other is those who have incorporated into their personal identity that they hate AI and will never use it. I have seen people do things like fire AI at a task they have good reasons to believe it will fail at, and when it does, project that out to all tasks without letting themselves consciously realize that picking a bad task on purpose stacks the deck.
To those people my solution is to encourage them to hold on to their skepticism. I try to hold on to it as well despite the incredible cognitive temptation not to. It is very useful. But at the same time... yeah, there was a step change in the past year or so. It has gotten a lot more useful...
... but a lot of that utility is in ways that don't obviate skilled senior coding skills. It likes to write scripting code without strong types. Since the last time I wrote that, I have in fact used it in a situation where there were enough strong types that it spontaneously originated some, but it still tends to write scripting code out of that context no matter what language it is working in. It is good at very straight-line solutions to code but I rarely see it suggest using databases, or event sourcing, or a message bus, or any of a lot of other things... it has a lot of Not Invented Here syndrome where it instead bashes out some minimal solution that passes the unit tests with flying colors but can't be deployed at scale. No matter how much documentation a project has it often ends up duplicating code just because the context window is only so large and it doesn't necessarily know where the duplicated code might be. There's all sorts of ways it still needs help to produce good output.
I also wonder how many people are failing to prompt it enough. Some of my prompts are basically "take this and do that and write a function to log the error", but a lot of my prompts are a screen or two of relevant context of the project, what it is we are trying to do, why the obvious solution doesn't work, here's some other code to look at, here's the relevant bugs and some Wiki documentation on the planning of the project, we should use {event sourcing/immutable trees/stored procedures/whatever}, interact with me for questions before starting anything. This is not a complete explanation of what they are doing anymore, but there's still a lot of ways in which what an LLM can really do is style transfer... it is just taking "take this and do that and write a function to log the error" and style-transforming that into source code. If you want it to do something interesting it really helps to give it enough information in the first place for the "style transfer" to get a hold of and do something with. Don't feel silly "explaining it to a computer", you're giving the function enough data to operate on.
I can see huge utility with AI as a guide and helper.
But not having one foot in the code myself is not something I am comfortable with. It starts feeling like management and not development. I really feel the abdication very strongly and it makes me unable and unwilling to put a hard stamp on quality. I have seen too much hallucination or half-missed requirements to put that much trust in AI.
It's the same with code reviews of hard tickets. You can scroll past and just approve, but do you really understand what your colleague has built? Are you really in the driver's seat? It feels to me like YOLOing with major consequences.
I don't buy at all that people doing 20x output have any idea what they are coding. They are just pressing the YOLO button and no one, not the engineer, not the AI and not management, is in the driver's seat. It is a very scary time.