I responded with a mix of mostly B and C answers and got “advanced.” Yet, as pointed out by another commenter, selecting all D answers (which would make you an expert!) gets you called a beginner.
I can only assume the quiz itself was vibe-coded and not tested. What an incredible time we live in.
Or it's accounting for the Dunning-Kruger effect: if you think you're an expert in all cases, you're really a beginner in everything.
I'm a beginner with agentic coding. I vibe code something most days, from a few lines up to refactors over a few files. I don't knowingly use skills, rarely _choose_ to call out to tools, haven't written any skills and only one or two ad hoc scripts, and have barely touched MCPs (because the few I've used seem flaky and erratic). I answered as such and got... intermediate.
> For tasks that would take a human under four minutes—small bug fixes, boilerplate, simple implementations—AI can now do these with near-100% success. For tasks that would take a human around one hour, AI has a roughly 50% success rate. For tasks over four hours, it comes in below a 10% success rate.
Opus 4.6 now does 12hr tasks with 50% success. The METR time horizon chart is insane… exponential progression.
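The "exponential progression" claim is easy to sanity-check with arithmetic. A minimal sketch of the trend — note the 7-month doubling period is an illustrative assumption here, not a figure stated in this thread:

```python
import math

# Back-of-envelope for an exponential time-horizon trend.
# ASSUMPTION: the 50%-success task horizon doubles roughly every
# 7 months; the exact doubling period is illustrative, not a
# number taken from this discussion.
def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Horizon after `months_elapsed` months, starting from `h0_minutes`."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# How far is it from a 1-hour horizon to a 12-hour horizon?
doublings = math.log2(720 / 60)   # ~3.58 doublings
months = doublings * 7.0          # ~25 months at the assumed rate
print(f"{doublings:.2f} doublings, ~{months:.0f} months")
```

Under that assumed rate, going from one-hour tasks to twelve-hour tasks takes only a couple of years, which is why the chart looks so steep.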
Really depends on what you're working in. I work with a lot of data frameworks that are maybe underrepresented in these models' training sets, and they still tend to get things wrong. The other issue is that business logic is complex to describe in a prompt, to the point where giving the model all the context and business logic it needs to succeed is almost as much work as doing it myself. As a data engineer, I still only find models useful for small chunks of code or for filling in tedious boilerplate to get things moving.
Agreed. For common use cases, like creating a simple LMS, Opus is shockingly good, saving hours upon hours of reinventing the wheel. For other things, like simple queries to and interactions with our ERP system, it is still quite poor, and it increases development time rather than shortening it.
Just anecdotal but I work on some fairly left field service architectures; today it was a highly parallelized state machine processor operating on an in-house binary protocol.
Opus 4.6 had no issue correctly identifying and mitigating a hairy out-of-order state corruption issue involving a non-trivial sequence of runtime conditions from thrown errors and failed recoveries. This was simply from having access to the code repository and a brief description of the observed behavior that I provided. Naturally I verified it wasn't bullshitting me, and sure enough it was correct. Impressive really, given none of the specifics could have been in its training set, but I guess we're finding that nothing really is "new", just a remix of what's come before in various recombinations.
How is success defined in those metrics? Is success "perfect - can deploy to prod immediately" or "saved some arbitrary amount of engineering time"?
Anecdotal experience from my team of 15 engineers: we rarely get "perfect," but we do get enough to realize massive time savings across several common problem domains.
Being an effective market doesn’t mean you get everything you want.
You’re actually saying: “I want Apple’s software, and I want certain chips, and I want a certain form factor. And if Apple won’t build what I want, I will pass a law to make them build it for me!”
Come on man. You will make tradeoffs either way. The answer isn’t: force a company to build what I want them to build.
Well, another version of it is: I want to be able to talk to my family, but I don't want to buy an iPhone. The EU rightly regulated that any chat network big enough must open its doors to different platforms. Or: I don't want to buy Microsoft Office for my employees, but I want to be able to do business with those who do, and thankfully we have relatively open document formats now.
The chips argument is contrived, the OS argument less so, but it's all just network effects at some level, and it's important for competition and effective markets that we prevent the largest networks from locking people in and forcing them to make a lot of other unrelated decisions.
Take iMessage being a closed ecosystem: Apple finally added RCS support, but only after regulatory pressure.
To not recognise this as a limitation is to be wilfully blind to network effects. The "green bubbles" issue was a huge issue in the US. Similarly, WhatsApp not being open is a huge problem in forcing people onto Meta's platforms.
Do you believe that states are the laboratories of democracy, and have rights, or do you believe that reducing the cost of regulatory compliance is a more important goal?
I take no position on this currently, but it's an important question that deserves a serious answer. Trading off the costs of "state experimentation" and "enforced regulatory conformity" is non-trivial to do.
To be clear, I wasn't exclusively referring to government. I was actually only thinking of the use of git-like version control across a number of different domains: law, design, book writing, architecture, etc.
For example, there are thousands of divisions of government out there provisioning largely the same systems in duplicate. E.g. the local government here has a web portal for sports venue bookings like pools and tennis courts, a waste collection portal, and a local tax portal.
Only recently has this been slightly standardized but even those efforts are purely regional. You might get 5 local councils in the city using one SaaS platform, another 5 using another SaaS platform, and another 5 rolling their own. For each function of local government.
Never mind the fact that a local government like this in France probably has very similar needs to one in Belgium or even the US.
And the worst part is they are terrible at procurement so even when they do consolidate, they're basically getting scammed.
I often think about starting a cost-plus-priced open core project to deal with these issues. Like we build common government functions, and sell it for cost plus 20% markup, with a licence that lets the gov run it themselves if we ever go bust. But then I think procurement is largely a grift game and it might not do well for that reason.
Wouldn’t consolidation lead to monopoly? If 50 local governments use the same SaaS/vendor, the 51st local government would likely go for the same vendor just because 50 others used that vendor before them, no? What prevents the vendor from jacking up prices or general enshittification at that stage?
> What prevents the vendor from jacking up prices or general enshittification at that stage?
Well, what I'm proposing building would be source-available and licensed such that the gov can run it themselves if it ever gets too expensive. The sub-gov entities should really band together for the negotiation, though; then they can ask for whatever they want: non-profit vendors, liberal licensing, price agreements. A collective of government buyers forms basically a monopsony larger than any individual vendor could ever be.
How is it not? It reads to me as them saying that all these devs have deskilled from "barely competent" to "completely helpless". Or is your claim that they were actually really good devs, and the deskilling has been even more intense than I'm picturing?
My personal experience is we're seeing a magnification of results. The slog is reading hundreds of files, updating active code to remove some old function from 100k lines of code. Last week, a modification that, while trivial, would have taken weeks by hand was done by AI agents, which corrected the code with 100% verified accuracy in 20 minutes.
Yes, and they work really well for small side projects that an exec probably used to try out the LLM.
But writing code in one clean discrete repo is (esp. at a large org) only a part of shipping something.
Over time, I think tooling will get better at the pieces surrounding writing the code though. But the human coordination / dependency pieces are still tricky to automate.
“S&P had its worst X since Y”
Worst quarter in four years. Worst week since 2018. Worst 3 days since 2008.
It’s all kind of silly.