They may be, but there are lots of languages, lots of approaches, lots of methodologies, and just a ton of different ways to "code". Coding isn't one homogeneous activity that one model beats all the other models at.
> what specific tasks is one performing better than the other?
That's exactly why you create your own benchmark: so you can figure that out by running a list of models against your own tasks, instead of testing each one manually and going by what "feels better".
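The harness doesn't need to be fancy. A minimal sketch of the idea, with placeholder model names and a stubbed `query_model` (swap in your real API client):

```python
# Minimal personal coding benchmark: run each model against your own
# tasks and score with simple pass/fail checks.

def query_model(model: str, prompt: str) -> str:
    """Placeholder: call your provider's API here. Stubbed with canned
    replies purely for illustration."""
    canned = {
        "model-a": {
            "reverse a string in python": "s[::-1]",
            "sum a list in python": "sum(xs)",
        },
        "model-b": {
            "reverse a string in python": "list(reversed(s))",
        },
    }
    return canned.get(model, {}).get(prompt, "")

# Your private task list: a prompt plus a checker that encodes what
# "correct" means for *your* work. These checkers are toy examples.
TASKS = [
    ("reverse a string in python", lambda out: "[::-1]" in out or "reversed" in out),
    ("sum a list in python", lambda out: "sum(" in out),
]

def run_benchmark(models, tasks):
    """Return each model's pass rate over the task list."""
    scores = {}
    for model in models:
        passed = sum(1 for prompt, check in tasks
                     if check(query_model(model, prompt)))
        scores[model] = passed / len(tasks)
    return scores

results = run_benchmark(["model-a", "model-b"], TASKS)
# e.g. {'model-a': 1.0, 'model-b': 0.5} -- now you have a per-task
# comparison instead of a vibe.
```

The point is the per-task breakdown: a new model drops in as one more entry in the list, and you immediately see where it wins or loses on problems you actually care about.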
You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?
You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.
It may not be code-only, but it was trained extensively for coding:
> Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.
I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.
I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. The difference is that while both do tons of code RL, the Coder variant focuses entirely on code post-training pipelines, doesn't monitor performance on anything else for forgetting/regression, and isn't meant for other tasks.