They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code". Coding isn't one homogeneous activity where one model beats all the other models.
> what specific tasks is one performing better than the other?
That's exactly why you create your own benchmark: you run a list of models through the same tasks and see where each one does better, instead of trying each one individually and going by what "feels better".
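For what it's worth, here is a minimal sketch of what such a personal benchmark could look like. Everything in it is an assumption for illustration: `model_a` and `model_b` are hypothetical stubs standing in for real model calls, the two tasks are toy examples, and a substring check is a much cruder pass/fail test than you would want in practice.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A task is (category, prompt, checker); the checker decides whether an answer passes.
Task = Tuple[str, str, Callable[[str], bool]]
Model = Callable[[str], str]


def run_benchmark(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, Dict[str, float]]:
    """Return the pass rate of every model, broken down by task category."""
    passed = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for category, prompt, check in tasks:
        totals[category] += 1
        for name, model in models.items():
            if check(model(prompt)):
                passed[name][category] += 1
    return {
        name: {cat: passed[name][cat] / totals[cat] for cat in totals}
        for name in models
    }


# Hypothetical stand-ins: replace these with calls to the models you actually use.
def model_a(prompt: str) -> str:
    return "DISPLAY 'HI'." if "COBOL" in prompt else "?"


def model_b(prompt: str) -> str:
    return 'main = putStrLn "hi"' if "Haskell" in prompt else "?"


# Toy tasks: in practice, take these from your own day-to-day work.
TASKS: List[Task] = [
    ("cobol", "Print hi in COBOL", lambda out: "DISPLAY" in out),
    ("haskell", "Print hi in Haskell", lambda out: "putStrLn" in out),
]

scores = run_benchmark({"model_a": model_a, "model_b": model_b}, TASKS)
for name, per_category in scores.items():
    print(name, per_category)
```

The point of keeping a category on every task is that you get a per-category table rather than a single overall score, which is what lets you answer "which model for which task".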
You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?
You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.