Running this microbenchmark on an Intel Sapphire Rapids CPU, compiled with -march=k8 to get the older form, takes ~980ns, while compiling with -march=native gives ~570ns. It's not at all clear that the imperfection the article describes is really relevant in context, because the compiler transforms this function into something quite different.
Compilers often under-generate conditional instructions. They implicitly assume (correctly) that most branches you write are 90/10 (i.e., very predictable), not 50/50. The branches that actually are 50/50 suffer from being treated as 90/10.
Do you know that for a fact? For all calls of clamp? I have definitely used min and max when they are true 50/50s and I assume clamp also gets some similar use.
It's hard to predict statically which branches will be dynamically unpredictable.
A seasoned hardware architect once told me that Intel went all-in on predication for Itanium, under the assumption that a Sufficiently Smart Compiler could figure it out, and then discovered to their horror that their compiler team's best efforts were not Sufficiently Smart. He implied that this was why Intel pushed to get a profile-guided optimization step added to the SPEC CPU benchmark, since profiling was the only way to get sufficiently accurate data.
I've never gone back to see whether the timeline checks out, but it's a good story.
By avoiding conditional branches and essentially masking out some instructions, you can avoid stalls and mis-predictions and keep the pipeline full.
Actually I think @IainIreland misremembers what the seasoned architect told him about Itanium. While Itanium did support predicated instructions, the problematic static scheduling was actually because Itanium was a VLIW machine: https://en.wikipedia.org/wiki/VLIW .
TL;DR: dynamic scheduling on superscalar out-of-order processors with vector units works great and the transistor overhead got increasingly cheap, but static scheduling stayed really hard.