
It's interesting that Zen 5's FPUs running in full 512-bit wide mode don't actually seem to cause any trouble, but that lighting up the load/store units does. I don't know enough about hardware-level design to know whether this would be "expected".

The full investigation in this article is really interesting, but the TL;DR is: light up enough of the core, and frequencies will have to drop to stay within the power envelope. The transition period is handled very smartly, but it still exists - yet as opposed to the old Intel AVX-512 cores, which got endless (deserved?) bad press for their transition behavior, this is more or less seamless.



On the L/S unit impact: data movement is expensive, computation is cheap (relatively).

In "Computer Architecture: A Quantitative Approach" there are numbers for the now-old TSMC 45 nm process: a 32-bit FP multiplication takes 3.7 pJ, and a 32-bit read from an 8 kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its tag comparison and LRU logic (more expensive).

Then I have some 2015 numbers for Intel's 22 nm process, old too. A 64-bit FP multiplication takes 6.4 pJ, a 64-bit read/write from a small 8 kB SRAM 4.2 pJ, and from a larger 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more expensive cache.
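Putting the cited figures side by side (a toy calculation using only the numbers above, no new data):

```python
# Energy figures quoted above, in picojoules.
fp32_mul_45nm = 3.7    # 32-bit FP multiply, TSMC 45 nm
sram_read_45nm = 5.0   # 32-bit read from an 8 kB SRAM, TSMC 45 nm

fp64_mul_22nm = 6.4    # 64-bit FP multiply, Intel 22 nm (2015 numbers)
sram_8k_22nm = 4.2     # 64-bit access, 8 kB SRAM
sram_256k_22nm = 16.7  # 64-bit access, 256 kB SRAM

# At 45 nm a multiply is already cheaper than even a small SRAM read...
print(fp32_mul_45nm / sram_read_45nm)   # ~0.74
# ...and at 22 nm a 64-bit multiply costs well under half a 256 kB access.
print(fp64_mul_22nm / sram_256k_22nm)   # ~0.38
```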

The cost of a multiplication is quadratic in operand width, while access cost should be closer to linear, so the computation cost in the second example is much heavier (compare the mantissa sizes - that's what is actually multiplied).

The trend gets even worse with more advanced processes. Data movement is usually what matters most now, except for workloads with very high arithmetic intensity, where computation will dominate (in practice: large enough matrix multiplications).
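The arithmetic-intensity point can be made concrete with the 22 nm numbers above: how many multiplies must each fetched operand feed before compute energy overtakes data-movement energy? (Illustrative estimate only.)

```python
# Break-even arithmetic intensity, using the 22 nm figures quoted above.
mul_pj = 6.4     # energy per 64-bit FP multiply
fetch_pj = 16.7  # energy per 64-bit access from a 256 kB SRAM

breakeven = fetch_pj / mul_pj
print(breakeven)  # ~2.6 multiplies per fetched operand

# An n x n matrix multiply reuses each fetched element about n times, so
# for any reasonably large n the arithmetic side dominates; streaming
# workloads (reuse ~1) stay data-movement-bound.
```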


Appreciate the detail! That explains a lot of what is going on. It also dovetails with some interesting facts I remember reading about the relative power consumption of the Zen cores versus the Infinity Fabric connecting them - the percentage of package power used simply by running the fabric interconnect was shocking.


Right, but a SIMD single-precision mul is linear (or even sub-linear) relative to its scalar cousin. So a 16x32, 512-bit mul won't be even 16x the cost of a scalar mul - the decoder only has to do the same amount of work, for example.


The calculations within each unit may be, true, but routing and data transfer are probably the biggest limiting factors on a modern chip. It should be clear that placing 16x units of non-trivial size means the average unit will likely be farther from the data source than a single unit, and transmitting data over distance can have greater-than-linear costs (not just resistance/capacitance losses; to hit timing targets you need faster switching, which means higher voltages, etc.)


Both Intel and AMD to some extent separate the vector ALUs and the register file in 128-bit (or 256-bit?) lanes, across which arithmetic ops won't need to cross at all. Of course loads/stores/shuffles still need to though, making this point somewhat moot.


AFAIK you have to think about how many different 512b paths are being driven when this happens, like each cycle in the steady-state case is simultaneously (in the case where you can do two vfmadd132ps per cycle):

- Capturing 2x512b from the L1D cache

- Sending 2x512b to the vector register file

- Capturing 4x512b values from the vector register file

- Actually multiplying 4x512b values

- Sending 2x512b results to the vector register file

.. and probably more? That's already like 14*512 wires (switching constantly at 5 GHz!), and there are probably even more intermediate stages.
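A toy dynamic-power estimate for those datapath wires, using P = a * C * V^2 * f per wire. Every constant below (switched capacitance, voltage, activity factor) is a round guess, chosen only to show the order of magnitude:

```python
# Order-of-magnitude dynamic power for ~14x512 wires toggling at 5 GHz.
wires = 14 * 512   # 512-bit paths active per cycle, per the list above
cap_f = 2e-15      # ~2 fF effective switched capacitance per wire (guess)
volts = 1.0        # core voltage (guess)
freq = 5e9         # 5 GHz
activity = 0.5     # fraction of wires toggling each cycle (guess)

power_w = wires * activity * cap_f * volts**2 * freq
print(f"{power_w * 1e3:.0f} mW")  # tens of mW just for these wires, per core
```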


… per core. There are eight per compute tile!

I like to ask IT people a trick question: how many numbers can a modern CPU multiply in the time it takes light to cross a room?
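The trick question, worked out with round numbers (room size, clock speed, and port count are all assumptions for illustration):

```python
# How many multiplies can one core do while light crosses a 5 m room?
room_m = 5.0
c = 3e8                  # speed of light, m/s
crossing_s = room_m / c  # ~16.7 ns

freq = 5e9               # assume a 5 GHz core
muls_per_cycle = 2 * 16  # assume 2x 512-bit mul ports * 16 fp32 lanes
cycles = crossing_s * freq
per_core = cycles * muls_per_cycle
print(round(per_core))   # ~2667 multiplies, on ONE core
```

Multiply by core count and the answer runs well into the tens of thousands per socket.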


Piggybacking on this: memory scaling was slower than compute scaling, at least since the 45 nm example. At 4 nm the difference is larger.


Random logic has also had much better area scaling than SRAM since EUV, which implies the gap continues to widen at a faster rate.


> but as opposed to the old intel avx512 cores that got endless (deserved?) bad press for their transition behavior, this is more or less seamless.

The problem with Intel was that the AVX frequencies were secret. They were never disclosed for later cores where the power envelope got tight, and using AVX-512 killed performance throughout the socket. This meant that if one core was using AVX-512, the other cores in the same socket throttled down due to thermal load and the power cap, causing every process on the socket to suffer. That's a big no-no for cloud or HPC workloads where nodes are shared by many users.

Secrecy and downplaying of this effect made Intel's AVX-512 frequency and behavior infamous.

Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.
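The noisy-neighbor effect can be sketched with a two-line model. The frequencies below are illustrative only, since (as noted) Intel's actual AVX frequencies were not disclosed:

```python
# Toy model: one tenant's AVX-512 code drops the shared all-core clock,
# and every other tenant on the socket pays for it.
base_ghz = 3.0    # assumed non-AVX all-core turbo (illustrative)
avx512_ghz = 2.1  # assumed clock once AVX-512 kicks in (illustrative)

slowdown = 1 - avx512_ghz / base_ghz
print(f"{slowdown:.0%}")  # 30% lost by every process on the socket
```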


> Oh, doing your own benchmarks on your own hardware which you paid for and releasing the results to the public was verboten, btw.

Well, Cloudflare did anyway.


To be clear, the problem with the Skylake implementation was that triggering AVX-512 would downclock the entirety of the CPU. It didn’t do anything smart, it was fairly binary.

This AMD implementation instead seems to be better optimized and plug into the normal thermal operations of the CPU for better scaling.


Reading the section under "Load Another FP Pipe?" I'm coming away with the impression that it's not the LSU but rather total overall load that causes trouble. While that section is focused on transition time, the end steady state is also slower…


I haven’t read the article yet, but back when I tried to get to over 100 GB/s IO rate from a bunch of SSDs on Zen4 (just fio direct IO workload without doing anything with the data), I ended up disabling Core Boost states (or maybe something else in BIOS too), to give more thermal allowance for the IO hub on the chip. As RAM load/store traffic goes through the IO hub too, maybe that’s it?


I don't think these things are related; this is talking about the LSU right inside the core. I'd also expect oscillations if there were a thermal problem like you're describing, i.e. the core clocks up when the IO hub delivers data, the IO hub stalls, causing the core to stall as well, then the IO hub can run again delivering data, repeat from the beginning.

(Then again, boost clocks are an intentional oscillation anyway…)


Ok, I just read through the article. As I understand, their tests were designed to run entirely on data on the local cores' cache? I only see L1d mentioned there.


Yes, that's my understanding of "Zen 5 also doubles L1D load bandwidth, and I’m exercising that by having each FMA instruction source an input from the data cache." Also, considering the author's other work, I'm pretty sure they can isolate load-store performance from cache performance from memory interface performance.


It seems even more interesting than the power envelope. It looks like the core is limited by the ability of the power supply to ramp up. So the dispatch rate drops momentarily and then goes back up to allow power delivery to catch up.
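That ramp-and-recover behavior can be sketched as a first-order feedback loop. All constants here are invented purely to illustrate the shape, not measured from any CPU:

```python
# Crude sketch: dispatch throttles while the voltage regulator ramps,
# then recovers to full rate once power delivery catches up.
supply = 0.7   # fraction of needed current available right after the load step
dispatch = []
for cycle in range(10):
    rate = min(1.0, supply)          # dispatch limited by available power
    dispatch.append(rate)
    supply += 0.1 * (1.2 - supply)   # regulator ramps toward its headroom
print([f"{r:.2f}" for r in dispatch])  # dips at first, then saturates
```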



