GPU Caching Compared Among AMD, Intel UHD, Apple M1 (chipsandcheese.com)
99 points by jb1991 on Jan 16, 2023 | 27 comments


The unified memory on the Apple M1 Macs, which goes up to 64G, is really quite intriguing. I managed to create a model taking 32G using PyTorch the other day and it was able to handle it using native GPU acceleration. This is larger than any other GPU memory I have access to. Curious whether this actually makes such machines an interesting target for ML developers, or not.


The concept is solid, but Apple needs to work on improving performance to be relevant in this field. If they add dedicated matmul capabilities to their GPUs and implement native limited-precision support, their ML training performance will improve by 4-6x, which will instantly make Apple Silicon much more attractive in this domain. The software stack and programmability need some improvements as well. For example, a unified virtual address space and improvements in CPU/GPU communication would be welcome additions (contrary to intuition, the latency of CPU/GPU transfers is higher on M1 than on many dGPUs, because it can take a very long time for a GPU program to be scheduled).


> native limited-precision support

What do you mean by this? They do support integers from 8 bits and up and natively support 16-bit floats. Are you referring to something else, like 8-bit floats?


I mean actually executing the operations at that precision with improved performance. Apple GPUs support both FP16 and FP32 as data types, but the ALU throughput for both is identical (my personal speculation is that the ALUs are 32-bit only and the rest is data type conversion). From an operational standpoint, an Apple G13 SIMD can only do 32 FLOPs per cycle, no more and no less.

But other GPUs support doing operations on limited-precision data types faster. And Nvidia has dedicated matrix multiplication units that can perform very wide limited precision operations per cycle (Apple has similar units but they are part of the CPU clusters).

Since the A15/M2, Apple offers a SIMD matrix multiplication intrinsic (very similar to VK_NV_cooperative_matrix). But the performance is limited by the fact that each SIMD only offers 32 ALUs. If they added the ability to reconfigure these as 64 FP16 ALUs (or 128 FP8 ALUs), and maybe even doubled the ALUs like Nvidia/AMD recently did with their architectures, they could achieve much higher matmul performance for ML.
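
If you want to sanity-check the identical-FP16/FP32-throughput point yourself, here is a minimal PyTorch sketch (assuming a recent PyTorch build with the MPS backend; it only times matmul, and the exact numbers obviously depend on your machine):

    import time
    import torch

    def bench_matmul(dtype, n=4096, iters=20):
        # Two square matrices on the Apple GPU via the MPS backend.
        a = torch.randn(n, n, device="mps", dtype=dtype)
        b = torch.randn(n, n, device="mps", dtype=dtype)
        a @ b                        # warm-up (first call may compile kernels)
        torch.mps.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.mps.synchronize()      # wait for the GPU before stopping the clock
        per_call = (time.perf_counter() - start) / iters
        tflops = 2 * n ** 3 / per_call / 1e12
        print(f"{dtype}: {per_call * 1e3:.1f} ms/matmul, ~{tflops:.1f} TFLOPS")

    if torch.backends.mps.is_available():
        bench_matmul(torch.float32)
        bench_matmul(torch.float16)

If the two results come out roughly equal rather than FP16 being ~2x faster, that's consistent with the 32-ALU-per-SIMD picture above.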


thanks for the info!

I have very limited knowledge of these things, but I did compare matmul with PyTorch using the GPU vs not and there is a dramatic improvement. So even if it's not fully optimised yet, it's still a huge bonus to have this available. If it could be improved another 4-6x that would be stupendous.

(For context: I'm observing that an 8000x8000 matrix multiply takes ~1s on the CPU and ~5ms on the GPU.)
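
For anyone who wants to reproduce this at home, a rough sketch of the comparison (assuming PyTorch with the MPS backend; note that you have to synchronize before stopping the timer, otherwise you mostly measure how fast the work was queued rather than how fast it ran):

    import time
    import torch

    n = 8000
    a = torch.randn(n, n)
    b = torch.randn(n, n)

    # CPU baseline
    t0 = time.perf_counter()
    a @ b
    cpu_s = time.perf_counter() - t0

    # GPU via Apple's Metal backend
    a_mps, b_mps = a.to("mps"), b.to("mps")
    a_mps @ b_mps                  # warm-up: first call pays one-time setup cost
    torch.mps.synchronize()
    t0 = time.perf_counter()
    a_mps @ b_mps
    torch.mps.synchronize()        # wait for the GPU to actually finish
    gpu_s = time.perf_counter() - t0

    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.4f}s  speedup: {cpu_s / gpu_s:.0f}x")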


> as modern dedicated GPUs, can theoretically do zero-copy transfers by mapping the appropriate memory on both the CPU and GPU.

Is this true for dGPUs? How does this work?


This is not specific to dGPUs; it could apply to any PCIe device. Emphasis on "theoretically" too.

On the device (dGPU here), it is possible to route memory accesses to part of the internal address space to the PCIe controller. In turn, the PCIe controller can translate such received memory access into a PCIe request (read or write), in the different PCIe address space, with some address translation.

This PCIe request goes to the PCIe host (the CPU in a dGPU scenario). Here too, the host PCIe controller can map the PCIe request, using a PCIe address space address, into the host address space. And this can go to the host memory (usually after IOMMU filtering and address translation). And all of this comes back for the return trip to the device in the case of a read.

So latency would be rather high, but it is technically possible. In most applications such transfers are offloaded to a DMA engine in the PCIe controller doing a copy between the PCIe and local address spaces, but a processing core can certainly do a direct access without DMA if all the address mappings are suitably configured.


Uuuuuh, ok, but.. what’s the point of doing so? If I do zero-copy on a shared memory area between cpu and gpu, the advantage is clear - no copy and fast transfer.

If I map some host memory to the GPU… I get worse latency and worse bandwidth. Most likely not a win.


That's why the author says "theoretically" I guess ;) Yes, in practice you probably wouldn't want your GPU compute engines to do such direct accesses and stall for a long time on each access, even for one-shot streaming processing. Even to avoid using the GPU's main memory, one would likely use DMA copies to a local working memory and do the processing there in chunks. But the direct mapping can still be convenient: a local DMA engine (or any HW coprocessor) can access host or GPU memory in the same way.


See AMD "Smart Access Memory", a.k.a. PCIe resizable BAR. This expands the amount of GPU memory that the CPU can directly access, usually to the GPU's entire memory range (ordinarily only ~256MB is accessible). GPU->CPU reads have very high latencies, but that's not an issue for CPU->GPU writes.

GPUs have been able to access "host" memory for a long time now, with a few restrictions: you have to set up the GPU mappings first and pin the pages in memory.
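
As a concrete illustration of the "map it and pin the pages" part, a sketch using Numba's CUDA bindings on an Nvidia dGPU (mapped pinned memory, so the kernel touches host RAM directly over PCIe; this shows the mechanism, not how you'd want to structure real workloads):

    import numpy as np
    from numba import cuda

    # Page-locked host memory that is also mapped into the GPU's address space;
    # no explicit copy to device memory is ever issued for this buffer.
    buf = cuda.mapped_array(1024, dtype=np.float32)
    buf[:] = np.arange(1024, dtype=np.float32)

    @cuda.jit
    def double_in_place(arr):
        i = cuda.grid(1)
        if i < arr.shape[0]:
            arr[i] *= 2.0          # each access crosses the PCIe bus

    double_in_place[8, 128](buf)   # 8 blocks x 128 threads covers 1024 elements
    cuda.synchronize()
    print(buf[:4])                 # [0. 2. 4. 6.] -- written by the GPU, read by the CPU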


In theory for a long time you've been able to "persistently map" A GPU side buffer that houses things like indexes, vertex data, or even textures, and then write directly* into GPU memory from the CPU without a staging buffer. This was referred to as 'AZDO' (Approaching Zero Driver Overhead) in the OpenGL space and eventually fed into the design of Vulkan and Direct3D 12 (see https://www.gdcvault.com/play/1020791/Approaching-Zero-Drive... if you're curious about all of this)

I say in theory and used an asterisk because I think it's generally the case that the driver could lie and just maintain an illusion for you by flushing a staging buffer at the 'right time'. But in practice my understanding is that the memory writes will go straight over the PCIe bus to the GPU and into its memory, perhaps with a bit of write-caching/write-combining locally on the CPU. It would be wise to make sure you never read from that mapped memory :)
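
For the curious, the GL-side flow looks roughly like this (a sketch with PyOpenGL, assuming a GL 4.4+ context already created elsewhere, e.g. via GLFW, and omitting the fencing you'd need before overwriting data the GPU is still reading):

    import ctypes
    from OpenGL.GL import (
        GL_ARRAY_BUFFER, GL_MAP_COHERENT_BIT, GL_MAP_PERSISTENT_BIT,
        GL_MAP_WRITE_BIT, glBindBuffer, glBufferStorage, glGenBuffers,
        glMapBufferRange,
    )

    SIZE = 16 * 1024 * 1024   # 16 MiB of vertex/index data, say
    FLAGS = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT

    # Immutable storage: the driver may place it in device memory, but it stays
    # mapped into our address space for the buffer's whole lifetime.
    buf = glGenBuffers(1)
    glBindBuffer(GL_ARRAY_BUFFER, buf)
    glBufferStorage(GL_ARRAY_BUFFER, SIZE, None, FLAGS)

    # Map once, keep the pointer "forever" (persistent); COHERENT means writes
    # become visible to the GPU without explicit flushes.
    ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, SIZE, FLAGS)

    # CPU writes now go (more or less) straight across the bus into the buffer.
    payload = bytes(range(256))
    ctypes.memmove(ptr, payload, len(payload))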


AZDO is a general term for techniques that reduce driver overhead, not limited to persistent mapping.

OpenGL drivers have a habit of trying to second-guess the application (though this depends on the driver, e.g. Nvidia guesses a lot, Mesa not so much), but passing or not passing GL_CLIENT_STORAGE_BIT to glBufferStorage should decide whether the buffer resides in CPU- or GPU-side memory.

In D3D11 directly mapped GPU memory is known as D3D11_USAGE_DYNAMIC.


I think that was the whole point of https://en.wikipedia.org/wiki/Heterogeneous_System_Architect... by https://github.com/HSAFoundation/HSA-Runtime-AMD , as a means to standardise ways to access memory. Not much came out of it, it seems. AMD itself seems to have abandoned that particular thing.

It seems there are other ways now to achieve this, at lower levels in the hardware, 'transparent' to the layers above. 3D V-Cache stacked on top of the die, and https://www.techarp.com/computer/amd-infinity-cache-explaine... come to mind.


The problem is that the only PoCs were on APUs and the PS4, both of which had shared access to the RAM anyway, making it much simpler to implement. The same is true for the M1. The real trick is doing it on something that doesn't have unified memory.


>bandwidth is the same for AMD and Apple and much lower for Intel.

Later

>Intel: 700, AMD 1400, Apple: 2100

I wouldn’t call 2x and 3x “similar”.

Also, I don’t see why the author thinks desktop chips with integrated graphics are meant to be paired with a discrete GPU. Surely the opposite is true. I got a faster CPU by not getting one with integrated graphics.

Finally, isn’t the fact that Apple has a fundamentally different rendering pipeline relevant?


At least with regards to Intel CPUs, iGPU-less CPUs (the ones with -F suffixes) are otherwise identical to the standard ones with iGPUs. The main reason to buy them is the slightly lower price, which could make a difference if you're on a tight budget.

On a tangential note, it's great having an iGPU even if you are almost never going to use it. If your discrete GPU borks, you have a fallback ready and waiting. If you do use it alongside a discrete GPU, you can offload certain lower priority tasks like video encoding/decoding to it.


And on the AMD side, for several generations their desktop CPUs with integrated GPUs were lagging behind the ones without, because they were really just packaging their laptop silicon for the desktop socket. Each new iteration of the Zen microarchitecture has shipped first in the chiplet-based desktop and server products, then later incorporated into the monolithic laptop SoCs.

Now AMD's desktop chiplet-based CPUs have a tiny GPU in the IO/memory controller die, ill-suited to anything more advanced than everyday web browsing.


> ill-suited

also not meant for anything but office use, debugging/ease of life, and maybe offloading some tasks from the dedicated GPU to the included decoder in the future.

There is still a good chance we will see some APUs soon, like a 7700G; it will be interesting to see if they will be Zen 4.


> Finally, isn’t the fact that Apple has a fundamentally different rendering pipeline relevant?

Is it still all that fundamentally different? All of the RDNA parts are tile-based renderers (I think even the Vega series GCN parts made that switch?)


It's pretty different alright. First, there is the tile size. For the current crop of desktop GPUs, tiling is primarily about cache locality (if you keep your processing spatially local, you are also less likely to thrash caches), but they still have very fast RAM and want to keep the triangle binning overhead to a minimum. So the tile size for desktop GPUs is much larger (if I remember correctly, it was about 128x128 pixels or something like that when I last tested it on Navi). Mobile GPUs really want to keep all of the relevant processing in the local memory entirely, so they use much smaller tiles (32x32 or even 16x16) at the expense of more involved and costly binning.

Apple (inherited from PowerVR) adds another twist on top: the rasterised pixels are not shaded immediately but instead collected in a buffer. Once all fragments in a tile are rasterised, you basically have an array with visible-triangle information for each pixel. Pixel shading is then simply a compute pass over this array. This can be more efficient as you only need to shade visible pixels, and it might utilise the SIMD hardware better (as you are shading 32x32 blocks containing multiple triangles at once rather than shading triangles separately), plus it radically simplifies dealing with pixels (there are never any data races for a given pixel, pixel data write-out is just a block memcpy, programmable blending is super easy and cheap to do) — in fact, I don't believe that Apple even has ROPs. There are of course disadvantages as well — it's very tricky to get right and requires specialised fixed-function hardware, you need to keep transformed primitive data around in memory until all primitives are processed (because shading is delayed), and there are tons of corner cases you need to handle which can kill your performance (transparency, primitive buffer overflows, etc.). And of course, many modern rendering techniques rely on global memory operations, and there is an increasing trend to do rasterisation in a compute shader, where this rendering architecture doesn't really help.


They might rasterize fragments inside tiles to reduce blending costs, but they still very much behave like immediate renderers: single-pass, with vertex shading results passed continuously into fragment shaders. The Apple GPU is a tile-based deferred renderer: the vertex stage runs first, storing results into an intermediate buffer; then each tile is processed by running the fragment shader, flushing results to the framebuffer at the end. This reduces memory bandwidth but might require multiple passes when, e.g., the intermediate vertex output buffer overflows.


And there are GPUs that have both operating modes: Adreno.


Does Adreno really have a deferred mode? The documentation I could find only describes tiled immediate rendering.

Edit: I just had another look, pretty sure this is standard Tile-Based Immediate Rendering. The documentation sometimes refers to this as "deferred" probably because copying of the final image values to the RAM is deferred. But "deferred" in TBDR means "deferred shading", not just "deferred memory copy". Adreno does not do deferred shading.


Many desktop chips with an integrated GPU have that GPU only for office use cases and debugging/ease of life. The recent Ryzen processors are a pretty extreme example of this, only including an extremely minimal GPU.

So there isn't that much value in the effort to benchmark them.

There are some exceptions for low-end gaming systems, some more office use cases and some AIO use cases, e.g. the G-series AMD processors like the 5700G. They tend to have GPUs noticeably faster than what Intel integrated graphics offers in the same generation, but also noticeably slower than dedicated graphics.


One is cache/memory bandwidth, the other "FP32 FMA Vector Throughput", isn't it? If so - not the same thing.


Nice, succinct 1-2 page article going into interesting technical details. As someone who's hardly touched graphics, I've always found GPUs to be magic, especially integrated ones, so it's nice to read digestible explanations of them.


An Intel Steam Deck 2 would be very interesting. I think they could make something very compelling in the "sustained 15W under gaming load" space.



