I personally want to run linux and feel like I'll get a better price/GB offering that way. But, it is confusing to know how local models will actually work on those and the drawbacks of iGPU.
iGPUs are typically weak, and/or aren't capable of running the LLM so the CPU is used instead. You can run things this way, but it's not fast, and it gets slower as the models go up in size.
If you want things to run quickly, then aside from Macs, there's the 2025 ASUS ROG Flow Z13 which (afaik) is the only laptop with AMD's new Ryzen AI Max+ 395 processor. This is powerful and has up to 128GB of RAM that can be shared with the GPU, but they're very rare (and Mac-expensive) at the moment.
The other variable for running LLMs quickly is memory bandwidth; the Max+ 395 has 256GB/s, which is similar to the M4 Pro; the M4 Max chips are considerably higher. Apple landed on their feet on this one.
LLM evaluation on GPU and CPU is memory bandwidth constrained. The highest-end Apple machines are good for this because they have high memory bandwidth (~500GB/s) and up to ~128GB of it, not just because they can share that memory with the GPU (which any iGPU does). Most consumer machines are limited to two DDR5 channels (under ~100GB/s).
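A rough back-of-the-envelope for why bandwidth dominates: generating one token requires streaming (roughly) the entire model's weights through memory once, so decode speed is upper-bounded by bandwidth divided by model size. A minimal sketch, with illustrative model size and bandwidth numbers:

```python
# Rough decode-speed upper bound for a dense model:
# each token reads ~all weights once, so tok/s ≈ bandwidth / model size.
# Numbers below are illustrative, not benchmarks.

def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound estimate for dense-model token generation speed."""
    return bandwidth_gb_s / model_size_gb

# A ~70B-parameter model at 4-bit quantization is roughly 40GB of weights.
model_gb = 40.0

for name, bw in [("dual-channel DDR5", 90.0),
                 ("Ryzen AI Max+ 395", 256.0),
                 ("M4 Max", 546.0)]:
    print(f"{name}: ~{est_tokens_per_sec(bw, model_gb):.1f} tok/s")
```

Real-world numbers land below this bound (compute overhead, cache behavior), but it explains the gap between a standard desktop and an M4 Max at the same RAM capacity.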
> So a home workstation with 64GB+ of RAM could get similar results?
Similar in quality, but CPU generation will be slower than what Macs can do.
What you can do with MoEs (GLMs and Qwens) is to run some experts (the shared ones usually) on a GPU (even a 12GB/16GB will do) and the rest from RAM on CPU. That will speed things up considerably (especially prompt processing). If you're interested in this, look up llama.cpp and especially ik_llama, which is a fork dedicated to this kind of selective offloading of experts.
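As a sketch of what that offloading looks like in practice: llama.cpp's `--override-tensor` (`-ot`) flag maps tensor-name regexes to devices, so you can offload all layers to the GPU and then pin the large per-expert FFN tensors back to CPU RAM. The model filename below is a placeholder, and the expert tensor names follow the common GGUF convention, so verify them against your actual model:

```shell
# Offload everything to GPU (-ngl 99), then override the routed-expert
# FFN tensors (the bulk of a MoE's weights) to stay in CPU RAM.
# Attention and shared tensors remain on the GPU, which is what speeds
# up prompt processing.
./llama-server \
  -m your-moe-model-Q4_K_M.gguf \
  -ngl 99 \
  -ot ".ffn_(up|down|gate)_exps.=CPU" \
  --ctx-size 16384
```

ik_llama exposes finer-grained variants of this, but the idea is the same: keep the dense, always-active tensors on the fast device and let the sparsely-activated experts live in system RAM.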
You can run it, it will just run on the CPU and will be pretty slow. Macs, like everyone in this thread said, use unified memory, so the 64GB is shared between CPU and GPU, while for you it's just 64GB for the CPU.