This version (the one using device='mps') already runs on the GPU and takes advantage of unified memory on M-series Macs.
MLX may have some additional micro-optimizations, but in general most people who have compared it against hand-written MPS-based training implementations haven't found significant speedups yet.
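For anyone who wants to check which device they actually end up on, here is a minimal sketch, assuming the code in question is PyTorch (which is where device='mps' comes from). The helper name `pick_device` is hypothetical and not part of any library; it just falls back to the CPU when the MPS backend isn't usable.

```python
def pick_device():
    """Return "mps" when PyTorch can use the Apple GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch we just report "cpu".
        return "cpu"
    # torch.backends.mps is missing on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed wherever a device is expected, e.g. `model.to(device)` or `torch.zeros(4, device=device)`.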