That quantizes the gradient to compress communication between nodes, which is cool, but each node must still compute a 32-bit floating-point gradient locally. What GP is asking for is a way to avoid floating-point math entirely.
If you could implement training with only single-bit operations rather than floating point math, a hardware implementation could be several orders of magnitude faster and more efficient than current CPUs/GPUs. That would certainly usher in a revolution in computer architecture.
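To make the idea concrete, here's a sketch (my illustration, not from the thread) of the core trick behind binarized networks in the XNOR-Net vein: a dot product of two {-1, +1} vectors, bit-packed into machine words, reduces to XNOR plus popcount, with no floating point anywhere. The function names are mine.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int; bit i is set when signs[i] is +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two length-n {-1,+1} vectors stored as bit-packed words."""
    mask = (1 << n) - 1
    # XNOR marks the positions where the two signs agree.
    agree = ~(a_word ^ b_word) & mask
    matches = bin(agree).count("1")  # popcount
    # Each agreement contributes +1, each disagreement -1.
    return 2 * matches - n

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
# Float dot product: 1 - 1 - 1 + 1 = 0
print(binary_dot(pack_bits(a), pack_bits(b), 4))  # -> 0
```

This is only the inference-side multiply; making the backward pass work without floats (the part GP wants) is the unsolved, harder half.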