That quantizes the gradient to compress communication between nodes, which is cool, but each node must still compute a 32-bit floating-point gradient locally. What GP is asking for is a way to avoid floating-point math entirely.
If you could implement training with only single-bit operations rather than floating point math, a hardware implementation could be several orders of magnitude faster and more efficient than current CPUs/GPUs. That would certainly usher in a revolution in computer architecture.
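To make the idea concrete, here's a sketch (my illustration, not from the thread) of the core trick behind binarized networks in the XNOR-Net vein: a dot product of two {-1, +1} vectors, bit-packed into machine words, reduces to XNOR plus popcount, with no floating point anywhere. The function names are mine.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int; bit i is set when signs[i] is +1."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Dot product of two length-n {-1,+1} vectors stored as bit-packed words."""
    mask = (1 << n) - 1
    # XNOR marks the positions where the two signs agree.
    agree = ~(a_word ^ b_word) & mask
    matches = bin(agree).count("1")  # popcount
    # Each agreement contributes +1, each disagreement -1.
    return 2 * matches - n

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
# Float dot product: 1 - 1 - 1 + 1 = 0
print(binary_dot(pack_bits(a), pack_bits(b), 4))  # -> 0
```

This is only the inference-side multiply; making the backward pass work without floats (the part GP wants) is the unsolved, harder half.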