Those benchmarks are absurdly tuned to the hardware. Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets on the same V100s. It's an interesting measurement of what experts can achieve when they modify their code for hardware they understand well, but it isn't useful beyond that.
> Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s.
These benchmarks measure the combination of hardware+software to solve a problem.
Google and NVIDIA are using the same hardware, but their software implementations differ.
---
The reason mlperf.org exists is to have a meaningful set of relevant practical ML problems that can be used to compare and improve hardware and software for ML.
For any piece of hardware, you can create an ML benchmark that's irrelevant in practice but performs much better on that hardware than on the competition. That's what we used to have before mlperf.org was a thing.