Hacker News

> We don’t have apples-to-apples benchmarks

We do: https://mlperf.org/

Just run their benchmarks. Submitting your results is a bit more involved, because all published results are verified by independent entities.

If you feel like your AI use case is not well represented by any of the MLPerf benchmarks, open a discussion thread about it, propose a new benchmark, etc.

The set of benchmarks there increases all the time to cover new applications. For example, on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.



Those benchmarks are absurdly tuned to the hardware. Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s. It's an interesting measurement of what experts can achieve when they modify their code to run on the hardware they understand well, but it isn't useful beyond that.


> Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s.

These benchmarks measure the combination of hardware+software to solve a problem.

Google and NVIDIA are using the same hardware, but their software implementation is different.

---

The reason mlperf.org exists is to have a meaningful set of relevant practical ML problems that can be used to compare and improve hardware and software for ML.

For any piece of hardware, you can create an ML benchmark that's irrelevant in practice but performs much better on that hardware than on the competition. That's what we used to have before mlperf.org was a thing.

We shouldn't go back there.


> on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.

I think the challenge is selecting the tests that best represent typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like TensorFlow. One of the problems I see is that the optimizer/codegen of the toolchain is a key component: the M1 has both a GPU and a Neural Engine, and we don't know which accelerator is targeted, or whether both are. Should we benchmark Create ML on the M1 vs the A14 or A12X? Perhaps it is my ignorance, but I don't think we are at a point where our existing benchmarks can be applied meaningfully to the M1, though I'm sure we will get there soon.
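Whatever toolchain and accelerator end up being compared, the measurement side is the easy part to pin down. Below is a minimal, toolchain-agnostic step-timing sketch (not any MLPerf reference harness); `train_step`, `benchmark`, and the parameter names are hypothetical stand-ins for whatever the framework actually executes on the GPU or Neural Engine:

```python
# Illustrative sketch of a step-timing harness; all names here are
# hypothetical, not part of MLPerf or any framework's API.
import time
import statistics

def benchmark(train_step, batch_size, warmup=3, steps=10):
    """Time `train_step` and return median throughput in samples/sec."""
    for _ in range(warmup):              # let caches/compilation settle
        train_step()
    durations = []
    for _ in range(steps):
        t0 = time.perf_counter()
        train_step()                     # one training step on the device under test
        durations.append(time.perf_counter() - t0)
    return batch_size / statistics.median(durations)

# Dummy step that just sleeps ~10 ms, standing in for real work:
throughput = benchmark(lambda: time.sleep(0.01), batch_size=32)
print(f"{throughput:.0f} samples/sec")
```

Warmup matters here: on the M1 the first steps may include shader or graph compilation, so timing them would understate steady-state throughput.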


> The challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like TensorFlow.

The benchmarks there are actual applications of ML that people use to solve real-world problems. To get a benchmark accepted, you need to argue and convince people that the problem the benchmark solves is one a lot of people need to solve, and that solving it burns enough cycles worldwide to be worth designing ML hardware and software around.

The hardware and software then get developed to make solving these problems fast, which in turn makes real-world applications of ML fast.

Suggesting that the M1 is a solution, and that now we just need to find a good problem it solves well and add that as a benchmark, is the opposite of how mlperf works; hardware vendors suggesting exactly that is the reason mlperf exists. We already have common ML problems that a lot of people need to solve. Either the M1 is good at those or it isn't. If it isn't, it should become better at them. Being better at problems people don't want or need to solve does not help anybody.




