Funnily enough, what the article describes is almost the perfect case for testing: "We wrote the first scoring algorithm...based on the red-green light." That sounds like some heuristic weighting, which types couldn't have solved, but reasonable tests would have shown whether the new algorithm weighted special cases higher or lower than it should have.
The problem with a heuristic weight, though, is that it's a heuristic, judged against other heuristics by taste, not proof.
The obvious testing approach, ensuring that the score for each test case retains the same order as you tweak the algorithm, is overtesting. You don't care about that total order; more likely you care about the ordering of classes of things rather than the ordering within those classes, or simply that the 'likely' cases come out ahead. Hence you hit far too many test failures.
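To make that concrete, here's a rough sketch with a toy heuristic and made-up data (none of it from the article), showing the brittle total-order test next to the looser class-level one:

```python
WEIGHT = 10  # tweak this, e.g. to 0.5, to watch the brittle test break

def score(case):
    # Toy heuristic: reds count against you, greens slightly for you.
    return WEIGHT * case["reds"] - case["greens"]

likely = [{"reds": 3, "greens": 1}, {"reds": 2, "greens": 0}]
unlikely = [{"reds": 0, "greens": 2}, {"reds": 0, "greens": 5}]

# Brittle: pins the exact total order of every case. Dropping WEIGHT
# to 0.5 swaps the two "likely" cases and fails this test, even though
# nothing that matters has changed.
def test_total_order():
    scores = [score(c) for c in likely + unlikely]
    assert scores == sorted(scores, reverse=True)

# Looser: only asserts that every "likely" case outscores every
# "unlikely" one; ordering within each class is left free.
def test_class_order():
    assert min(score(c) for c in likely) > max(score(c) for c in unlikely)
```

The second test survives retuning the weights; the first one turns every tweak into a red build.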
I'd agree that it's possible to overtest in general, but it's so easy to overtest heuristics that it needs to be called out as a special case, and it sounds like that was the problem here.