In most cases, e.g. with regular ML, evals are easy and not doing them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped. Not doing them will likely give you alight performance and at the same time proper benchmarks are tricky to implement.
In most cases, e.g. with regular ML, evals are easy and not doing them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped. Not doing them will likely give you alight performance and at the same time proper benchmarks are tricky to implement.