This is useful when selecting a model for an initial application. The issue I'm more concerned about, though, is ongoing testing. At work we have devs slinging prompt changes left and right into prod after "it works on my machine" local testing, as if saying the word "AI" were enough to dispense with all engineering discipline.
Where is TDD for prompt engineering? Does it exist already?
This is a very good point. When I came in, the founder had been evaluating with a handful of prompts and manual inspection, exactly as described. Showing the results helped me underline that "works for me" (tm) often does not match the actual data.
In most cases, e.g. with regular ML, evals are easy to run and skipping them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped: skipping them will likely still give you alright performance, while proper benchmarks are tricky to implement.
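To the "TDD for prompts" question above: the closest thing I've seen is treating a small eval set as a regression suite that runs in CI before a prompt change ships. A minimal sketch in pytest (call_model() is a hypothetical stand-in for whatever client wrapper you already have, and the cases and assertions are made up for illustration):

```python
import pytest


def call_model(prompt: str) -> str:
    """Placeholder: replace with your actual LLM client call."""
    raise NotImplementedError


PROMPT_TEMPLATE = "Summarize this support ticket in one sentence:\n\n{ticket}"

# (input ticket text, substring the summary must contain) -- illustrative only
CASES = [
    ("Customer cannot log in after password reset", "log in"),
    ("Refund requested for duplicate charge on invoice #123", "refund"),
]


@pytest.mark.parametrize("ticket,expected_substring", CASES)
def test_prompt_regression(ticket, expected_substring):
    output = call_model(PROMPT_TEMPLATE.format(ticket=ticket))
    # Cheap, deterministic checks first: length bound and required content.
    assert len(output) < 300
    assert expected_substring.lower() in output.lower()
```

Deterministic checks like these catch the worst regressions cheaply; fuzzier quality criteria can sit on top as a separate scored eval rather than a hard pass/fail gate.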