
lorey · 2026-01-22T13:00:40 1769086840

That is not what the article argues.

lorey · 2026-01-22T13:00:00 1769086800

Haha, very true. Exactly as described in the article.

lorey · 2026-01-22T12:58:59 1769086739

This is true with one caveat.

In most cases, e.g. with regular ML, evals are easy and not doing them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped. Not doing them will likely give you alright performance, and at the same time proper benchmarks are tricky to implement.

lorey · 2026-01-22T12:55:01 1769086501

This is a very good point. When I came in, the founder evaluated mostly by hand, based on a few prompts, exactly as described. Showing the results helped me underline the fact that "works for me" (tm) does not match the actual data in many cases.

lorey · 2026-01-21T21:35:56 1769031356

Doesn't this depend a lot on private vs. company usage? There's no way I could spend more than a few hundred on my own, but when you run prompts on 1M entities in some corporate use case, this will incur costs, no matter how cheap the model usage.

lorey · 2026-01-21T18:48:39 1769021319

It's not you, it's the HN hug of death. There's so much load on the server, I'm barely able to download the redis image I need for caching...

lorey · 2026-01-21T18:44:17 1769021057

Thanks. Will take a look.

lorey · 2026-01-21T18:39:19 1769020759

Depends on your remaining budget ;)

dizhn · 2026-01-21T22:14:06 1769033646

That is absolutely right. :)

lorey · 2026-01-21T18:24:08 1769019848

I've skipped that in the article, but absolutely!

lorey · 2026-01-21T18:06:42 1769018802

Fixed, thanks. Not a native speaker.