
All great points. A limitation with human feedback is that once you start asking for more than binary preferences (e.g. multiple rankings or written feedback), the quality of the feedback decreases. For instance, humans can often give a quick answer on preference, but when asked "why" they prefer one thing over the other, they might not be able to fully explain it in language. Collecting and incorporating the most useful types of feedback is very much an open area of research.

I definitely agree with your second point. One idea we're experimenting with is adding a human baseline, in which the models are benchmarked against human-generated designs as well.


Yes, exactly. We want to be a forcing function for better design models and agents.


We started out building a platform to one-shot games (single-player and multiplayer), but realized that the model used under the hood really made a difference in functionality and graphics. The benchmark began as an internal tool for seeing which model was best, but we found that benchmarking models on visual "taste" was something people were generally interested in.

