Isharmla's comments

Isharmla · 2025-09-12T07:16:06 1757661366

Nice visualization!

By the way, some of the results look a little weird to me, like the one for the 'Long Neck' prompt. Seedream's giraffe just lowered its head, but its neck didn't shorten as expected. I'd like to learn more about the evaluation process, especially whether it is automatic or manual.

vunderba · 2025-09-12T07:48:13 1757663293

Hi Isharmla, the giraffe one was a tough call. IMHO, even when correcting for perspective, I do feel like it managed to follow the directive of the prompt and shorten the neck.

To answer your question, all of the evaluations are performed manually. On the trickier results I'll occasionally conscript some friends to get a group evaluation.

The bottom section of the site has an FAQ that gives more detail. I'll include it here:

It's hard to define a discrete rubric for grading at an inherently qualitative level. To keep things simple, this test is purely PASS/FAIL - unsuccessful means that the model NEVER managed to generate an image adhering to the prompt.

In many cases, we attempt a generous interpretation of the prompt - if it gets close enough, we might consider it a pass.

To paraphrase former Supreme Court Justice Potter Stewart, "I may not be able to define a passing image, but I know it when I see it."
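As a rough illustration of that PASS/FAIL rule, here is a minimal Python sketch. The `Attempt` record, its fields, and `grade_prompt` are hypothetical names made up for this example; the verdict on each image still comes from a person, and the code only shows how a single passing attempt is enough for the prompt to pass.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One generated image plus a human reviewer's verdict (hypothetical record)."""
    image_id: str
    reviewer_says_pass: bool  # decided manually by a person, never computed


def grade_prompt(attempts: list[Attempt]) -> str:
    """Return "PASS" if any attempt was judged acceptable, "FAIL" only if none were."""
    return "PASS" if any(a.reviewer_says_pass for a in attempts) else "FAIL"


# Example: two misses followed by one acceptable image still counts as a pass.
attempts = [
    Attempt("giraffe-01", False),
    Attempt("giraffe-02", False),
    Attempt("giraffe-03", True),
]
print(grade_prompt(attempts))
```

Under this reading, a model is marked unsuccessful only when every attempt in the list was rejected.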

Isharmla · 2025-09-04T06:19:38 1756966778

> The human brain is constrained in size by the width of the female pelvis.

https://en.wikipedia.org/wiki/Obstetrical_dilemma

And the width of the pelvis is, in turn, constrained by bipedal locomotion.

Isharmla · 2025-08-25T09:19:04 1756113544

> We have a list of the ones we’ve seen (will post eventually)

I'd like to see whether LLMs use passwords like 123456.

Isharmla · 2025-08-13T02:29:03 1755052143

Cool! I'm working on a similar personal tool for GeminiCLI.