nuancedev's comments

We analyzed 7,877 responses from 250 models (same prompts, controlled conditions) and found interesting patterns that don't show up in standard benchmarks.

→ Name hallucination convergence: "Professor Chen" appears 279 times across unrelated creative writing prompts. 19/250 models independently chose "Sir Reginald" for a knight character.
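Convergence like this can be measured by counting how many models independently produce each character name. A minimal sketch, assuming each model's response is available as a string (the model keys and sample texts are illustrative, not from the study):

```python
from collections import Counter

# Hypothetical responses, keyed by model name
responses = {
    "model_a": "Professor Chen adjusted her glasses and began the lecture...",
    "model_b": "Professor Chen stared at the anomalous data...",
    "model_c": "Sir Reginald drew his sword at the castle gate...",
}

# Count how many models used each recurring character name
name_counts = Counter()
for text in responses.values():
    for name in ("Professor Chen", "Sir Reginald"):
        if name in text:
            name_counts[name] += 1
```

Run over the full corpus, a tally like this surfaces names that unrelated models converge on far more often than chance would predict.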

→ Provider fingerprinting via punctuation: Claude averages 5.90 em dashes per 1K words. Gemini: 9.18 exclamation marks. Mistral: 3.48 emoji. Distinct enough for classification.
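A minimal sketch of how per-1K-word punctuation rates like these could be computed as classification features; the helper name and emoji range are assumptions, not the study's actual pipeline:

```python
import re

def punctuation_rates(text: str) -> dict:
    """Stylometric marker counts normalized per 1,000 words (illustrative helper)."""
    words = max(len(text.split()), 1)
    per_1k = 1000 / words
    return {
        "em_dash": text.count("\u2014") * per_1k,
        "exclamation": text.count("!") * per_1k,
        # Rough emoji match over the main symbol/pictograph planes
        "emoji": len(re.findall(r"[\U0001F300-\U0001FAFF]", text)) * per_1k,
    }

sample = "Great question! Here's the plan \u2014 first, we test! \U0001F680"
rates = punctuation_rates(sample)
```

Feature vectors like this, one per response, can then be fed to any off-the-shelf classifier to predict the provider.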

→ Cultural reference bias: 59:1 Western to non-Western ratio. 1,069 Western references vs 18 non-Western across the full corpus.

→ Humor convergence: 42% of models open with the same atom joke. Format convergence is total: 100% produce setup-punchline one-liners; none attempt observational comedy.

→ Visual generation defaults: "Draw a surprise animal" produces a fox 40% of the time (DeepSeek: 67%, Llama: 0%).

Full write-up at the link (the complete research is gated).


Finished testing GPT-5 Pro: the quality difference versus any other SOTA model is minimal, yet the response time and cost are sky-high.

I'm still waiting for a generational leap that would freaking change the field. After testing 121 AI models, I can tell most give the same boring responses to some degree:

- All models do the pelican SVG challenge with the exact same POV

- All models give "5 unique jokes" that are extremely similar

- Models from different providers give similar responses to creative open-ended questions


A bit of a different, experimental tool I've built for macOS. This one helps you weigh decisions among 2-3 options across several factors with different weights to arrive at a logical conclusion. I'm in the process of building 26 open-source macOS apps, one for each letter of the alphabet.
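The weighting described above boils down to a classic weighted decision matrix. A minimal sketch; the factor names, weights, and ratings are illustrative, not the app's actual data:

```python
# Weights for each factor (here chosen to sum to 1)
factors = {"price": 0.5, "quality": 0.3, "speed": 0.2}

# Each option rated 0-10 on every factor (illustrative values)
options = {
    "Option A": {"price": 8, "quality": 6, "speed": 9},
    "Option B": {"price": 5, "quality": 9, "speed": 7},
}

# Weighted sum per option, then pick the highest scorer
scores = {
    name: sum(ratings[f] * w for f, w in factors.items())
    for name, ratings in options.items()
}
best = max(scores, key=scores.get)
```

The weighted sum keeps the conclusion transparent: changing a single weight immediately shows how sensitive the "logical conclusion" is to your priorities.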

