We analyzed 7,877 responses from 250 models (same prompts, controlled conditions) and found interesting patterns that don't show up in standard benchmarks.
→ Name hallucination convergence: "Professor Chen" appears 279 times across unrelated creative writing prompts. 19/250 models independently chose "Sir Reginald" for a knight character.
→ Provider fingerprinting via punctuation: Claude averages 5.90 em dashes per 1K words. Gemini: 9.18 exclamation marks. Mistral: 3.48 emoji. Distinct enough for classification.
→ Cultural reference bias: 59:1 Western to non-Western ratio. 1,069 Western references vs 18 non-Western across the full corpus.
→ Humor convergence: 42% of models open with the same atom joke. Format convergence is total: 100% produce setup-punchline one-liners. None attempt observational comedy.
→ Visual generation defaults: "Draw a surprise animal" produces a fox 40% of the time (DeepSeek: 67%, Llama: 0%).
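The punctuation-fingerprinting finding can be sketched as a nearest-centroid classifier over per-1K-word rates. Only the three rates quoted above (Claude em dashes, Gemini exclamation marks, Mistral emoji) come from the post; the remaining centroid values are made-up placeholders for illustration:

```python
import re

# Toy centroids: the 5.90 / 9.18 / 3.48 figures are from the findings above;
# every other number is a hypothetical filler value, not measured data.
CENTROIDS = {
    "claude":  {"em_dash": 5.90, "exclaim": 0.5,  "emoji": 0.1},
    "gemini":  {"em_dash": 1.0,  "exclaim": 9.18, "emoji": 0.2},
    "mistral": {"em_dash": 1.0,  "exclaim": 1.0,  "emoji": 3.48},
}

# Rough emoji matcher covering the main symbol/emoji Unicode blocks.
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def features(text: str) -> dict:
    """Punctuation counts normalized to a per-1K-word rate."""
    words = max(len(text.split()), 1)
    per_1k = 1000 / words
    return {
        "em_dash": text.count("\u2014") * per_1k,
        "exclaim": text.count("!") * per_1k,
        "emoji": len(EMOJI.findall(text)) * per_1k,
    }

def classify(text: str) -> str:
    """Assign the provider whose centroid is closest (squared Euclidean)."""
    f = features(text)
    return min(
        CENTROIDS,
        key=lambda m: sum((f[k] - CENTROIDS[m][k]) ** 2 for k in f),
    )
```

With real per-provider rate vectors this is enough for the classification the post describes; a production version would use more features and calibrated distances.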
Full writeup in the link (the complete research is gated)
Finished testing GPT-5 Pro: the quality difference versus any other SOTA model is minimal, yet the response time and cost are sky-high.
I'm still waiting for a generational leap that would freaking change the field. After testing 121 AI models, I can say that, to a degree, most give the same boring responses:
- All models do the pelican SVG challenge with the exact same POV
- All models give "5 unique jokes" that are extremely similar
- Models from different providers give similar responses to creative open-ended questions
A slightly different, experimental tool I've built for macOS. This one helps you weigh a decision among 2-3 options across several factors with different weights to arrive at a logical conclusion. I'm in the process of building 26 open-source macOS apps, one for each letter of the alphabet.
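The core idea, a weighted decision matrix, fits in a few lines. This is a minimal sketch of that scoring logic, not the app's actual implementation; all option names, factors, and numbers below are made up for illustration:

```python
def decide(options, weights, scores):
    """Weighted-sum decision matrix.

    options: list of option names
    weights: {factor: weight}, weights ideally summing to 1
    scores:  {option: {factor: rating}}
    Returns (best option, {option: total score}).
    """
    totals = {
        opt: sum(weights[f] * scores[opt][f] for f in weights)
        for opt in options
    }
    return max(totals, key=totals.get), totals

# Hypothetical example: choosing between two laptops.
best, totals = decide(
    ["Laptop A", "Laptop B"],
    {"price": 0.5, "battery": 0.3, "weight": 0.2},
    {
        "Laptop A": {"price": 7, "battery": 5, "weight": 8},
        "Laptop B": {"price": 6, "battery": 9, "weight": 6},
    },
)
```

Each option's total is the weight-scaled sum of its factor ratings, so the heavily weighted factors dominate the outcome, which is exactly the "factors of different weights" behavior the tool provides.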