
It’s an impressive model, but why would OpenAI need to do that?


I guess they want to know how good it is as a chatbot and no one has found a better benchmark than lmsys arena.


Altman said on the latest Lex Fridman podcast that OpenAI has consistently received feedback that their releases "shock the world", and that they'd like to fix that.

I think releasing to this 3rd party so the internet can start chattering about it and discovering new functionality several months before an official release aligns with that goal of drip-feeding society incremental updates instead of big new releases.


They did the same with GPT-4: they sat on it for months, not knowing how to release it. They ended up releasing GPT-3.5, then released 4 quietly after nerfing 3.5 into a turbo.

OpenAI sucks at naming though. GPT2 now? Their specific gpt-4-0314 etc. model naming was also a mess.


> OpenAI sucks at naming though. GPT2 now?

Maybe they got help from Microsoft?


At this moment, there's no real world benchmark at scale other than lmsys. All other "benchmarks" are merely sanity checks.
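For what it's worth, arena-style leaderboards like lmsys's are generally built from exactly those blind pairwise votes, with something like an Elo update applied per battle. A rough sketch of the idea in Python (the function name, K-factor, and starting rating here are illustrative choices of mine, not lmsys's actual code):

    # Toy Elo-style update over blind pairwise votes, as used by
    # arena-style leaderboards. Hyperparameters are illustrative.
    def update_elo(ratings, model_a, model_b, winner, k=32, scale=400, base=1000):
        ra = ratings.get(model_a, base)
        rb = ratings.get(model_b, base)
        # Expected score of model_a against model_b
        expected_a = 1 / (1 + 10 ** ((rb - ra) / scale))
        # Actual score: 1 for an "a" win, 0 for a "b" win, 0.5 for a tie
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1 - score_a) - (1 - expected_a))

    ratings = {}
    update_elo(ratings, "gpt2-chatbot", "llama-3-70b", "a")
    update_elo(ratings, "claude-3-sonnet", "gpt2-chatbot", "tie")
    print(sorted(ratings.items(), key=lambda kv: -kv[1]))

The only signal is which answer a voter preferred, which is also why the bias concern below matters.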


OpenAI could either hire private testers or run A/B tests on ChatGPT Plus users (for example, when using ChatGPT I often have to select between two different responses to continue a conversation). Both are probably much better than putting a model called 'GPT2' onto lmsys, in several respects: they don't leak GPT-4.5/5 generations (or the existence of a GPT-4.5/5) to the public at scale, and they avoid bias*, because people probably rate GPT-4 generations higher if they are told, explicitly or implicitly (e.g. socially), that they're from GPT-5.

* While lmsys does hide model names until a person decides which model generated the better text, people can still figure out (or make a good guess about) which language model generated a piece of text** without explicit knowledge, especially if that model is hyped up online as 'GPT-5'; even a subconscious "this text sounds like what I have seen 'gpt2-chatbot' generate online" may influence results inadvertently.

** ... though I will note that I just got a generation from 'gpt2-chatbot' that I thought was from Claude 3 (haiku/sonnet), and its competitor was LLaMa-3-70b (I thought it was 8b or Mixtral). I am obviously not good at LLM authorship attribution.


For the average person using lmsys, there is no benefit in simply voting for your favorite model. Even if you want to stick with your favorite model, choosing a competitor's better answer still improves the dataset your favorite model benefits from.

The only case where detecting a model makes any difference is for vendors who want to boost their own model by hiring people and paying them every time they select the vendor's model.
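To make that concrete, each arena vote essentially becomes a (prompt, chosen response, rejected response) record, and that kind of preference data is useful for tuning any model, not just the one that won. Something like this (field names are purely illustrative, not lmsys's actual schema):

    from dataclasses import dataclass

    @dataclass
    class PreferencePair:
        # One blind arena vote turned into a preference-tuning example
        # (e.g. for reward modelling or DPO). Schema is illustrative.
        prompt: str
        chosen: str          # response the voter preferred
        rejected: str        # the other response
        chosen_model: str
        rejected_model: str

    pair = PreferencePair(
        prompt="Explain Elo ratings in one paragraph.",
        chosen="<preferred answer text>",
        rejected="<losing answer text>",
        chosen_model="model-x",   # hypothetical names
        rejected_model="model-y",
    )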



