
It’s an impressive model, but why would OpenAI need to do that?


I guess they want to know how good it is as a chatbot and no one has found a better benchmark than lmsys arena.


Altman said on the latest Lex Fridman podcast that OpenAI has consistently received feedback that their releases "shock the world", and that they'd like to fix that.

I think releasing to this 3rd party so the internet can start chattering about it and discovering new functionality several months before an official release aligns with that goal of drip-feeding society incremental updates instead of big new releases.


They did the same with GPT-4: they sat on it for months, not knowing how to release it. They ended up releasing GPT-3.5, then released 4 quietly after nerfing 3.5 into a turbo.

OpenAI sucks at naming though. GPT2 now? Their specific gpt-4-0314 etc. model naming was also a mess.


> OpenAI sucks at naming though. GPT2 now?

Maybe they got help from Microsoft?


At this moment, there's no real world benchmark at scale other than lmsys. All other "benchmarks" are merely sanity checks.
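For what it's worth, arena-style leaderboards like lmsys's are generally built from exactly those blind pairwise votes, with something like an Elo update applied per battle. A rough sketch of the idea in Python (the function name, K-factor, and starting rating here are illustrative choices of mine, not lmsys's actual code):

    # Toy Elo-style update over blind pairwise votes, as used by
    # arena-style leaderboards. Hyperparameters are illustrative.
    def update_elo(ratings, model_a, model_b, winner, k=32, scale=400, base=1000):
        ra = ratings.get(model_a, base)
        rb = ratings.get(model_b, base)
        # Expected score of model_a against model_b
        expected_a = 1 / (1 + 10 ** ((rb - ra) / scale))
        # Actual score: 1 for an "a" win, 0 for a "b" win, 0.5 for a tie
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1 - score_a) - (1 - expected_a))

    ratings = {}
    update_elo(ratings, "gpt2-chatbot", "llama-3-70b", "a")
    update_elo(ratings, "claude-3-sonnet", "gpt2-chatbot", "tie")
    print(sorted(ratings.items(), key=lambda kv: -kv[1]))

The only signal is which answer a voter preferred, which is also why the bias concern below matters.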


OpenAI could either hire private testers or run A/B tests on ChatGPT Plus users (for example, when using ChatGPT I often have to select between two different responses to continue a conversation). Both are probably much better than putting a model called 'GPT2' onto lmsys, in several respects: they don't leak GPT-4.5/5 generations (or the existence of a GPT-4.5/5) to the public at scale, and they avoid bias*, because people probably rate GPT-4 generations higher if they are told, explicitly or implicitly (e.g. socially), that they're from GPT-5.

* While lmsys does hide model names until a person decides which model generated the better text, people can still figure out (or make a good guess about) which language model generated a piece of text** without explicit knowledge, especially if that model is hyped up online as 'GPT-5'; even a subconscious "this text sounds like what I have seen 'gpt2-chatbot' generate online" may influence results inadvertently.

** ... though I will note that I just got a generation from 'gpt2-chatbot' that I thought was from Claude 3 (haiku/sonnet), and its competitor was LLaMa-3-70b (I thought it was 8b or Mixtral). I am obviously not good at LLM authorship attribution.


For the average person using lmsys, there is no benefit in simply voting for your favorite model. Even if you want to stick with your favorite model, choosing a competitor's better answer still improves the dataset your favorite model benefits from.

The only case where detecting a model makes any difference is for vendors who want to boost their own model by hiring people and paying them every time they select the vendor's model.
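To make that concrete, each arena vote essentially becomes a (prompt, chosen response, rejected response) record, and that kind of preference data is useful for tuning any model, not just the one that won. Something like this (field names are purely illustrative, not lmsys's actual schema):

    from dataclasses import dataclass

    @dataclass
    class PreferencePair:
        # One blind arena vote turned into a preference-tuning example
        # (e.g. for reward modelling or DPO). Schema is illustrative.
        prompt: str
        chosen: str          # response the voter preferred
        rejected: str        # the other response
        chosen_model: str
        rejected_model: str

    pair = PreferencePair(
        prompt="Explain Elo ratings in one paragraph.",
        chosen="<preferred answer text>",
        rejected="<losing answer text>",
        chosen_model="model-x",   # hypothetical names
        rejected_model="model-y",
    )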



