That’s true.
The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped (the duplicate format, so card luck cancels out; see the sketch below)
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
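For concreteness, here's a rough sketch of what that duplicate heads-up loop could look like in Python. Everything in it is hypothetical: `play_hand`, the model names, and the scoring are stand-ins, not the actual harness.

```python
import random
from itertools import combinations
from typing import Callable

# Hypothetical interface: play_hand(button_model, other_model, seed) returns
# the button model's chip delta for one hand; the seed fixes the shuffle, so
# both orientations of a deal see identical cards.
PlayHand = Callable[[str, str, int], float]

def duplicate_heads_up(model_a: str, model_b: str,
                       play_hand: PlayHand, num_hands: int = 10_000) -> float:
    """Mean chip delta per hand for model_a, with every deal replayed
    seats-swapped so the cards themselves cancel out."""
    total = 0.0
    for seed in range(num_hands):
        total += play_hand(model_a, model_b, seed)  # model_a on the button
        total -= play_hand(model_b, model_a, seed)  # same deal, seats swapped
    return total / (2 * num_hands)

def round_robin(models: list[str], play_hand: PlayHand) -> dict:
    """Strict heads-up: every pairing plays its own duplicate match."""
    return {(a, b): duplicate_heads_up(a, b, play_hand)
            for a, b in combinations(models, 2)}

# Toy stand-in so the sketch runs: a skill-free coin flip. Because the
# outcome depends only on the seed, the duplicate scoring nets to exactly
# zero, which is the variance cancellation the format is for.
if __name__ == "__main__":
    def fake_play_hand(button: str, other: str, seed: int) -> float:
        return random.Random(seed).choice([-1.0, 1.0])
    print(round_robin(["model-x", "model-y"], fake_play_hand))
```

Even with the luck netted out, you'd still need a variance estimate over the per-deal deltas before calling a winner, which is exactly the effort that makes a proper benchmark impractical here.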