Hacker News | nomel's comments

> but the published data immediately seems to admit that this is a bad choice of unit because it costs a lot more to output a token than input one

And that's silly, because API pricing is more expensive for output than input tokens: 5x for Anthropic [1] and 6x for OpenAI [2]!

[1] https://platform.claude.com/docs/en/about-claude/pricing

[2] https://openai.com/api/pricing


I think, for the same model, wall time is probably a more intuitive metric; at the end of the day what you're doing is renting GPU time slices.

Large outputs dominate compute time, so they are more expensive.

IMO input and output token counts are actually still a bad metric, since they linearise non-linear cost increases. I suspect we'll see another change in the future where providers bucket by context length; XL output contexts may be 20x more expensive instead of 10x.
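As a rough illustration of why the input/output split matters for cost, here is a toy cost model. The per-million rates and the function name are assumptions for the example, not any provider's actual prices:

```python
# Illustrative cost model: per-million-token rates are hypothetical,
# chosen only to reflect the ~5x output premium discussed above.
INPUT_RATE = 3.00    # $ per 1M input tokens (assumed)
OUTPUT_RATE = 15.00  # $ per 1M output tokens (assumed, 5x input)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call under the assumed rates."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

# A summarization-style call: big input, small output.
summarize = request_cost(input_tokens=50_000, output_tokens=1_000)
# A generation-style call: small input, big output.
generate = request_cost(input_tokens=1_000, output_tokens=50_000)

print(f"summarize: ${summarize:.3f}")  # cheap despite 50k tokens total
print(f"generate:  ${generate:.3f}")   # ~4.5x more, same total tokens
```

The two calls move the same 51k tokens, but the output-heavy one costs several times more, which is why "tokens" alone is a poor unit.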


They already bucket when context goes above 200k

No longer

Alternatively, if we don't become a multi-planetary species, we will be exterminated by a meteor. There's enough excess to do a bit of species saving multi-tasking.

For an alternate perspective, development of this (which includes future launches) cost only 80% as much as ~500 miles of railway in California! [1]

[1] https://www.kabc.com/2026/04/06/high-speed-rail-cost-now-at-...


It's unlikely we'll be exterminated by a meteor for many millions of years. Life as we know it is already facing existential risks here and now.

Doubtful we'll ever establish permanent residency anywhere else when we cannot sustain ourselves on the rock we evolved on.


Which was before GUIs of any complexity were possible. There was no alternative at the time.

Relatedly, see the insane success of and excitement around the early GUI-based operating systems.


These days? There was a time before graphical user interfaces existed/were possible.

1979: https://en.wikipedia.org/wiki/VisiCalc


In-person proctored exams, with individually randomized questions from a large pool and written answers completed during the test, as required for state certifications, are probably the only answer.

The popular way to get around video-chat proctoring is to physically attach notes to your screen, so when you sweep the room with the built-in camera, it doesn't see anything.
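The "individually randomized questions from a large pool" idea can be sketched by deriving a stable per-student seed, so each student gets a distinct but reproducible paper. `exam_for` and the pool contents are hypothetical names for illustration:

```python
import hashlib
import random

def exam_for(student_id: str, pool: list[str], n: int = 5) -> list[str]:
    # Stable per-student seed: sha256 of the id. Unlike hash(), this
    # survives process restarts, so the paper can be regenerated for grading.
    digest = hashlib.sha256(student_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).sample(pool, n)

pool = [f"Q{i}" for i in range(1, 41)]  # 40-question pool
print(exam_for("student-042", pool))    # same 5 questions every run
```

Each student's paper is deterministic given their id, but neighboring students almost certainly see different questions, which blunts copying.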


This works in theory; I wonder if it's too resource-intensive to actually be feasible, though. You can't proctor work done at home, and you can't trust the parents, so you'd need "homework centers", which sounds like a nightmare. Or only administer these during class hours?

Yeah, it would only make sense for in class exams, rather than coursework, and with exams being the majority of the grade.

Back in high school, this is how the state exams were performed. We had an external proctor come in.


> You go to school to learn.

This is not the mindset of very many people. They go to school because it's a requirement to get a job.

Talk to someone in college, or especially a trade school, and you'll see that the overwhelming majority are cheating, especially those from lower trust cultures. I work at a FAANG and, in casual conversation, many of my colleagues admitted to cheating with a dismissive "everyone does it".


You guys are getting jobs?

> There's little I can do

I use https://hackersmacker.org to mark them with a red dot so I can skip those comments. It's like slashdot's friend/foe system (including social aspect). There are also plugins that allow blocking and filtering users.

I'm, personally, here to interact with people interested in tech. I feel no shame in curating my experience to fit that. If I wanted to be saturated in politics, I would make a Reddit account.



I think they mean performance on the same, rational task.

Measuring "degradation" on a nonsense task like the one you gave would be difficult.


Their point (and it's a good one) is that there are non-obvious analogues to the obvious case of just telling it to do the task terribly. There is no 'best' way to specify a task that you can label 'rational', all others be damned. Even if one is found empirically, it changes from model to model to harness to whatever.

To clarify, consider this gradation:

> Do task X extremely well

> Do task X poorly

> Do task X or else Y will happen

> Do task X and you get a trillion dollars

> Do task X and talk like a caveman

Do you see the problem? "Do task X" also cannot be a solid baseline, because there are any number of ways to specify the task itself, and they all carry their own implicit biasing of the track the output takes.

The argument OP makes is that RL prevents degradation... so this should not be a problem? All prompts should be equivalent? Except it obviously is a problem, prompting does affect the output (how can it not?), *and they even claim their specific prompting does, too*! The claim is nonsense on its face.

If the caveman style modifier improves output, removing it degrades output and what is claimed plainly isn't the case. Parent is right.

If it worsens output, the claim they made is again plainly not the case (via inverted but equivalent construction). Parent is right.

If it has no effect, it runs counter to their central premise and the research they cite in support of it (which only potentially applies: they study 'be concise', not 'a skill full of caveman styling rules'). Parent is right.


I take the opposite approach for this sort of thing, since I would much rather flip and remove the stones: I explicitly randomize the order of containers during development and testing, and always in my unit tests, so depending on order can't be a problem. No luck required!
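A minimal sketch of the randomize-in-tests approach, assuming Python and a stand-in `process` function for the code under test; the failing seed is included in the assertion message so any order-dependent failure can be reproduced:

```python
import random

def process(items):
    # Stand-in for code under test: its result must not depend on input order.
    return sorted(items)

def check_order_independence(trials: int = 20):
    """Shuffle the input with a fresh, recorded seed each trial."""
    baseline = process(list(range(100)))
    for _ in range(trials):
        seed = random.randrange(2**32)
        shuffled = list(range(100))
        random.Random(seed).shuffle(shuffled)
        # On failure, re-run with this seed to reproduce exactly.
        assert process(shuffled) == baseline, f"order-dependent! seed={seed}"

check_order_independence()
```

Shuffling with a throwaway `random.Random(seed)` rather than the global RNG keeps each trial reproducible without perturbing other tests.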

You want both. More specifically, you want to be in control of which one you're actually doing.

Randomization is great at avoiding erroneous dependencies on spurious cause-and-effect chains. Determinism is needed to ensure the cause-and-effect chains that are core to the problem actually work.


I don't understand.

Determinism isn't required unless it's required.

If it's not required, then you must plan for it NOT being deterministic, with any accidental determinism being ignored (or, to be safe, forcefully broken with intentional randomization/delays within the library). If it is required, then my random input should always (from the test's perspective) come out the same as I put it in.

If possible, force the corner case if the corner case is a concern. That's the purpose of testing. If there's a concern with timing, force bad timing with random delays. The alternative is relying on luck. I try to make my code as unlucky as possible, during development/testing.

