Hacker News | naasking's comments

I think many humans engage in metacognitive reasoning, and that this might not be strongly represented in training data, so it probably isn't common in LLMs yet. They can still do it when prompted though.

LLMs have zero metacognition. Don't be fooled - their output is stochastic inference and they have no self-awareness. The best you'll see is an improvised post-hoc rationalization story.

You can turn all these arguments around and "prove" the same is true for humans. Don't be fooled by dogmatic people who spread the idea that the human mind is the pinnacle of cognition in the universe. Best to leave that to religion.

Humans may not always be that smart, but we do at least have an internal state and an awareness of that internal state - a "self-awareness".

AI most certainly has nothing of the sort, and any appearance to the contrary is the direct result of training data.


> The best you'll see is an improvised post-hoc rationalization story.

Funny, because "post-hoc rationalization" is how many neuroscientists think humans operate.

That LLMs are stochastic inference engines is obvious by construction, but you skipped the step where you proved that human thoughts, self-awareness and metacognition are not reducible to stochastic inference.


I'm not saying we don't do post-hoc rationalization. But self-awareness is a trait we possess to varying degrees, and reporting on a memory of a past internal state is at least sometimes possible, even if we don't always choose to do so.

> Conversely: in humans, intelligence is inversely correlated with crime.

Inversely correlated with crime that's caught and successfully prosecuted, you mean, because that's what makes up the stats on crime. I think people too often forget that we consider most criminals "dumb" because those who are caught are mostly dumb. Smart "criminals" either don't get caught or have made their unethical actions legal.


I'm curious whether frontier labs use any form of compression on their models to improve performance. The small accuracy drop from Q8 or FP8 would still leave it ahead of Opus, but should roughly double token throughput. Maybe then interactive use would feel like an improvement.
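To illustrate where that throughput gain would come from, here's a toy NumPy sketch of symmetric per-tensor INT8 quantization (not any lab's actual pipeline): halving the bytes per weight roughly halves the memory traffic during bandwidth-bound decode.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Map FP32 weights onto [-127, 127] with a single per-tensor scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
# q is 4x smaller than FP32 (2x smaller than FP16), at a small mean error.
```

Real deployments use finer-grained (per-channel or per-block) scales and formats like FP8, but the memory arithmetic is the same.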

I used GLM5 quite a bit, and I'd say it was maybe on par with Sonnet for most simple to medium tasks. Definitely not Opus though. I didn't test super long context tasks, and that's where I would expect it to break down. A recent study on software maintainability still showed Sonnet and Opus were peerless on that metric, although the GLM series has been making impressive gains.

Very interesting. I run Claude Code in VS Code, and unfortunately there doesn't seem to be an equivalent to "cli.js"; it's all bundled into the "claude.exe" I found under the VS Code extensions folder (confirmed via hex editor that the prompts are in there).

Edit: tried patching with revised strings of equivalent length informed by this gist, now we'll see how it goes!
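For anyone trying the same thing, the key constraint is that the replacement bytes must be exactly the same length as the original, so no file offsets shift. A sketch (the strings here are hypothetical placeholders, not the actual Claude Code prompts):

```python
def patch_equal_length(data: bytes, old: bytes, new: bytes) -> bytes:
    # Same-length replacement keeps every offset in the binary intact,
    # which is why patching an .exe this way doesn't corrupt it.
    if len(old) != len(new):
        raise ValueError("replacement must match original length")
    if old not in data:
        raise ValueError("original bytes not found")
    return data.replace(old, new)

blob = b"...You are a helpful assistant..."
patched = patch_equal_length(blob, b"helpful assistant", b"concise assistant")
```

If the revised string is shorter, pad it with spaces to the original length rather than truncating the buffer.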


They're a business. The alternative, to keep costs in check, would be to ask you for more money, and you'd likely be even more upset about that.

They are definitely that. Regardless of their approach, being upfront and transparent would have been nice. Bricking their own software that previously worked well for their customers isn't cool.

It's interesting that LLMs improve skills, especially on harder problems, just by practicing them. That's effectively what's going on.

> I only ask because I've been running local models (using Ollama) on my RX 7900 XTX for the last year and a half or so and haven't had a single problem that was ROCm specific that I can think of.

It's probably using the Vulkan backend, which is pretty stable and performs well.


Small models aren't entirely useless, and the NPU can run LLMs up to around 8B parameters from what I've seen. So one way they could be useful: Qwen3 text-to-speech models are all under 2B parameters, and OpenAI's whisper-small speech-to-text model is under 1B parameters. You could have an AI agent that you could talk to and that could talk back, where, in theory, all audio-to-text and text-to-audio processing is offloaded to the low-power NPU, leaving the GPU to do all of the LLM processing.
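The split described above might look something like this (the `run_on` dispatcher, device names, and model names are illustrative stand-ins, not a real runtime API):

```python
def run_on(device: str, model: str, payload: str) -> str:
    # Stand-in for dispatching to a real accelerator runtime
    # (e.g. an ONNX Runtime or vendor NPU SDK session).
    return f"{model}@{device}({payload})"

def voice_turn(audio: str) -> str:
    text = run_on("npu", "whisper-small", audio)   # <1B params: speech -> text
    reply = run_on("gpu", "qwen3-8b", text)        # LLM does the reasoning
    return run_on("npu", "qwen3-tts", reply)       # <2B params: text -> speech

out = voice_turn("wav:hello")
```

The point is the topology: both audio models fit NPU-class parameter budgets, so the GPU never stalls on audio work.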

That seems like a really niche use case, and probably not worth the surface area? The power savings would have to be truly astonishing to justify it, given what a small fraction of compute time the average device spends processing voice input. I'd wager the 90th-percentile Siri/OK Google/whatever user issues fewer than 10 voice queries per day. How much power can those use running on normal hardware, and how much could it possibly matter?

It's just an example where it fits perfectly, and it's exactly what something like Alexa or Google Home needs for low-power machine learning: e.g. when sitting idle, it needs to consume as little power as possible while waiting for a trigger word.

Any context that needs some limited intelligence while consuming little power would benefit from this.


You could always offload some layers to the NPU for lower power use and leave the rest to the GPU. If the latter is power-throttled (common during prefill, not decode), that will be a performance improvement.

Routing in a MoE model might fit.

You want routing to be as quick as possible, because there are dependent loads of expert MoE weights (at least from CPU in most setups, potentially from storage) downstream of it. So that ultimately depends on what the bottleneck on that part of the model is: compute, memory throughput or both? If it's throughput, the NPU might be a bad fit.
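A minimal top-k routing sketch (shapes and k are illustrative) shows why routing latency sits on the critical path: nothing downstream can start until the router says which expert weights to fetch.

```python
import numpy as np

def route(hidden: np.ndarray, router_w: np.ndarray, k: int = 2):
    # Router is a single matmul producing one logit per expert.
    logits = hidden @ router_w                   # shape: (n_experts,)
    top = np.argsort(logits)[-k:][::-1]          # expert ids, best first
    probs = np.exp(logits[top] - logits[top].max())
    return top, probs / probs.sum()              # ids + softmax gates

rng = np.random.default_rng(0)
h = rng.standard_normal(64)                      # one token's hidden state
w = rng.standard_normal((64, 8))                 # 8 experts
experts, gates = route(h, w)
# Only after `experts` is known can the expert weights be fetched
# (from CPU RAM or storage), then applied and mixed by `gates`.
```

If that fetch is memory- or storage-bound, making the tiny router matmul faster on an NPU buys little, which is the bottleneck question above.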

Yes, Vulkan is currently faster due to some ROCm regressions: https://github.com/ROCm/ROCm/issues/5805#issuecomment-414161...

ROCm should be faster in the end, if they ever fix those issues.

