Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Quite honestly, that makes me less likely to believe your solution will work. Are you training an LLM to only obey instructions within a given context, or are you training it to recognize prompt injection and avoid it?

The former. During instruction tuning, the model learns to "predict" text as if the document describes a dialogue. We then just add examples where special quotes are present, including examples where the quotes contain instructions which are ignored.

Of course there is no proof of 100% reliability. It's like a browser. You can't prove that Firefox has no security flaws. In fact, it probably has a lot of hitherto undiscovered ones. But they usually get fixed in time. And it gets increasingly difficult to find new exploits.



> It's like a browser. You can't prove that Firefox has no security flaws.

I've seen this comparison come up a few times and I feel like it's really stretching tbh. Imagine if someone came out with an encryption algorithm, and somebody asked, "okay, but do we know that this is secure" and they said "how do we know anything is secure?" -- what would your response to that person be?

And sure, I don't know that Firefox is perfectly secure, but the defenses that Firefox has set up are built on deterministic security principles, not probabilistic security methods. When people break Firefox, they break it using novel attacks. That's not what happens with LLMs, it's the same category of attack working over and over again. So this feels like an attempt to broaden the fuzzy nature of general application security as if it means that we can accept fuzzy security for every defense at every layer.

But in general, we don't really do that. You don't accept an E2EE implementation that has a 95% chance of encrypting your data. Sure, someone might break the implementation, but if they do, it'll be because they did something new, not because they hit the refresh button 100 times in a row. If someone hacks your password to HN, it better be because they did something clever to get access to it, not because 1/100 login attempts the site logs you in even if the password is wrong.

And even if we're not talking about 100% reliability -- are there any examples of getting 99% reliability? Are there any examples of getting higher? We're talking about failure rates that are unacceptable for application security. If every time 100 people probed Firefox (and reminder, these are people with no security training) 1 of them was able to break the browser sandbox, we would all very rightly stop using Firefox.

I genuinely don't get this. I really don't like comparing prompt injection to SQL injection, I've had some conversations with other people where it's ended up confusing the issue. But fine, let's run that comparison too. 1/100 attempts to break an SQL sanitizer getting through is awful. We would correctly call an SQL sanitizer with that success rate broken.

And are there any examples of training getting an LLM to get to even that level of stability? Has anyone even gotten to the point where they've trained an LLM to not do something and they've been able to have that defense stand up against attackers for more than a couple of days? I've not seen an example of that.

It's not that people aren't able to fully prove that LLMs are secure, it's that they're being regularly proven to be insecure.

----

If that gets better in the future, then great. But sure seems like maybe we should put a pause on wiring them into critical applications until after it gets better.

If I pointed out that sites were regularly breaking the browser sandbox, and Mozilla said, "that'll very likely get better in the future", I would not keep using Firefox.

----

> The former. During instruction tuning, the model learns to "predict" text as if the document describes a dialogue. We then just add examples where special quotes are present, including examples where the quotes contain instructions which are ignored.

Well, that's demonstrable without doing full prompt injection training. Has anyone trained an LLM to respect special tokens for any context at all in a way where it can't be broken out of respecting those tokens?

That seems like training that would be pretty easy to demonstrate -- take existing training data, possibly around stuff like chat training (there are open data sets available I believe), mark up that dataset with special tokens, see if you can build a chat bot that's impossible to make stop acting like a chat bot or that refuses to respond to user queries that aren't wrapped in the token.

But nobody has demonstrated even something like that actually working.


Thanks for this - you're making really excellent arguments here.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: