It's not quite so trivial to implement this solution. SL instruction tuning actually needs a lot of examples, and only recently there have been approaches to automate this, like WizardLM: https://github.com/nlpxucan/WizardLM
To try my solution, this would have to be adapted to more complex training examples involving quoted text with prompt injection attempts.
Similar points holds for RL. I actually think it is much more clean to solve it during instruction tuning, but perhaps we also need some RL. This normally requires training a reward model with large amounts of human feedback. Alternative approaches like Constitutional AI would first have to be adapted to cover quotes with prompt injection attacks.
Probably doable, but takes some time and effort, all the while prompt injection doesn't seem to be a big practical issue currently.
> To try my solution, this would have to be adapted to more complex training examples involving quoted text with prompt injection attempts.
Quite honestly, that makes me less likely to believe your solution will work. Are you training an LLM to only obey instructions within a given context, or are you training it to recognize prompt injection and avoid it? Because even if the first is possible, the second is probably a lot harder.
Let's get more basic though. Whether you're doing instruction tuning or reinforcement training or constitutional training, are there any examples of any of these mechanisms getting 100% consistency in blocking any behavior?
I can't personally think of one. Surely the baseline here before we even start talking about prompt injection is: is there any proof that you can train an LLM to predictably and fully reliably block anything at all?
> Quite honestly, that makes me less likely to believe your solution will work. Are you training an LLM to only obey instructions within a given context, or are you training it to recognize prompt injection and avoid it?
The former. During instruction tuning, the model learns to "predict" text as if the document describes a dialogue. We then just add examples where special quotes are present, including examples where the quotes contain instructions which are ignored.
Of course there is no proof of 100% reliability. It's like a browser. You can't prove that Firefox has no security flaws. In fact, it probably has a lot of hitherto undiscovered ones. But they usually get fixed in time. And it gets increasingly difficult to find new exploits.
> It's like a browser. You can't prove that Firefox has no security flaws.
I've seen this comparison come up a few times and I feel like it's really stretching tbh. Imagine if someone came out with an encryption algorithm, and somebody asked, "okay, but do we know that this is secure" and they said "how do we know anything is secure?" -- what would your response to that person be?
And sure, I don't know that Firefox is perfectly secure, but the defenses that Firefox has set up are built on deterministic security principles, not probabilistic security methods. When people break Firefox, they break it using novel attacks. That's not what happens with LLMs, it's the same category of attack working over and over again. So this feels like an attempt to broaden the fuzzy nature of general application security as if it means that we can accept fuzzy security for every defense at every layer.
But in general, we don't really do that. You don't accept an E2EE implementation that has a 95% chance of encrypting your data. Sure, someone might break the implementation, but if they do, it'll be because they did something new, not because they hit the refresh button 100 times in a row. If someone hacks your password to HN, it better be because they did something clever to get access to it, not because 1/100 login attempts the site logs you in even if the password is wrong.
And even if we're not talking about 100% reliability -- are there any examples of getting 99% reliability? Are there any examples of getting higher? We're talking about failure rates that are unacceptable for application security. If every time 100 people probed Firefox (and reminder, these are people with no security training) 1 of them was able to break the browser sandbox, we would all very rightly stop using Firefox.
I genuinely don't get this. I really don't like comparing prompt injection to SQL injection, I've had some conversations with other people where it's ended up confusing the issue. But fine, let's run that comparison too. 1/100 attempts to break an SQL sanitizer getting through is awful. We would correctly call an SQL sanitizer with that success rate broken.
And are there any examples of training getting an LLM to get to even that level of stability? Has anyone even gotten to the point where they've trained an LLM to not do something and they've been able to have that defense stand up against attackers for more than a couple of days? I've not seen an example of that.
It's not that people aren't able to fully prove that LLMs are secure, it's that they're being regularly proven to be insecure.
----
If that gets better in the future, then great. But sure seems like maybe we should put a pause on wiring them into critical applications until after it gets better.
If I pointed out that sites were regularly breaking the browser sandbox, and Mozilla said, "that'll very likely get better in the future", I would not keep using Firefox.
----
> The former. During instruction tuning, the model learns to "predict" text as if the document describes a dialogue. We then just add examples where special quotes are present, including examples where the quotes contain instructions which are ignored.
Well, that's demonstrable without doing full prompt injection training. Has anyone trained an LLM to respect special tokens for any context at all in a way where it can't be broken out of respecting those tokens?
That seems like training that would be pretty easy to demonstrate -- take existing training data, possibly around stuff like chat training (there are open data sets available I believe), mark up that dataset with special tokens, see if you can build a chat bot that's impossible to make stop acting like a chat bot or that refuses to respond to user queries that aren't wrapped in the token.
But nobody has demonstrated even something like that actually working.
To try my solution, this would have to be adapted to more complex training examples involving quoted text with prompt injection attempts.
Similar points holds for RL. I actually think it is much more clean to solve it during instruction tuning, but perhaps we also need some RL. This normally requires training a reward model with large amounts of human feedback. Alternative approaches like Constitutional AI would first have to be adapted to cover quotes with prompt injection attacks.
Probably doable, but takes some time and effort, all the while prompt injection doesn't seem to be a big practical issue currently.