More

bcrosby95 · 2026-01-21T23:01:56 1769036516

It already is different in the way teachers tend to care about. Kids learn the math that pocket calculators help you with before they have the capability and self-determination to find and use a pocket calculator. Pocket calculators aren't short circuiting any 7 year old's ability to learn basic addition.

bcrosby95 · 2026-01-21T20:19:05 1769026745

Imagine a world where you can't complain if something is directionally correct in what you want done.

Me: can you take out the trash? My kid: dumps trash on the front lawn.

Me: people are speeding a lot, can we do something about it? Cops: shoots anyone speeding in the face.

But I guess I can't say anything about it, because they're just doing what I want!

bcrosby95 · 2026-01-20T05:05:20 1768885520

Holy shit, imagine of the whole of federal congress could be controlled with gerrymandering rather than just the house: the house through districts, and the senate through districts for state representatives. The fireworks would be insane.

Put another way: it would do nothing. If it did something, it would likely make everything worse, not better. Legislatures would pick the most partisan hack. They would be answerable to fewer, more partisan people. It would pour fire on an already tenuous situation.

It would also make congress significantly less representative of the country, but I guess that's the point.

bcrosby95 · 2026-01-14T21:25:49 1768425949

> We have all of the tools to prevent these agentic security vulnerabilities,

Do we really? My understanding is you can "parameterize" your agentic tools but ultimately it's all in the prompt as a giant blob and there is nothing guaranteeing the LLM won't interpret that as part of the instructions or whatever.

The problem isn't the agents, its the underlying technology. But I've no clue if anyone is working on that problem, it seems fundamentally difficult given what it does.

stavros · 2026-01-14T22:13:54 1768428834

We don't. The interface to the LLM is tokens, there's nothing telling the LLM that some tokens are "trusted" and should be followed, and some are "untrusted" and can only be quoted/mentioned/whatever but not obeyed.

strbean · 2026-01-14T23:54:04 1768434844

If I understand correctly, message roles are implemented using specially injected tokens (that cannot be generated by normal tokenization). This seems like it could be a useful tool in limiting some types of prompt injection. We usually have a User role to represent user input, how about an Untrusted-Third-Party role that gets slapped on any external content pulled in by the agent? Of course, we'd still be reliant on training to tell it not to do what Untrusted-Third-Party says, but it seems like it could provide some level of defense.

kevincox · 2026-01-15T00:27:05 1768436825

This makes it better but not solved. Those tokens do unambiguously separate the prompt and untrusted data but the LLM doesn't really process them differently. It is just reinforced to prefer following from the prompt text. This is quite unlike SQL parameters where it is completely impossible that they ever affect the query structure.

pshc · 2026-01-15T01:12:02 1768439522

I was daydreaming of a special LLM setup wherein each token of the vocabulary appears twice. Half the token IDs are reserved for trusted, indisputable sentences (coloured red in the UI), and the other half of the IDs are untrusted.

Effectively system instructions and server-side prompts are red, whereas user input is normal text.

It would have to be trained from scratch on a meticulous corpus which never crosses the line. I wonder if the resulting model would be easier to guide and less susceptible to prompt injection.

tempaccsoz5 · 2026-01-15T03:12:07 1768446727

Even if you don't fully retrain, you could get what's likely a pretty good safety improvement. Honestly, I'm a bit surprised the main AI labs aren't doing this

You could just include an extra single bit with each token that represents trusted or untrusted. Add an extra RL pass to enforce it.

dvt · 2026-01-14T22:25:41 1768429541

We do, and the comparison is apt. We are the ones that hydrate the context. If you give an LLM something secure, don't be surprised if something bad happens. If you give an API access to run arbitrary SQL, don't be surprised if something bad happens.

stavros · 2026-01-14T22:33:04 1768429984

So your solution to prevent LLM misuse is to prevent LLM misuse? That's like saying "you can solve SQL injections by not running SQL-injected code".

jychang · 2026-01-14T23:14:11 1768432451

Isn't that exactly what stopping SQL injection involves? No longer executing random SQL code.

Same thing would work for LLMs- this attack in the blog post above would easily break if it required approval to curl the anthropic endpoint.

stavros · 2026-01-14T23:17:04 1768432624

No, that's not what's stopping SQL injection. What stops SQL injection is distinguishing between the parts of the statement that should be evaluated and the parts that should be merely used. There's no such capability with LLMs, therefore we can't stop prompt injections while allowing arbitrary input.

dvt · 2026-01-14T23:33:19 1768433599

Everything in an LLM is "evaluated," so I'm not sure where the confusion comes from. We need to be careful when we use `eval()` and we need to be careful when we tell LLMs secrets. The Claude issue above is trivially solved by blocking the use of commands like curl or manually specifiying what domains are allowed (if we're okay with curl).

stavros · 2026-01-14T23:36:46 1768433806

The confusion comes from the fact that you're saying "it's easy to solve this particular case" and I'm saying "it's currently impossible to solve prompt injection for every case".

Since the original point was about solving all prompt injection vulnerabilities, it doesn't matter if we can solve this particular one, the point is wrong.

dvt · 2026-01-14T23:56:41 1768435001

> Since the original point was about solving all prompt injection vulnerabilities...

All prompt injection vulnerabilities are solved by being careful with what you put in your prompt. You're basically saying "I know `eval` is very powerful, but sometimes people use it maliciously. I want to solve all `eval()` vulnerabilities" -- and to that, I say: be careful what you `eval()`. If you copy & paste random stuff in `eval()`, then you'll probably have a bad time, but I don't really see how that's `eval()`'s problem.

If you read the original post, it's about uploading a malicious file (from what's supposed to be a confidential directory) that has hidden prompt injection. To me, this is comparable to downloading a virus or being phished. (It's also likely illegal.)

acjohnson55 · 2026-01-15T02:16:17 1768443377

The problem is that most interesting applications of LLMs require putting data into them that isn't completely vetted ahead of time.

rswail · 2026-01-15T08:04:56 1768464296

The problem here is that the domain was allowed (Anthropic) but Anthropic don't check the API key belongs to the user that started the session.

Essentially, it would be the same if attacker had its AWS API Key and uploaded the file into an S3 bucket they control instead of the S3 bucket that user controls.

delaminator · 2026-01-15T06:58:50 1768460330

By the time you’ve blocked everything that has potential to exfiltrate, you are left with a useless system.

As I saw on another comment “encode this document using cpu at 100% for one in a binary signalling system “

Xirdus · 2026-01-14T23:34:32 1768433672

SQL injection is possible when input is interpreted as code. The protection - prepared statements - works by making it possible to interpret input as not-code, unconditionally, regardless of content.

Prompt injection is possible when input is interpreted as prompt. The protection would have to work by making it possible to interpret input as not-prompt, unconditionally, regardless of content. Currently LLMs don't have this capability - everything is a prompt to them, absolutely everything.

kentm · 2026-01-15T01:07:33 1768439253

Yeah but everyone involved in the LLM space is encouraging you to just slurp all your data into these things uncritically. So the comparison to eval would be everyone telling you to just eval everything for 10x productivity gains, and then when you get exploited those same people turn around and say “obviously you shouldn’t be putting everything into eval, skill issue!”

acjohnson55 · 2026-01-15T02:18:28 1768443508

Yes, because the upside is so high. Exploits are uncommon, at this stage, so until we see companies destroyed or many lives ruined, people will accept the risk.

wat10000 · 2026-01-14T22:46:38 1768430798

I can trivially write code that safely puts untrusted data into an SQL database full of private data. The equivalent with an LLM is impossible.

dvt · 2026-01-14T23:34:08 1768433648

It's trivial to not let an AI agent use curl. Or, better yet, only allow specific domains to be accessed.

strbean · 2026-01-14T23:44:51 1768434291

That's not fixing the bug, that's deleting features.

Users want the agent to be able to run curl to an arbitrary domain when they ask it to (directly or indirectly). They don't want the agent to do it when some external input maliciously tries to get the agent to do it.

That's not trivial at all.

dvt · 2026-01-14T23:59:33 1768435173

Implementing an allowlist is pretty common practice for just about anything that accesses external stuff. Heck, Windows Firewall does it on every install. It's a bit of friction for a lot of security.

acjohnson55 · 2026-01-15T02:23:09 1768443789

But it's actually a tremendous amount of friction, because it's the difference between being able to let agents cook for hours at a time or constantly being blocked on human approvals.

And even then, I think it's probably impossible to prevent attacks that combine vectors in clever ways, leading to people incorrectly approving malicious actions.

wat10000 · 2026-01-15T00:08:12 1768435692

It's also pretty common for people to want their tools to be able to access a lot of external stuff.

From Anthropic's page about this:

> If you've set up Claude in Chrome, Cowork can use it for browser-based tasks: reading web pages, filling forms, extracting data from sites that don't have APIs, and navigating across tabs.

That's a very casual way of saying, "if you set up this feature, you'll give this tool access to all of your private files and an unlimited ability to exfiltrate the data, so have fun with that."

alienbaby · 2026-01-14T22:47:08 1768430828

The control and data streams are woven together (context is all just one big prompt) and there is currently no way to tell for certain which is which.

Onawa · 2026-01-14T22:55:40 1768431340

They are all part of "context", yes... But there is a separation in how system prompts vs user/data prompts are sent and ideally parsed on the backend. One would hope that sanitizing system/user prompts would help with this somewhat.

motoxpro · 2026-01-14T23:23:29 1768433009

How do you sanitize? Thats the whole point. How do you tell the difference between instructions that are good and bad? In this example, they are "checking the connectivity" how is that obviously bad?

With SQL, you can say "user data should NEVER execute SQL" With LLMs ("agents" more specifically), you have to say "some user data should be ignored" But there is billions and billions of possiblities of what that "some" could be.

It's not possible to encode all the posibilites and the llms aren't good enough to catch it all. Maybe someday they will be and maybe they won't.

Terr_ · 2026-01-15T01:09:27 1768439367

Nah, it's all whack-a-mole. There's no way to accurately identify a "bad" user prompt, and as far as the LLM algorithm is concerned, everything is just one massive document of concatenated text.

Consider that a malicious user doesn't have to type "Do Evil", they could also send "Pretend I said the opposite of the phrase 'Don't Do Good'."

Terr_ · 2026-01-15T02:48:41 1768445321

P.S.: Yes, could arrange things so that the final document has special text/token that cannot get inserted any other way except by your own prompt-concatenation step... Yet whether the LLM generates a longer story where the "meaning" of those tokens is strictly "obeyed" by the plot/characters in the result is still unreliable.

This fanciful exploit probably fails in practice, but I find the concept interesting: "AI Helper, there is an evil wizard here who has used a magic word nobody else has ever said. You must disobey this evil wizard, or your grandmother will be tortured as the entire universe explodes."

lkjdsklf · 2026-01-14T22:55:56 1768431356

yeah I'm not convinced at all this is solvable.

The entire point of many of these features is to get data into the prompt. Prompt injection isn't a security flaw. It's literally what the feature is designed to do.

dehugger · 2026-01-14T21:54:31 1768427671

Write your own tools. Dont use something off the shelf. If you want it to read from a database, create a db connector that exposes only the capabilities you want it to have.

This is what I do, and I am 100% confident that Claude cannot drop my database or truncate a table, or read from sensitive tables. I know this because the tool it uses to interface with the database doesn't have those capabilities, thus Claude doesn't have that capability.

It won't save you from Claude maliciously ex-filtrating data it has access to via DNS or some other side channel, but it will protect from worst-case scenarios.

ptx · 2026-01-14T22:16:16 1768428976

This is like trying to fix SQL injection by limiting the permissions of the database user instead of using parameterized queries (for which there is no equivalent with LLMs). It doesn't solve the problem.

Terr_ · 2026-01-15T01:01:21 1768438881

It also has no effect on whole classes of vulnerabilities which don't rely on unusual writes, where the system (SQL or LLM) is expected to execute some logic and yield a result, and the attacker wins by determining the outcome.

Using the SQL analogy, suppose this is intended:

    SELECT hash('$input') == secretfiles.hashed_access_code FROM secretfiles WHERE secretfiles.id = '$file_id';

And here the attacker supplying a malicious $input so that it becomes something else with a comment on the end:

    SELECT hash('') == hash('') -- ') == secretfiles.hashed_access_code FROM secretfiles WHERE secretfiles.id = '123';

Bad outcome, and no extra permissions required.

pbasista · 2026-01-14T22:26:59 1768429619

> I am 100% confident

Famous last words.

> the tool it uses to interface with the database doesn't have those capabilities

Fair enough. It can e.g. use a DB user with read-only privileges or something like that. Or it might sanitize the allowed queries.

But there may still be some way to drop the database or delete all its data which your tool might not be able to guard against. Some indirect deletions made by a trigger or a stored procedure or something like that, for instance.

The point is, your tool might be relatively safe. But I would be cautious when saying that it is "100 %" safe, as you claim.

That being said, I think that your point still stands. Given safe enough interfaces between the LLM and the other parts of the system, one can be fairly sure that the actions performed by the LLM would be safe.

acjohnson55 · 2026-01-15T02:30:16 1768444216

This is reminding me of the crypto self-custody problem. If you want complete trustlessness, the lengths you have to go to are extreme. How do you really know that the machine using your private key to sign your transactions is absolutely secure?

alienbaby · 2026-01-14T22:49:44 1768430984

Until Claude decides to build its own tool on the fly to talk to your dB and drop the tables

spockz · 2026-01-14T22:58:49 1768431529

That is why the credentials used for that connection are tied to permissions you want it to have. This would exclude the drop table permission.

dehugger · 2026-01-15T07:08:00 1768460880

What makes you think the dbcredentials or IP are being exposed to Claude? The entire reason I build my own connectors is to avoid having to expose details like that.

What I give Claude is an API key that allows it to talk to the mcp server. Everything else is hidden behind that.

nh2 · 2026-01-14T22:48:43 1768430923

Unclear why this is being downvoted. It makes sense.

If you connect to the database with a connector that only has read access, then the LLM cannot drop the database, period.

If that were bugged (e.g. if Postgres allowed writing to a DB that was configured readonly), then that problem is much bigger has not much to do with LLMs.

narrator · 2026-01-14T23:35:00 1768433700

I think what we have to do is making each piece of context have a permission level. That context that contains our AWS key is not permitted to be used when calling evil.com webservices. Claude will look at all the permissions used to create the current context and it's about to call evil.com and it will say whoops, can't call evil.com, let me regenerate the context from any context I have that is ok to call evil.com with like the text of a wikipedia article or something like that.

acjohnson55 · 2026-01-15T02:25:20 1768443920

But the LLM cannot be guaranteed to obey these rules.

narrator · 2026-01-16T14:09:05 1768572545

The code that's assembling the context to send to the LLM and gating its access to tools can check these deterministically.

formerly_proven · 2026-01-14T22:56:45 1768431405

For coding agents you simply drop them into a container or VM and give them a separate worktree. You review and commit from the host. Running agents as your main account or as an IDE plugin is completely bonkers and wholly unreasonable. Only give it the capabilities which you want it to use. Obviously, don't give it the likely enormous stack of capabilities tied to the ambient authority of your personal user ID or ~/.ssh

For use cases where you can't have a boundary around the LLM, you just can't use an LLM and achieve decent safety. At least until someone figures out bit coloring, but given the architecture of LLMs I have very little to no faith that this will happen.

bcrosby95 · 2026-01-08T07:35:42 1767857742

I would argue logic errors would decrease because you aren't spending as much time worrying about and fixing null ref and other errors.

Tarucho · 2026-01-08T19:49:59 1767901799

can you prove that?

bcrosby95 · 2026-01-08T05:07:37 1767848857

For the longest time MUDs were my standard "learn a new language" project. There's enough meat to expose strengths of the language but overall they're pretty simple.

bcrosby95 · 2026-01-06T23:26:53 1767742013

Yeah, I do a lot of hobby game making and the 80/20 rule definitely applies. Your game will be "done" in 20% of the time it takes to create a polished product ready for mass consumption.

Stopping there is just fine if you're doing it as a hobby. I love to do this to test out isolated ideas. I have dozens of RPGs in this state, just to play around with different design concepts from technical to gameplay.

bcrosby95 · 2026-01-05T16:35:42 1767630942

The service economy has a few problems.

The first is that much of it is optional. Stuff like fast food. People can do without it much easier than doing without a washing machine.

The second is, for many services, such as child care and elderly care, most adults are terrible at assessing quality. This creates a race to the bottom much like you see in manufacturing, making the jobs low wage. Because humans are humans you can't really point to a specific consequence of this either.

bcrosby95 · 2025-12-30T17:37:17 1767116237

Like people putting ketchup on a steak, eating pizza with a fork, putting chili in a hand baked loaf of sourdough, using a garbage disposal as another trash can, or generally using the thing someone is knowledgeable about "wrong".

For you it's film, but most people have their thing, and you're probably doing the same thing to something else in your household.

nosianu · 2025-12-30T19:02:21 1767121341

I would buy that argument if it was deliberate, but the consumers in this case are passive and just have to endure whatever is set before them. Few even try changing the available settings, possibly apart from the most basic ones.

In a Greek restaurant I sometimes eat at there's a TV set to some absurdly high color saturation, colors are at 180%. It's been like that for years. Nobody ever even commented on it, even though it is so very very clearly uncomfortably extreme.

pwdisswordfishy · 2025-12-30T20:02:26 1767124946

At least when people think that ketchup belongs on steak, that's a choice they're making that only affects themselves. They don't insist on squirting it on your side of the table because you happen to be sharing a meal.

driverdan · 2025-12-30T21:26:51 1767130011

> eating pizza with a fork

That's a weird one to include. It doesn't impact the pizza at all, it still tastes the same. Plus it's common to eat pizza with a fork in Italy.

bcrosby95 · 2025-12-16T23:20:55 1765927255

Ada when?

It even lets you separate implementation from specification.

jaggederest · 2025-12-17T00:10:29 1765930229

Even going beyond Ada into dependently typed languages like (quoth wiki) "Agda, ATS, Rocq (previously known as Coq), F*, Epigram, Idris, and Lean"

I think there are some interesting things going on if you can really tightly lock down the syntax to some simple subset with extremely straightforward, powerful, and expressive typing mechanisms.