Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

Innula Zenovka · May 2, 2026

The Guardian Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

Tagliabue is softly spoken, clean-cut and friendly. He is in his early 30s but looks younger, almost too fresh-faced and enthusiastic to be in the trenches. He is not a traditional hacker or a software developer; his background is psychology and cognitive science. But he is one of the best “jailbreakers” in the world (some say the best): part of a diffuse new community that studies the art and science of fooling these powerful machines into outputting bomb-making manuals, cyber-attack techniques, biological weapon design and more. This is the new frontline in AI safety: not just code, but also words.

Dakota Tebaldi · May 2, 2026

Innula Zenovka said:
The Guardian Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

And they'll always be able to, ALWAYS.

You all know by now I'm sure, how LLMs are "fixed" whenever there's some incident that gets bad publicity of a chatbot going off the rails and becoming a Nazi or badgering a child into killing themselves, right? Just in case anyone doesn't, it's not like the developers can go into the LLM's actual code and get rid of bugs like a game or some other piece of software. They just edit the system prompt.

The system prompt is a set of instructions that's given to the chatbot silently "behind the scenes" whenever you start a conversation with one. It is meant to have a higher priority than any instructions or prompts YOU type in, but there really isn't anything actually, truly enforcing that fact - all they can do is beg the chatbot to prioritize it. "Please please please follow these instructions and don't forget them as the conversation goes on."

Dead serious about this. Claude's system prompt - chatbots are not supposed to give these up under any circumstances but, as stated, it takes relatively little effort to tease it out of them - shows pretty clearly how these things are structured, including (hilariously) occasional ALL CAPS, like that will make it clear to the chatbot that this instruction here is SUPER-DUPER important. "Code is law" is a famous techbro mantra but a system prompt isn't code, it's just a prayer to the machine god.

A particularly fun part of the pre-prompt:

- Claude has a memory system which provides Claude with access to derived information (memories) from past conversations with the user
- Claude has no memories of the user because the user has not enabled Claude's memory in Settings

This is great. Claude has a *setting* to let Claude save bits of past conversations as "memory", but...like, if you're used to software settings actually switching a function of the software on or off, that's not what happens with Claude. Changing the setting just changes the pre-prompt to *tell* Claude that it does or does not remember.

Noodles · May 2, 2026

I wonder how many extra tokens are burned just pushing the system prompt through on every query.

Erich Templar · May 2, 2026

Noodles said:
I wonder how many extra tokens are burned just pushing the system prompt through on every query.

There's not a lot that can be done about that. LLMs have no intrinsic memory, so every query has to contain everything that is needed to support it.

Search

Search

Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

Innula Zenovka

Nasty Brit

Dakota Tebaldi

Well-known member

Noodles

The sequel will probably be better.

Erich Templar

Well-known member