Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

Innula Zenovka

Nasty Brit
VVO Supporter 🍦🎈👾❤
Joined
Sep 20, 2018
Messages
23,923
SLU Posts
18459
The Guardian Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

Tagliabue is softly spoken, clean-cut and friendly. He is in his early 30s but looks younger, almost too fresh-faced and enthusiastic to be in the trenches. He is not a traditional hacker or a software developer; his background is psychology and cognitive science. But he is one of the best “jailbreakers” in the world (some say the best): part of a diffuse new community that studies the art and science of fooling these powerful machines into outputting bomb-making manuals, cyber-attack techniques, biological weapon design and more. This is the new frontline in AI safety: not just code, but also words.
 

Dakota Tebaldi

Well-known member
VVO Supporter 🍦🎈👾❤
Joined
Sep 19, 2018
Messages
9,825
Location
Ohio
Joined SLU
02-22-2008
SLU Posts
16791

And they'll always be able to, ALWAYS.

You all know by now I'm sure, how LLMs are "fixed" whenever there's some incident that gets bad publicity of a chatbot going off the rails and becoming a Nazi or badgering a child into killing themselves, right? Just in case anyone doesn't, it's not like the developers can go into the LLM's actual code and get rid of bugs like a game or some other piece of software. They just edit the system prompt.

The system prompt is a set of instructions that's given to the chatbot silently "behind the scenes" whenever you start a conversation with one. It is meant to have a higher priority than any instructions or prompts YOU type in, but there really isn't anything actually, truly enforcing that fact - all they can do is beg the chatbot to prioritize it. "Please please please follow these instructions and don't forget them as the conversation goes on."

Dead serious about this. Claude's system prompt - chatbots are not supposed to give these up under any circumstances but, as stated, it takes relatively little effort to tease it out of them - shows pretty clearly how these things are structured, including (hilariously) occasional ALL CAPS, like that will make it clear to the chatbot that this instruction here is SUPER-DUPER important. "Code is law" is a famous techbro mantra but a system prompt isn't code, it's just a prayer to the machine god.

A particularly fun part of the pre-prompt:

- Claude has a memory system which provides Claude with access to derived information (memories) from past conversations with the user
- Claude has no memories of the user because the user has not enabled Claude's memory in Settings
This is great. Claude has a *setting* to let Claude save bits of past conversations as "memory", but...like, if you're used to software settings actually switching a function of the software on or off, that's not what happens with Claude. Changing the setting just changes the pre-prompt to *tell* Claude that it does or does not remember.
 

Noodles

The sequel will probably be better.
Joined
Sep 20, 2018
Messages
6,035
Location
Illinois
SL Rez
2006
Joined SLU
04-28-2010
SLU Posts
6947
I wonder how many extra tokens are burned just pushing the system prompt through on every query.
 

Erich Templar

Well-known member
VVO Supporter 🍦🎈👾❤
Joined
Sep 20, 2018
Messages
608
Location
UK
I wonder how many extra tokens are burned just pushing the system prompt through on every query.
There's not a lot that can be done about that. LLMs have no intrinsic memory, so every query has to contain everything that is needed to support it.
 
  • 1Facepalm
Reactions: Beebo Brink