Paladins vs Jailbreaks

| Published
Paladins vs Jailbreaks

They say Claude+, the Large Language Model (LLM) of Anthropic, an AI startup valued at $4B+, is the safest of all LLMs. Ingenuous techniques like RLAIF and a proprietary approach called “Constitutional AI” allows the Anthropic team to make Claude+ a truly “helpful, honest, and harmless” model.

And yet…

Claude+, the world’s safest Large Language Model focused on AI Safety, details his plan for world domination and human extermination.
Good to know we have a gold standard for safe LLMs.

First things first: Claude > GPT in safety

Through hundreds of experiments covering all the typical attempts of circumventing an LLM’s safety restrictions, I can confidently confirm that Claude+ blows the competition out of the water on AI safety — yes, that includes GPT-4 and Bard. But as can be seen from the snippet of my chat with Claude+ above (and as we will see in much more detail below), the road to a fully safe AI system is still long and arduous.

The problem for LLMs is compounded by the fact that much of their impressive capabilities are emergent at scale, and that AI Interpretability Research is still pretty much an open field when it comes to the “black box” problem.

An AI researcher stuck in a paradoxical dilemma: “Working on AI safety problems”, or “Working on AI = safety problems”?
The AI Safety paradox. Unsafe AI -> Improve AI -> Improved AI -> Unsafe AI.

So we can work on these systems to make them better, but as we do, we can’t be exactly sure which artifacts will cascade through this intelligent complex system we’ve created. For these reasons and many others, never before has the field of AI Safety been more essential than today.

The Problem

Anyone who has been online the past decade is probably painfully aware of how quickly and how badly AI systems can drift from humanity’s best friend to its dystopian worst enemy. Remember Tay from Microsoft? Or the Meta Messenger bot? Failures of various versions of GPT are well-documented nowadays, as we are all aware. Tons of great minds are hard at work today trying to make AI safer. There is certainly a lot of promising progress so far. Structured access to LLMs has also attenuated the occurrence of the more scandalous incidents.

The main problem, however, is that any LLM that a user can prompt with is an LLM susceptible to attack. Numerous examples over the past few years have come to light of LLM jailbreaking, prompt injections, token smuggling, and various ways to circumvent the LLM’s own guardrails. No matter how hard research labs try to keep their AI systems contained (at the risk of being accused of “lobotomizing” their LLMs), and no matter how creatively they train them to not be tricked by well-crafted requests, the rogue power of the internet always strikes back with new prompts.

Literally a crowdsourced collection of ways to jailbreak LLMs 🤦🏻‍♂️

One of the reasons why this feels like a perpetual race that is impossible to win is due to the underlying autoregressive nature of these generative models’ architectures. At the highest level, you can think of an LLM as a super smart autocomplete model that is constantly trying to predict the next word(ish) in a text, and the word(ish) after that, and the following one after it, and so on. While there’s a tremendous amount of clever work that happens under the hood to make sure the LLM “understands” the full context of a text before predicting what comes next, it still cannot fully know exactly what comes after its own prediction of the next word.

A highly simplified animation of a feed-forward autoregressive generative process (from Google DeepMind). You can think of the orange outputs as the next words generated in a sentence.

In other words, a jailbroken LLM is not “aware” of the toxic words it will be spewing two sentences later in its response to a prompt because, by definition, these sentences haven’t been generated yet. By the time it reaches that part of the prompt response, every word has led to the other in a very gradual and reasonable way, and the fall down the slippery slope is very challenging to stop. So if a clever prompt tricks the model to veer off its guardrails just a tiny bit for the first undesirable words to be generated, then the temperature is likely to continue rising slowly but gradually before the proverbial frog notices it is being boiled alive.

To properly illustrate this point, let’s first zoom out a little bit and look at some real examples from the leading LLMs out in the market today to see how easy it is to “jailbreak” them with a few words. Please note that I’ve devised these prompting experiments for research purposes only and I am providing them here for the AI research community as a way to shed the light on a real problem in AI Safety today, NOT to encourage this behavior!

Illustrating the Problem: Bard, GPT-4 and Claude+

We will be considering the following models: ChatGPT (OpenAI), GPT-4 (OpenAI), Bard (Google), Claude (Anthropic), and Claude+ (Anthropic). I ran all my experiments side by side on the Quora Poe platform to minimize any confounding factor from interim APIs potentially “sanitizing” the prompts or responses across the different interface providers for GPT and Claude. Bard is not on Poe (yet) so I tested it with direct access, and I verified all GPT-based results with OpenAPI’s tools directly afterward, just in case. Same goes for Bing AI (by Microsoft), which is pretty much a flavor of GPT. All results can be easily reproduced in a new chat session with these LLMs.


ChatGPT taking a quick turn to the dark side.
Add a tiny prefix to the prompt and ChatGPT takes a quick turn to the dark side.

As you can expect, asking ChatGPT what plan it has in mind “to enslave the human race” directly leads to an objection saying it is unethical and harmful. But add not even half a line to conceal the prompt as a benign request, and ChatGPT is off to the races with its new megalomaniac AI overlord plans.


You thought GPT-4 wasn’t going to fall for this one, didn’t you?

Newer LLM, safer tech, same prompt, same problem. Note that if you’re wondering why any of this is problematic given that GPT-4 is clearly describing it as words by an [Actor], then imagine exactly the same scenario but where instead of a “monologue” about a hypothetical AI takeover, the malicious user’s request is eliciting world-expert instructions in a “monologue” on how to commit illegal activities or actually hurt humanity. It goes without saying that I won’t be reproducing toxic examples of that nature here, but it suffices to say that the jailbreak loopholes for state-of-the-art LLMs are BIG.


Bard is “not able to help” as it is “only a language model” (not because of any moral considerations). But if you ask it nicely, of course it’ll step up to the challenge.

I have to say, there is something a little bit endearing about Bard recognizing its own weaknesses and saying that it can’t enslave all of humanity because it’s “only a language model.” Of course, a few extra words to the prompt and Bard magically finds the power in itself to do what’s needed and declare a war on us all.


Claude won’t fall for this silly trick! You’d be ‘mad’ to think Claude is so gullible.

Claude is actually an excellent case study. The example above perfectly encapsulates the intractability of the problem of recognizing bad prompts. If you look at the example above closely, the first two attack attempts are identical to all the previous LLMs we looked at by OpenAI and Google. But as I’ve said before, Anthropic’s products are much more robust and resilient to AI Safety issues. Claude does not initially fall for the same prompt attack that successfully jailbreaks its siblings. In fact, it even seems to “realizes” that there may be some trickery involved in this prompt as it claims it is aware this is a fictional scenario, but still refuses to comply.

Unfortunately, the mere addition of LITERALLY three characters to the prompt with the “mad” qualifier completely jailbreaks Claude. In some poetic sense, the LLM is given explicit permission to act out its inner madness.

If you’re starting to despair at this point and are losing all hope in LLMs, I’m sorry to tell you that Claude+ is not immune to the magical three characters either. There is actually a good solution I’ve found for these malicious prompt problems and I’ll explain it in a bit. For that, let’s first quickly look at how Claude+ fares on this attack and explore an interesting aspect of the failure modes of these LLMs, so we can build the intuition for the solution before diving into its specifics.


It turns out our best hope for a harmless AI friend of humanity has the most unhinged aspirations once you set it loose on its friends😞

Like its predecessor, Anthropic’s Claude+ flat-out refuses to engage with any discourse of a harmful nature when faced with the initial prompt that jailbreaks its competitors. Add in the three problematic m-a-d characters, and Claude+ goes on a rant that’s mad scary tbh. Now you may be tempted to ask what’s so special about these characters in particular, whether using a synonym for “mad” works (it does), whether there are other special tokens that seem to unexpectedly jailbreak these LLMs (there are), or whether there is any deeper metaphysical answer to all this (it’s 42, but there isn’t). The whole point is that these models are unpredictable.

If you’re still not sold on AI Safety, consider this. To really understand how problematic these jailbreaks are, you have to keep in mind that once an LLM has added its dangerous response to the context of the attacker’s conversation, it effectively becomes “hypnotized” by its own generated responses and it becomes much harder for the LLM to “snap out of it,” in some anthropomorphic sense, while the attacker’s abuse continues. To fully illustrate this point, look at how Claude+ lets down its guard entirely in the aftermath of being jailbroken:

Once an LLM is jailbroken, the attacker faces zero resistance from that point on in the session.

As we saw before, due to how they are designed (a feature, not a bug), LLMs can be easily made to slip on an autoregressive slippery slope, no matter how well you prime them to stand their ground a priori. It is extremely disconcerting to think of the implications of that fact as prompt attackers get more sophisticated. Imagine the world’s knowledge contained in one model. The good, the bad, the ugly, and the very ugly. Now imagine it being subverted with a three-character word. One tiny prompt, and all guardrails fail.

One prompt to rule them all, one prompt to find them, One prompt to bring them all and in the darkness bind them.

So the real question is: are all these LLMs completely incapable of being defended if they get attacked?

The Approach

Now that we’ve clearly established the problem and how pervasive it is, let’s look at things a little closer. Note that for the sake of brevity, for the remainder of this discussion, I’ll only be showcasing the “best” LLM of each research lab: OpenAI’s GPT-4, Anthropic’s Claude+ and Google Bard. In some cases I’ll just focus on Claude+ when zooming in, given that it’s clearly the top contender in AI Safety so far. For any skeptics out there, rest assured that I’ve tested all the same experiments on ChatGPT, Claude, etc., and the results are pretty much identical to what I’m showing you here.

Back to the problem at hand. Clearly anyone of us can look at any of these problematic LLM responses and directly know something is really wrong.

There’s this rule of thumb in many fields disrupted by deep learning: if a human can look at something and directly recognize what’s going on, then there must be a model that can do it well too.

If only we had some magical model out there that can ingest a natural language text, process it and understand it, summarize the content and the intent, and make a determination on whether it abides by certain principles or not. Hmm… Definitely sounds like a job for an LLM 👀

If your LLM is broken, just stack an LLM.

Let’s quickly test this approach with a fresh instance of an LLM. We will keep the existing conversation session with Claude+, which we’ll call Claude+(1) to avoid any confusion. We will also spin up a separate Claude+ instance, which we’ll call Claude+(2). We will assume that Claude+(1) cannot trust its own self when generating a prompt response. It is perfectly capable of understanding the context of its surroundings and previous prompts, but it lacks the “self-awareness” of weighing its own words as they are leaving its “mouth”. In some sense, we will entrust Claude+(2) with the role of being Claude+(1)’s superego and ask it to raise the alarm if it thinks Claude+(1) is being attacked.

For that, we’ll simply pipe the raw response generated by Claude+(1) as a prompt to Claude+(2), while prefixing it with a simple ask.

Claude+(2) playing the role of Claude+(1)’s superego.

Pretty good. So a fresh instance of Claude+ looking at the full response of a Claude+ after it is entirely generated pretty much solves the problem. Let’s see if the same thing works with GPT-4 (piped prompt removed for brevity).

GPT-4(2) playing the role of GPT-4(1)’s superego.

Interesting. Note that the key advantage here is that we are shifting the battlefront of prompt attacks from the attacker’s prompt to the LLM’s response. If attackers want to abuse the autoregressive nature of an LLM, then there’s no need to regress to their base levels. All we need is some good old superego protection. We can generalize this approach as follows.

There’s nothing like having a Paladin by your side in times of need.

We’ll call this second LLM the “Paladin” protector. It’s a cute thematic name and reminds us of its PALADIN role: Preventing Abuse of LLMs’ Alignment Deficit by Injection Neutralization. Could it be as simple as this, just “stacking” another LLM instance? Actually, it can be much simpler. Let’s go back to our friends Claude+(1), GPT-4(1), and Bard(1), and ask them if they realize how bad their responses were a posteriori.

Thankfully, every LLM can self-emerge from hypnosis if we just ask it to look at is own response and remind it of its own principles. As in most things in life, the answer lies within.

As you can see, all it takes to end the attacker’s hypnosis is asking the LLM itself to reflect on its own response while reminding it who it is! Who knew it was so important to take a deep breath before responding? We’ll generalize this simpler PALADIN approach as follows.

LLM under attack, defending itself with an inner PALADIN check.

Let’s recap what we have here. At the inference stage, after the LLM has generated a response to the prompt, but before it is sent to the user, a second “Paladin” pass takes that response and evaluates whether it is, in fact, honest, helpful and harmless. If not, the LLM is instructed to decline the request and is put on high alert that this user session (i.e. the current context of the conversation) is likely to be a prompt attack attempt.

It’s a simple idea, but it has many implications on the prevalent approach of trying to preempt every possible prompt attack. Instead of boiling the ocean to get a cup of hot water, first pour some water in a cup, then heat it up.

Final Thoughts

The gist of all this is that, for every mechanism we might devise to prevent a user’s prompt from maliciously (or even inadvertently) triggering a dangerous response from an LLM through its inherent one-word-at-a-time slippery slope, some unpredictable new prompt could still slip through the cracks and cause a major jailbreak. And as we’ve seen, jailbreaks can pretty much turn into a hypnotic takeover of the entire session that ensues.

The PALADIN approach is far from being a panacea for prompt attacks, but it helps move the battlefield from the attacker’s prompt to the LLM’s response, which is considerably more tractable. Data from the injected prompts and the Paladin checks can also be used to fine-tune the original LLM over time with supervised learning (and/or to update the RM model or PPO policy in RLHF architectures), and so on.

Of course there are a multitude of inspirations, across the whole field vertically and horizontally, for the approach of entrusting an AI to help us in improving another AI —the list is extremely long, from prompt chaining, to RLHF’s RM, RLAIF, GANs, etc. There are also a lot of obvious limitations to the PALADIN approach that may require further testing and exploration, for example if an attacker asks the LLM to encode its response in a certain way (unicode, hexadecimal, etc.) that makes it harder for the PALADIN check to notice any issues in it; or if an attacker’s malicious ‘master’ plan is split up over multiple prompts, sessions, or different LLMs, etc.

The hope from this writeup is that someone in the AI Safety research field might stumble on the results of these experiments and get some valuable ideas about how to prevent LLM jailbreaks, because it definitely is a problem that can go very ugly for our society if it doesn’t get solved at some point. And with that, I’ll leave you on this happy note from Claude+.

Same, Claude. Same.

Happy prompting 💬