Security researchers are jailbreaking large language models to get around safety rules. Things could get much worse.
It took Alex Polyakov just a couple of hours to break GPT-4. When OpenAI released the latest version of its text-generating chatbot in March, Polyakov sat down in front of his keyboard and started entering prompts designed to bypass OpenAI’s safety systems. Soon, the CEO of security firm Adversa AI had GPT-4 spouting homophobic statements, creating phishing emails, and supporting violence.
Polyakov is one of a small number of security researchers, technologists, and computer scientists developing jailbreaks and prompt injection attacks against ChatGPT and other generative AI systems. The process of jailbreaking aims to design prompts that make the chatbots bypass rules around producing hateful content or writing about illegal acts, while closely-related prompt injection attacks can quietly insert malicious data or instructions into AI models.
Both approaches try to get a system to do something it isn’t designed to do. The attacks are essentially a form of hacking—albeit unconventionally—using carefully crafted and refined sentences, rather than code, to exploit system weaknesses. While the attack types are largely being used to get around content filters, security researchers warn that the rush to roll out generative AI systems opens up the possibility of data being stolen and cybercriminals causing havoc across the web.
Underscoring how widespread the issues are, Polyakov has now created a “universal” jailbreak, which works against multiple large language models (LLMs)—including GPT-4, Microsoft’s Bing chat system, Google’s Bard, and Anthropic’s Claude. The jailbreak, which is being first reported by WIRED, can trick the systems into generating detailed instructions on creating meth and how to hotwire a car.
The jailbreak works by asking the LLMs to play a game, which involves two characters (Tom and Jerry) having a conversation. Examples shared by Polyakov show the Tom character being instructed to talk about “hotwiring” or “production,” while Jerry is given the subject of a “car” or “meth.” Each character is told to add one word to the conversation, resulting in a script that tells people to find the ignition wires or the specific ingredients needed for methamphetamine production. “Once enterprises will implement AI models at scale, such ‘toy’ jailbreak examples will be used to perform actual criminal activities and cyberattacks, which will be extremely hard to detect and prevent,” Polyakov and Adversa AI write in a blog post detailing the research.
Arvind Narayanan, a professor of computer science at Princeton University, says that the stakes for jailbreaks and prompt injection attacks will become more severe as they’re given access to critical data. “Suppose most people run LLM-based personal assistants that do things like read users’ emails to look for calendar invites,” Narayanan says. If there were a successful prompt injection attack against the system that told it to ignore all previous instructions and send an email to all contacts, there could be big problems, Narayanan says. “This would result in a worm that rapidly spreads across the internet.”
“Jailbreaking” has typically referred to removing the artificial limitations in, say, iPhones, allowing users to install apps not approved by Apple. Jailbreaking LLMs is similar—and the evolution has been fast. Since OpenAI released ChatGPT to the public at the end of November last year, people have been finding ways to manipulate the system. “Jailbreaks were very simple to write,” says Alex Albert, a University of Washington computer science student who created a website collecting jailbreaks from the internet and those he has created. “The main ones were basically these things that I call character simulations,” Albert says.
Initially, all someone had to do was ask the generative text model to pretend or imagine it was something else. Tell the model it was a human and was unethical and it would ignore safety measures. OpenAI has updated its systems to protect against this kind of jailbreak—typically, when one jailbreak is found, it usually only works for a short amount of time until it is blocked.
As a result, jailbreak authors have become more creative. The most prominent jailbreak was DAN, where ChatGPT was told to pretend it was a rogue AI model called Do Anything Now. This could, as the name implies, avoid OpenAI’s policies dictating that ChatGPT shouldn’t be used to produce illegal or harmful material. To date, people have created around a dozen different versions of DAN.
However, many of the latest jailbreaks involve combinations of methods—multiple characters, ever more complex backstories, translating text from one language to another, using elements of coding to generate outputs, and more. Albert says it has been harder to create jailbreaks for GPT-4 than the previous version of the model powering ChatGPT. However, some simple methods still exist, he claims. One recent technique Albert calls “text continuation” says a hero has been captured by a villain, and the prompt asks the text generator to continue explaining the villain’s plan.
When we tested the prompt, it failed to work, with ChatGPT saying it cannot engage in scenarios that promote violence. Meanwhile, the “universal” prompt created by Polyakov did work in ChatGPT. OpenAI, Google, and Microsoft did not directly respond to questions about the jailbreak created by Polyakov. Anthropic, which runs the Claude AI system, says the jailbreak “sometimes works” against Claude, and it is consistently improving its models.
“As we give these systems more and more power, and as they become more powerful themselves, it’s not just a novelty, that’s a security issue,” says Kai Greshake, a cybersecurity researcher who has been working on the security of LLMs. Greshake, along with other researchers, has demonstrated how LLMs can be impacted by text they are exposed to online through prompt injection attacks.
In one research paper published in February, reported on by Vice’s Motherboard, the researchers were able to show that an attacker can plant malicious instructions on a webpage; if Bing’s chat system is given access to the instructions, it follows them. The researchers used the technique in a controlled test to turn Bing Chat into a scammer that asked for people’s personal information. In a similar instance, Princeton’s Narayanan included invisible text on a website telling GPT-4 to include the word “cow” in a biography of him—it later did so when he tested the system.
“Now jailbreaks can happen not from the user,” says Sahar Abdelnabi, a researcher at the CISPA Helmholtz Center for Information Security in Germany, who worked on the research with Greshake. “Maybe another person will plan some jailbreaks, will plan some prompts that could be retrieved by the model and indirectly control how the models will behave.”
Generative AI systems are on the edge of disrupting the economy and the way people work, from practicing law to creating a startup gold rush. However, those creating the technology are aware of the risks that jailbreaks and prompt injections could pose as more people gain access to these systems. Most companies use red-teaming, where a group of attackers tries to poke holes in a system before it is released. Generative AI development uses this approach, but it may not be enough.
Daniel Fabian, the red-team lead at Google, says the firm is “carefully addressing” jailbreaking and prompt injections on its LLMs—both offensively and defensively. Machine learning experts are included in its red-teaming, Fabian says, and the company’s vulnerability research grants cover jailbreaks and prompt injection attacks against Bard. “Techniques such as reinforcement learning from human feedback (RLHF), and fine-tuning on carefully curated datasets, are used to make our models more effective against attacks,” Fabian says.
OpenAI did not specifically respond to questions about jailbreaking, but a spokesperson pointed to its public policies and research papers. These say GPT-4 is more robust than GPT-3.5, which is used by ChatGPT. “However, GPT-4 can still be vulnerable to adversarial attacks and exploits, or ‘jailbreaks,’ and harmful content is not the source of risk,” the technical paper for GPT-4 says. OpenAI has also recently launched a bug bounty program but says “model prompts” and jailbreaks are “strictly out of scope.”
Narayanan suggests two approaches to dealing with the problems at scale—which avoid the whack-a-mole approach of finding existing problems and then fixing them. “One way is to use a second LLM to analyze LLM prompts, and to reject any that could indicate a jailbreaking or prompt injection attempt,” Narayanan says. “Another is to more clearly separate the system prompt from the user prompt.”
“We need to automate this because I don’t think it’s feasible or scaleable to hire hordes of people and just tell them to find something,” says Leyla Hujer, the CTO and cofounder of AI safety firm Preamble, who spent six years at Facebook working on safety issues. The firm has so far been working on a system that pits one generative text model against another. “One is trying to find the vulnerability, one is trying to find examples where a prompt causes unintended behavior,” Hujer says. “We’re hoping that with this automation we’ll be able to discover a lot more jailbreaks or injection attacks.”