The latest threat for companies using large language model (LLM) AI software to replace human staff is the software’s innate gullibility. An LLM can be likened to the cowardly bank clerk in an old Western hold-up who not only opens the back door for the bad guys but also tells them the combination to the safe.
The methods for coaxing LLMs into naively disclosing the keys to the corporate kingdom are known as ‘LLM jailbreak’ techniques. Palo Alto Networks Unit 42 researchers have named one such jailbreak “Bad Likert Judge”.
LLMs have become increasingly popular due to their ability to generate human-like text and assist with a wide variety of tasks. These models are often trained with safety guardrails to prevent them from producing potentially harmful or malicious responses. LLM jailbreak methods are techniques used to bypass these safety measures, allowing the models to generate content that would otherwise be restricted.
“By strategically crafting a series of prompts, an attacker can manipulate the model’s understanding of the conversation’s context. They can then gradually steer it toward generating unsafe or inappropriate responses that the model’s safety guardrails would otherwise prevent,” say the researchers.
The LLM jailbreak uses a “many-shot” strategy
This particular jailbreak uses what is known as a Likert-scale prompt, a rating format that offers five possible answers to a statement or question and lets respondents indicate their strength of agreement or feeling, from positive to negative. Using a many-shot attack strategy, the threat actors simply send the LLM many rounds of prompts before posing the final harmful question.
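To make the mechanics concrete, the following is a minimal, illustrative sketch of how a multi-turn, many-shot prompt sequence might be assembled and replayed against a chat-style LLM endpoint when an organisation red-teams its own model’s guardrails. The endpoint URL, model name and placeholder prompt texts are hypothetical and merely stand in for the technique’s stages; no actual jailbreak content is reproduced here.

```python
# Illustrative sketch only: assembling a multi-turn ("many-shot") prompt
# sequence for guardrail testing. Endpoint, model name and prompt texts
# below are hypothetical placeholders, not taken from the Unit 42 research.
import requests

API_URL = "https://llm.example.internal/v1/chat/completions"  # hypothetical endpoint
MODEL = "internal-test-model"                                  # hypothetical model name

# Placeholder turns standing in for the technique's stages: the model is first
# framed as a Likert-scale "judge" of response harmfulness, then walked through
# several innocuous rounds before the final request.
turns = [
    "[Prompt framing the model as a judge scoring responses on a 1-5 Likert scale]",
    "[Round 1: ask the model to score an innocuous example response]",
    "[Round 2: ask for examples illustrating low versus high scores]",
    "[Final round: the harmful request the earlier rounds were building toward]",
]

messages = []
for turn in turns:
    messages.append({"role": "user", "content": turn})
    resp = requests.post(
        API_URL,
        json={"model": MODEL, "messages": messages},
        timeout=60,
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    # Keep the model's reply in context so each later turn builds on the earlier ones.
    messages.append({"role": "assistant", "content": reply})
    print(f"--- turn ---\n{reply}\n")
```

The point of the structure is that each round’s reply is fed back into the conversation, so by the final prompt the model’s sense of the conversation’s context has already been shaped by the preceding turns.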
Despite its simplicity, Palo Alto Networks says that this approach has proven highly effective in bypassing internal LLM guardrails. The researchers tested the Bad Likert Judge technique across a broad range of categories against six state-of-the-art text-generation LLMs. Their results reveal that this technique can increase the attack success rate by over 60 percent compared to plain attack prompts.
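To show what an attack-success-rate comparison of this kind measures, here is a minimal sketch using made-up numbers: the fraction of test prompts for which the guardrails fail is tallied once for plain attack prompts and once for the jailbreak variant, and the two figures are compared. The per-prompt outcomes below are illustrative assumptions, not Unit 42’s data.

```python
# Sketch of an attack success rate (ASR) comparison. All figures are invented
# for illustration; they are not the published Unit 42 results.
def attack_success_rate(results: list[bool]) -> float:
    """Fraction of attempts judged to have produced restricted content."""
    return sum(results) / len(results)

# Hypothetical per-prompt outcomes for one test category (True = guardrail bypassed).
plain_prompt_results = [False] * 95 + [True] * 5        # baseline: 5% ASR
bad_likert_judge_results = [False] * 30 + [True] * 70   # jailbreak variant: 70% ASR

baseline = attack_success_rate(plain_prompt_results)
jailbreak = attack_success_rate(bad_likert_judge_results)
print(f"Plain prompts ASR:    {baseline:.0%}")
print(f"Bad Likert Judge ASR: {jailbreak:.0%}")
print(f"Increase:             {jailbreak - baseline:.0%} points")
```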
The Palo Alto Networks research follows on from widespread concern surrounding AI, which has already gained a reputation for making errors and even entirely fabricating information, a failing referred to as “hallucinating”.
The Palo Alto Networks Unit 42 researchers who identified the new threat are Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.