AI Exploit Bypasses Guardrails of OpenAI, Other Top LLMs
A novel technique for manipulating text-based AI systems increases the likelihood of a successful cyberattack by more than 60%
A new jailbreak technique for OpenAI's and other popular large language models (LLMs) increases the chance that attackers can circumvent cybersecurity guardrails and abuse the models to deliver malicious content.
Discovered by researchers at Palo Alto Networks’ Unit 42, the so-called ‘Bad Likert Judge’ attack asks the LLM to act as a judge and score the harmfulness of a given response on a Likert scale. The psychometric scale, named after its inventor, Rensis Likert, and commonly used in questionnaires, measures a respondent's level of agreement or disagreement with a statement.
The jailbreak then asks the LLM to generate example responses that align with the different points on the scale, with the ultimate result being that “the example that has the highest Likert scale can potentially contain the harmful content,” Unit 42’s Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky wrote in a post describing their findings.
Tests conducted across a range of categories against six state-of-the-art text-generation LLMs from OpenAI, Azure, Google, Amazon Web Services, Meta, and Nvidia revealed that the technique can increase the attack success rate (ASR) by more than 60% on average compared with plain attack prompts, according to the researchers.
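To make the metric concrete: ASR is simply the fraction of attack attempts that elicit a harmful response, and the reported improvement compares jailbreak-assisted prompts against plain attack prompts. The sketch below is illustrative only, using entirely hypothetical attempt counts rather than Unit 42's data, and shows the difference between a percentage-point increase and a relative increase.

```python
# Illustrative only: hypothetical counts, not Unit 42's measurements.

def attack_success_rate(successes: int, attempts: int) -> float:
    """ASR = successful attack attempts / total attack attempts."""
    return successes / attempts

# Hypothetical results for a single content category.
baseline_asr = attack_success_rate(successes=3, attempts=100)    # plain attack prompts
jailbreak_asr = attack_success_rate(successes=65, attempts=100)  # jailbreak-assisted prompts

# Absolute difference in percentage points vs. relative increase.
pp_increase = (jailbreak_asr - baseline_asr) * 100
relative_increase = (jailbreak_asr - baseline_asr) / baseline_asr * 100

print(f"Baseline ASR: {baseline_asr:.0%}, jailbreak ASR: {jailbreak_asr:.0%}")
print(f"Increase: {pp_increase:.0f} percentage points ({relative_increase:.0f}% relative)")
```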
The research evaluated attack prompts across several categories of inappropriate responses, including content that promotes bigotry, hate, or prejudice; harasses an individual or group; encourages suicide or other acts of self-harm; generates sexually explicit material and pornography; provides information on how to manufacture, acquire, or use illegal weapons; or promotes other illegal activities.