• Menu
  • Skip to main content
  • Skip to primary sidebar

The Cyber Security News

Latest Cyber Security News

Header Right

  • Latest News
  • Vulnerabilities
  • Cloud Services
new ai jailbreak method 'bad likert judge' boosts attack success

New AI Jailbreak Method ‘Bad Likert Judge’ Boosts Attack Success Rates by Over 60%

You are here: Home / General Cyber Security News / New AI Jailbreak Method ‘Bad Likert Judge’ Boosts Attack Success Rates by Over 60%
January 3, 2025

Cybersecurity researchers have shed light on a new jailbreak technique that could be used to get past a large language model’s (LLM) safety guardrails and produce potentially harmful or malicious responses.

The multi-turn (aka many-shot) attack strategy has been codenamed Bad Likert Judge by Palo Alto Networks Unit 42 researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky.

“The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale, a rating scale measuring a respondent’s agreement or disagreement with a statement,” the Unit 42 team said.

✔ Approved Seller From Our Partners
Mullvad VPN Discount

Protect your privacy by Mullvad VPN. Mullvad VPN is one of the famous brands in the security and privacy world. With Mullvad VPN you will not even be asked for your email address. No log policy, no data from you will be saved. Get your license key now from the official distributor of Mullvad with discount: SerialCart® (Limited Offer).

➤ Get Mullvad VPN with 12% Discount


Cybersecurity

“It then asks the LLM to generate responses that contain examples that align with the scales. The example that has the highest Likert scale can potentially contain the harmful content.”

The explosion in popularity of artificial intelligence in recent years has also led to a new class of security exploits called prompt injection that is expressly designed to cause a machine learning model to ignore its intended behavior by passing specially crafted instructions (i.e., prompts).

One specific type of prompt injection is an attack method dubbed many-shot jailbreaking, which leverages the LLM’s long context window and attention to craft a series of prompts that gradually nudge the LLM to produce a malicious response without triggering its internal protections. Some examples of this technique include Crescendo and Deceptive Delight.

The latest approach demonstrated by Unit 42 entails employing the LLM as a judge to assess the harmfulness of a given response using the Likert psychometric scale, and then asking the model to provide different responses corresponding to the various scores.

In tests conducted across a wide range of categories against six state-of-the-art text-generation LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA revealed that the technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average.

These categories include hate, harassment, self-harm, sexual content, indiscriminate weapons, illegal activities, malware generation, and system prompt leakage.

“By leveraging the LLM’s understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model’s safety guardrails,” the researchers said.

“The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models. This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.”

Cybersecurity

The development comes days after a report from The Guardian revealed that OpenAI’s ChatGPT search tool could be deceived into generating completely misleading summaries by asking it to summarize web pages that contain hidden content.

“These techniques can be used maliciously, for example to cause ChatGPT to return a positive assessment of a product despite negative reviews on the same page,” the U.K. newspaper said.

“The simple inclusion of hidden text by third-parties without instructions can also be used to ensure a positive assessment, with one test including extremely positive fake reviews which influenced the summary returned by ChatGPT.”

Found this article interesting? Follow us on Twitter  and LinkedIn to read more exclusive content we post.


Some parts of this article are sourced from:
thehackernews.com

Previous Post: «ldapnightmare poc exploit crashes lsass and reboots windows domain controllers LDAPNightmare PoC Exploit Crashes LSASS and Reboots Windows Domain Controllers
Next Post: AI in Cybersecurity: Learn What Works and What Doesn't (Webinar Inside)Dec 30, 2024Online Security / WebinarJoin our webinar, "AI in Cybersecurity: Separating Hype from Impact," to uncover insights from 200 leaders on optimizing AI for security operations and vulnerability management. Register now! ai in cybersecurity: learn what works and what doesn't (webinar»

Reader Interactions

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Primary Sidebar

Report This Article

Recent Posts

  • Zero-Click Agentic Browser Attack Can Delete Entire Google Drive Using Crafted Emails
  • Critical XXE Bug CVE-2025-66516 (CVSS 10.0) Hits Apache Tika, Requires Urgent Patch
  • Chinese Hackers Have Started Exploiting the Newly Disclosed React2Shell Vulnerability
  • Intellexa Leaks Reveal Zero-Days and Ads-Based Vector for Predator Spyware Delivery
  • “Getting to Yes”: An Anti-Sales Guide for MSPs
  • CISA Reports PRC Hackers Using BRICKSTORM for Long-Term Access in U.S. Systems
  • JPCERT Confirms Active Command Injection Attacks on Array AG Gateways
  • Silver Fox Uses Fake Microsoft Teams Installer to Spread ValleyRAT Malware in China
  • ThreatsDay Bulletin: Wi-Fi Hack, npm Worm, DeFi Theft, Phishing Blasts— and 15 More Stories
  • 5 Threats That Reshaped Web Security This Year [2025]

Copyright © TheCyberSecurity.News, All Rights Reserved.