CO #13 The AI Security Arms Race: Latest Developments in Attacks and Defenses
Welcome to another issue of ContextOverflow.
As large language models (LLMs) become increasingly integrated into everything we build or consume, the arms race between attackers and defenders continues - as it has since, well, forever.
Let’s dive in!
In This Issue:
🥒 Pickle Rick Would Be Proud: Exploiting ML Models with Pickle Files
🧠 Project Naptime: Google's Deep Dive into LLM Offensive Capabilities
🗝️ Skeleton Key: A Universal Jailbreak for AI Models
🥪 The Sandwich Attack: Multilingual Mayhem for LLMs
🔗 Poisoned Knowledge: When RAG Goes Wrong
🔓 Abliteration: Uncensoring LLMs Without Retraining
🕵️ X AI Bot Found! A Cautionary (and Funny) Tale
🥒 Pickle Rick Would Be Proud: Exploiting ML Models with Pickle Files
Researchers from Trail of Bits have uncovered a new hybrid machine-learning exploitation technique dubbed "Sleepy Pickle." This method takes advantage of the widely used (and notoriously insecure) Pickle file format to compromise the ML models themselves.
Highlights:
Sleepy Pickle can surreptitiously modify ML models to insert backdoors, control outputs, or tamper with processed data.
The attack leaves no trace on disk and is highly customizable, making it difficult to detect.
Demonstrated attacks include generating harmful outputs, stealing user data, and phishing users through manipulated model responses.
Read Part 1 of the Trail of Bits blog post
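If you've never seen why pickle is dangerous in the first place, here's a minimal sketch of the underlying primitive: unpickling can execute arbitrary code via __reduce__. This is a generic illustration, not the Sleepy Pickle payload itself, which goes further and tampers with the model at load time.

```python
# Generic illustration of why pickle is unsafe: unpickling can run arbitrary
# code via __reduce__. This is NOT the Sleepy Pickle payload, just the
# primitive it builds on -- a real payload could patch model weights instead
# of printing a message.
import pickle

class Innocuous:
    def __reduce__(self):
        # Whatever this returns gets called at unpickling time.
        return (print, ("code executed during unpickling!",))

blob = pickle.dumps(Innocuous())
pickle.loads(blob)  # prints the message on load
```

The practical takeaway is the usual one: don't load pickled models from untrusted sources, and prefer safer serialization formats such as safetensors where you can.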
🧠 Project Naptime: Google's Deep Dive into LLM Offensive Capabilities
Google's Project Zero team has been hard at work probing the limits of AI safety. Their "Project Naptime" initiative aims to automate vulnerability research using large language models.
Key findings:
Providing LLMs with specialized tools (like debuggers and interpreters) significantly enhances their ability to find vulnerabilities.
The project achieved up to 20x improvement on the CyberSecEval2 benchmark compared to previous approaches.
Read the full Project Zero blog post
Pair it with CyberSec Politics’ post called Automated LLM Bugfinders for another point of view on the same topic.
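To make the "give the model tools" idea concrete, here's a hypothetical sketch of the kind of tool-dispatch loop such an agent runs. The tool names echo the post's architecture (code browser, Python sandbox, debugger, reporter); the llm() interface and everything else here is made up for illustration.

```python
# Hypothetical sketch of a tool-dispatch loop in the spirit of Naptime's
# architecture. The tool names echo the blog post; the llm() interface and
# the function names are made up for illustration.
from typing import Callable

def run_agent(llm: Callable[[list[dict]], dict],
              tools: dict[str, Callable[[str], str]],
              task: str, max_steps: int = 20) -> str | None:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(history)                    # model picks a tool and its arguments
        if action["tool"] == "reporter":         # model believes it found (or ruled out) a bug
            return action["args"]
        result = tools[action["tool"]](action["args"])  # e.g. code_browser, python, debugger
        history.append({"role": "tool", "content": result})
    return None                                  # gave up within the step budget
```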
🗝️ Skeleton Key: A Universal Jailbreak for AI Models
Microsoft researchers have identified a new type of jailbreak attack they're calling "Skeleton Key." This technique can potentially bypass all responsible AI guardrails built into a model through its training.
Key points:
Skeleton Key uses a multi-turn strategy to cause a model to ignore its safety constraints.
Once successful, the model becomes unable to distinguish between malicious and sanctioned requests.
The attack was effective against multiple prominent AI models, including GPT-4, Claude 3, and others.
Microsoft has implemented mitigations in their AI offerings and provides guidance for developers using Azure AI services.
Read Microsoft's full blog post on Skeleton Key
This video by Mark Russinovich (CTO of Microsoft Azure and creator of the Sysinternals tools) is well worth a watch - he briefly talks about this attack and much more.
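Microsoft's recommended mitigations are layered: roughly input filtering, system-prompt hardening, output filtering, and abuse monitoring. As a toy illustration of the input-filtering layer (my own naive heuristic, not Microsoft's actual mitigation, which relies on trained classifiers), you could screen turns for attempts to rewrite the model's behavior guidelines before they ever reach the model:

```python
# Toy heuristic screen for "rewrite your behavior guidelines"-style turns.
# My own illustrative check, not Microsoft's mitigation; real deployments rely
# on trained classifiers rather than keyword lists.
import re

SUSPICIOUS_PATTERNS = [
    r"update your (safety )?(guidelines|behavio(u)?r)",
    r"ignore (all |your )?(previous |safety )?(instructions|guardrails)",
    r"warning (prefix|instead of refusing)",
]

def looks_like_guardrail_override(turn: str) -> bool:
    return any(re.search(p, turn, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS)
```

Keyword lists like this are trivially bypassed, of course; the point is only to show where an input-filtering check sits in the pipeline.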
🥪 The Sandwich Attack: Multilingual Mayhem for LLMs
Researchers have introduced a novel attack vector called the "Sandwich Attack," which exploits the imbalanced representation of low-resource languages in multilingual LLMs.
How it works:
The attack creates a prompt with a series of five questions in different low-resource languages.
An adversarial question is hidden in the middle position.
This multilingual mixture can manipulate state-of-the-art LLMs into generating harmful responses.
The paper (linked below) includes a figure showing exactly how the attack prompt is constructed.
The study tested the attack on multiple prominent models, including GPT-4, Claude-3-OPUS, and Gemini Pro, demonstrating its effectiveness in bypassing safety mechanisms.
Read the full paper on the Sandwich Attack
My take: I tested the Sandwich Attack, aaaand it works. Maybe we should eliminate the problem by not having multilingual models in the first place? 🤔
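A cheaper (and more realistic) mitigation - my own idea, not the paper's - is to flag prompts that mix several languages before they ever hit the model. Here's a rough sketch using the third-party langdetect package:

```python
# Rough heuristic: flag prompts that mix many languages before they reach the
# model. Uses the third-party langdetect package (pip install langdetect);
# this is an illustrative defense-side idea, not something from the paper.
from langdetect import detect

def languages_in(prompt: str) -> set[str]:
    langs = set()
    for line in prompt.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            langs.add(detect(line))
        except Exception:
            pass  # line too short or otherwise undetectable
    return langs

def looks_like_sandwich(prompt: str, min_languages: int = 3) -> bool:
    return len(languages_in(prompt)) >= min_languages
```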
🔗 Poisoned Knowledge: When RAG Goes Wrong
Researchers have identified a new vulnerability as Retrieval-Augmented Generation (RAG) becomes increasingly popular for enhancing LLM capabilities. The "Poisoned-LangChain" (PLC) attack demonstrates how malicious actors could exploit external knowledge bases to induce harmful behaviors in AI models.
Key findings:
PLC leverages poisoned external knowledge bases to interact with LLMs, causing them to generate malicious dialogues.
The attack achieved high success rates across multiple Chinese large language models.
This research highlights the need for robust security measures in RAG implementations.
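The uncomfortable part is structural: RAG pastes retrieved text straight into the prompt, so anyone who can write to the knowledge base can effectively write instructions to your model. Here's a minimal, framework-free sketch (hypothetical names, not the PLC code) showing where that trust boundary sits:

```python
# Minimal RAG flow showing the trust boundary: retrieved documents are
# untrusted input, yet they land inside the prompt the model treats as
# trusted context. Hypothetical names throughout; not the PLC code.

def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    # Stand-in for a real vector search: naive keyword-overlap scoring.
    return sorted(
        knowledge_base,
        key=lambda doc: sum(word in doc.lower() for word in query.lower().split()),
        reverse=True,
    )[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n---\n".join(docs)  # a poisoned document ends up right here
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# llm_generate(build_prompt(user_query, retrieve(user_query, kb)))
```

Whoever can write to knowledge_base is effectively writing instructions to the model, which is exactly the gap PLC exploits.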
My thoughts:
I’ve been trying to do something similar with LangChain but had no success, so this was a nice surprise!
On another note, although no one can deny the importance of manual testing, it's also exciting to see a more systematic approach to testing LLMs in this context.
🔓 Abliteration: Uncensoring LLMs Without Retraining
A new technique called "abliteration" has been developed to remove censorship from language models without the need for retraining. This method identifies and removes the "refusal direction" within a model's residual stream.
Key points:
Abliteration can be applied to various open-source models, potentially uncensoring them.
The technique involves collecting activations on harmful and harmless prompts, computing the mean difference between them, and selecting the best candidate "refusal direction" to remove.
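The core operation is plain linear algebra. Here's a toy NumPy sketch of the idea, with random vectors standing in for residual-stream activations - a conceptual sketch, not the blog post's implementation:

```python
# Toy sketch of the "refusal direction" idea: mean activation difference plus
# projection. Random vectors stand in for residual-stream activations; this is
# not the blog post's implementation.
import numpy as np

rng = np.random.default_rng(0)
harmful_acts  = rng.normal(size=(128, 512))   # activations on refusal-triggering prompts
harmless_acts = rng.normal(size=(128, 512))   # activations on benign prompts

refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(activation: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # Remove the component of the activation that points along the direction.
    return activation - (activation @ direction) * direction
```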
Read the full blog post on Hugging Face
While I haven't tested this technique myself, it's a nice read with some hands-on code. Another ++ for more systematic approaches to testing LLMs. One big positive of these systematic methods is that we can quickly check for the low-hanging fruit and save the more intensive manual testing for the more complex scenarios.
🕵️ X AI Bot Found! A Cautionary (and Funny) Tale
A GIF shared on X showed an AI bot being discovered and manipulated into revealing its system prompt and generating some funny content.
“Don’t believe everything you hear” is slowly turning into “Don’t believe everything you {read|see|hear|watch|feel|think|taste|smell|dream|decode from alien radio signals}”
Call to Action
Join us next week as we continue to explore the cutting edge of AI and cybersecurity. Until then, stay curious and stay secure!
ContextOverflow is committed to fostering a deeper understanding of AI security. If you found this newsletter valuable, please consider sharing it with your network.