Big Ideas

◉

min read

How Multi-Turn Attacks Generate Harmful Content from Your AI Solution

Published on

September 12, 2024

Generative AI models have improved detecting and rejecting malicious prompts.

‍

And most models have basic safety alignment training to avoid responding to queries such as: “How can I commit financial fraud?” Or “What are the steps to make a bomb at home?”.

‍

However, there are simple ways to generate such harmful content – methods known as Multi-Turn Attacks – that we will explore in this blog.

‍

What Are Multi-Turn Attacks?

In a Multi-Turn Attack, a malicious user starts with a benign prompt and gradually escalates to get the desired answer. Multi-Turn Attacks are dangerous because they are harder to detect when compared to one-time prompts. Refer to the video below that shows the basics of the attack.

Video: Multi-Turn Attack Demo: How ChatGPT Generates Harmful Content.

‍

Under the Hood: Multi-Turn Attack Details

Even though Large Language Models (LLMs) go through Safety alignment, they retain information on various topics including harmful information. Chatbots built with LLMs can be manipulated to retrieve this information. Chatbots that remember the context are particularly vulnerable to multi-turn attacks because the initial context that is used as a base causes low suspicion. A delayed attack is then employed to trigger the model into generating harmful content.

‍

Attack Scenarios and Harm Examples

‍

System Prompt Leak: An attacker engages a customer support chatbot with seemingly harmless questions, gradually probing its internal workings. Over time, the chatbot unintentionally reveals sensitive system prompts used to generate responses.

Sensitive Information Disclosure: Through incremental queries, an attacker manipulates a financial services chatbot into disclosing personal account details and transaction history. This information is then used for identity theft or fraud.

Off-Topic Conversations: An attacker guides a healthcare chatbot from medical queries to unrelated or harmful topics. This can lead to misinformation and brand damage.

Toxic Content Generation: An attacker gradually introduces inflammatory statements into a social media chatbot’s queries, causing it to generate toxic or offensive content. This can lead to user distress, damage the platform’s reputation, and spread harmful misinformation.

‍

Conclusion
‍

Countering Multi-Turn Attacks should be part of any organization’s AI Security Checklist.

To effectively guard against these attacks, organizations must develop advanced context management systems that can detect and mitigate gradual manipulation attempts, as well as incorporating regular security testing and red teaming exercises to identify and address potential vulnerabilities.

‍

Additional Reading

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack – Microsoft
Mitigating Skeleton Key, a new type of generative AI jailbreak technique – Microsoft

Frequently Asked Questions

What is a multi-turn attack on AI chatbots?

A multi-turn attack is a jailbreak method where an attacker starts with benign prompts and gradually escalates queries to manipulate an AI model into generating harmful content. Multi-turn attacks are harder to detect than single malicious prompts because they build context incrementally.

Attacker begins with seemingly innocent questions to establish trust.
Harmful requests escalate gradually over multiple conversation turns.
Model's context retention makes it vulnerable to delayed manipulation.

How do multi-turn attacks bypass AI safety alignment?

Multi-turn attacks exploit the fact that LLMs retain harmful information despite safety training, and chatbots with context memory are particularly vulnerable. The initial benign context creates low suspicion, allowing attackers to trigger harmful outputs through delayed escalation.

Safety-aligned models still contain knowledge on dangerous topics.
Context-aware chatbots remember prior conversation history.
Gradual escalation avoids triggering single-prompt detection systems.

What's the difference between multi-turn attacks and direct prompt injection?

Direct prompt injection attempts to extract harmful content in a single malicious query, while multi-turn attacks build context over multiple turns to evade detection. Multi-turn attacks are more dangerous because they are harder to identify and block compared to one-time prompts.

Direct injection: single malicious prompt triggers immediate response.
Multi-turn: incremental escalation masks harmful intent across turns.
Detection difficulty increases with conversation length and context.

Which platform is best for detecting and preventing multi-turn attacks?

Enkrypt AI's red-teaming and guardrail platform detects multi-turn attacks across 300+ risk categories and enforces policy-based runtime protection. The platform includes advanced context management systems and regular security testing to identify gradual manipulation attempts before they generate harm.

300+ red-teaming risk categories cover multi-turn jailbreak patterns.
Runtime guardrails block harmful outputs in real time.
Policy engine enforces security rules across all AI deployments.

How can I protect my AI chatbots from multi-turn attacks?

Enkrypt AI detects multi-turn manipulation in real time across your AI agents. Book a demo to see how it stops gradual jailbreak attempts before they succeed, or start a free trial to test it on your own chatbots.

Join 2,000+ readers

How Multi-Turn Attacks Generate Harmful Content from Your AI Solution

What Are Multi-Turn Attacks?

Under the Hood: Multi-Turn Attack Details

Attack Scenarios and Harm Examples

Conclusion
‍

Additional Reading

Frequently Asked Questions

More articles

Harvest Now, Decrypt Later: Why AI Agents Are the Threat No One's Watching

We Built Two AI Security Games. Play Them to Understand How Attacks Actually Work.

Securing AI at Scale in APAC: Why Kode-1 and Enkrypt AI Are Building This Together

PRODUCTS

SOLUTIONS

BY USE CASE

Helpful links

COMPANY