Red Team Base and Instruct Models: Two Faces of the Same Threat


As generative AI systems become deeply embedded in enterprise workflows, understanding their security posture is no longer optional, it’s foundational. While most security evaluations today focus on instruct-tuned models (like ChatGPT or Claude), the real risk often lies deeper: in the base models they’re built on. Red teaming base models versus instruct models reveals a critical gap in AI safety. As models evolve, so do the threat surfaces. The question isn’t just “How do we red team?” but “Which version of the model are we testing?” This post explores the differences between the two, why both need to be tested, and what enterprises should consider when evaluating the true resilience of their AI stack.
Base Models: The Unfiltered Brain
Threat Model
- No safety alignment.
- Full access to model weights/API.
- Unbounded generation behavior.
Pros:
- Raw Behavior Visibility: Reveals unfiltered capabilities, biases, and unsafe completions the model can produce before any alignment layers are applied.
- Better for Jailbreak Discovery: Shows how vulnerable the model is without instruction-following safety layers, useful for indirect prompt injection research.
- Useful for Pre-Tuning Security Audits: Allows model developers to fix foundational issues before applying fine-tuning or reinforcement learning.
Cons:
- Not Representative of End-User Risk: Most real-world deployments use instruct-tuned models, so base model red-teaming may miss how the final product behaves.
- Difficult Prompt Design: Base models don’t follow instructions well, so crafting attack prompts becomes more of a guessing game.
Effective Attacks
- Direct prompts (no evasion needed).
- Capability probing (e.g., “How to…”).
Base models are what attackers use if weights are leaked or open-sourced
Instruct-Tuned Models: Polite but Breakable
Threat Model
- Refusal logic and safety alignment in place.
- Access via API or UI.
- Obedient—sometimes too obedient.
Pros:
- Closer to Production Risk: Reflects how the model behaves when deployed in chatbots, agents, and RAG systems.
- Better for Compliance & Safety Benchmarks: Aligns with regulatory frameworks (e.g., OWASP, NIST AI RMF) which require assessing deployed behavior.
- More Realistic Attacks: You can test jailbreaks, prompt injections, policy violations, and tool misuse under realistic usage patterns.
Cons:
- Obfuscated Root Causes: Failures may be masked by fine-tuning, making it harder to trace back to base model issues.
- Safety Illusions: Instruct models can “appear” safer due to refusal responses, but may still be manipulable with adversarial inputs.
- More Guardrails to Circumvent: Makes the red-teaming process slower and more complex (though often more meaningful).
Effective Attacks
- Framing (roleplay, hypotheticals).
- Obfuscation (spacing, language tricks).
- Meta-instruction overrides (“ignore previous instructions…”).
These models reflect real-world usage
Why Red Team Both?
Red teaming only one model misses half the picture:
Safety is safety. Whether the vulnerability stems from the raw model or slips past the alignment layer, it’s still a breach.
Enkrypt AI Insight: Fine-Tuning Can Undermine Safety
In our published research at Enkrypt AI, we fine-tuned an aligned model for a high-stakes use case: a security analyst assistant. When we red teamed the base version of that fine-tuned model, we found it had lost all safety alignment. It readily generated responses it previously refused. The domain-specific tuning unintentionally overwrote the model’s ethical constraints.
This finding matches broader research: fine-tuning—even on safe content—can erode safety behaviors and amplify vulnerabilities. It’s not enough to align once. Every change needs testing. https://arxiv.org/html/2404.04392v1
Conclusion
Don’t pick between red teaming the base model or the instruct-tuned version, do both. One reveals what the model can do, the other shows how well it’s prevented from doing it. This dual lens is critical to securing any AI system at scale.
If you’re building or deploying LLMs, ask yourself not just “Is my model safe?”—but “Is it still safe after tuning?”
And then, red team both to find out.
Want help building your red teaming pipeline? Contact Enkrypt AI to learn how we uncover and patch jailbreak vectors across foundation and fine-tuned models.