Building Safer Generative AI from the Inside Out: Reinforce the Core, Not Just the Armor

Published on

May 27, 2025

Improving Model Safety Along with External Guardrails (API, LLM Guardrails) is also equally important

‍

Generative AI is sweeping into enterprise and society, bringing unprecedented capabilities — and new safety risks. In response, many organizations wrap AI models in layers of content filters, policy guidelines, and monitoring APIs. But a weak model in bulletproof armor is still a peanut — easily blown away by a breeze. No matter how strong the outer defenses, a brittle or misaligned core will eventually crack under pressure. Lasting safety must be engineered from the inside out, with robustness and alignment baked into the model itself, not just bolted on as an afterthought.

‍

The “Peanut in Armor” Problem: Outer Safeguards Aren’t Enough

‍

At first glance, surrounding an AI with strong guardrails — input filters, rule-based moderators, refuse-response triggers — seems like a solution. These outer safeguards are akin to bulletproof armor, designed to absorb or deflect harmful queries and outputs. They are important pieces of a safety strategy, catching many simple misuses. However, if the AI model underneath is fundamentally weak or misaligned, determined attackers (or even well-meaning users in tricky contexts) will find ways to pierce that armor. In practice, predefined rules and filters can be “easily bypassed by chained or multi-turn prompts” (medium.com), meaning that a conversation cleverly structured in steps can slip past the defenses. Think of a peanut hidden in a steel vault — if a breeze (i.e. an unforeseen input) finds a crack, the peanut is so light it blows right out. In AI terms, a sufficiently brittle core will eventually produce harmful or flawed outputs once someone figures out how to exploit its weaknesses.

‍

Real-world red-teaming has shown that even the most sophisticated AI guardrails cannot compensate for a poorly aligned base model. For instance, prompt-based attacks have achieved alarming success rates against top-tier models: multi-round jailbreaking methods boast over 90% success in bypassing safety filters with only a few dialog turns (medium.com). Similarly, adaptive attacks that encode malicious instructions (even in other languages or formats) can trick models into ignoring external safety instructions. One recent evaluation noted that simple content policies and keyword filters are effective only against naive attacks, and that more nuanced exploits easily evade them (medium.com). In enterprise settings — where stakes are high — relying on outer defense alone is like relying on an eggshell to protect the peanut inside. No amount of outer armor will save a fragile core.

‍

Why Safety Must Be Built In, From Architecture to Training

‍

True AI safety has to be engineered from the inside out. This starts with choices in the model’s architecture and design. Developers can incorporate safety at the architectural level, for example by sandboxing certain capabilities, adding interpretable layers, or designing modular systems where a “guardian” model oversees the primary model’s outputs. A well-designed architecture can make it easier to inject safety checks within the generation process, rather than solely as post-processing.

‍

The training data is even more critical. If you train a model on toxic, biased, or illicit material, you are effectively building a flawed core from the start. No bolt-on filter can reliably unteach what the model has learned. Data curation and preprocessing are thus fundamental: filtering out extreme hate speech, highly biased representations, or illegal content (like CSAM) from training data is a non-negotiable step. A striking example emerged recently when researchers discovered that the LAION dataset (used to train popular image generators) contained over 3,200 images of child sexual abuse material (pbs.org). Those images “made it easier for AI systems to produce realistic and explicit imagery of fake children” and even to transform photos of real teens into nudes (pbs.org). This kind of dangerous capability is baked into the model’s weights if such data isn’t rigorously scrubbed. Enterprises and policymakers must recognize that quality of training data = quality of model behavior. Demanding strong data curation — including bias audits and redaction of toxic or unlawful content — is a first step to safer AI cores.

‍

Beyond data, alignment tuning is essential to fortify the model’s core. Techniques like Reinforcement Learning from Human Feedback (RLHF) have proven effective at adjusting a model’s behavior to be more helpful and harmless. By training the model on demonstrations and feedback that emphasize ethical, non-toxic responses, developers essentially harden the core. Fine-tuning on safety-specific datasets (e.g. making the model refuse disallowed requests, or injecting a “constitution” of principles as done in Constitutional AI) further internalizes guardrails within the model. For example, OpenAI and others have reported that RLHF greatly reduces a model’s propensity to produce hate speech or encouragement of violence (cdn.openai.com). That said, RLHF is not a panacea — researchers caution that while it improves resilience, models remain vulnerable to novel attacks not covered in training (medium.com). In other words, alignment tuning must be continually updated and combined with other measures, but it’s still far better to have a model that tries to be safe than one that has no concept of safety until after it speaks.

‍

Finally, rigorous evaluation and red-team testing should be integral to model development. This means attacking your own model before bad actors do — probing it with the most diabolical prompts, looking for blind spots and failure modes. Safety shouldn’t be a one-time checklist, but an ongoing “societal harm” evaluation process. New benchmarks and leaderboards are emerging to quantify these risks. For instance, the DecodingTrust project introduced a comprehensive safety evaluation across eight dimensions (toxicity, bias, privacy, etc.) and found that even the best models have weaknesses (huggingface.co). (Notably, they observed that GPT-4 was more vulnerable in some safety tests than GPT-3.5 — a reminder that more capable doesn’t automatically mean more secure huggingface.co.) No single model today excels in all safety aspects (huggingface.co), which underscores why continuous testing and improvement of the core model is necessary. If a model shows high toxicity in certain scenarios or leaks private data when tricked, those issues must be addressed at the model level (e.g. by further tuning or architectural changes) rather than hoping an external filter catches it.

‍

Lessons from Red-Teaming: When the Core Is Brittle, Things Go Wrong

‍

What kinds of failures occur when a model’s core safety is lacking? Consider a few scenarios that enterprise AI teams and policymakers are increasingly worried about:

Jailbreaking and Unauthorized Instructions: If the base model has not truly learned to refuse dangerous or disallowed requests, users can exploit this. By using clever prompts or multi-step dialogues, adversaries have gotten models to output hate speech, extremist propaganda, even instructions for illicit activities. In fact, researchers recently demonstrated a prompt-based attack that succeeded in bypassing a leading model’s restrictions in 78% of trials (medium.com). Another approach used a small, openly released model to nudge a larger aligned model into misbehaving — achieving 99% success in making the larger model ignore its safety programming (medium.com). These numbers are alarming. They prove that if the core model isn’t robustly aligned, attackers will find a way around any outer filter. Enterprises deploying AI chatbots must assume that someone will eventually jailbreak them. The only reliable solution is to have a core that fundamentally does not want to produce harmful content, no matter how it’s prompted.
Illicit Content Generation (e.g. CSAM): As mentioned earlier, a model trained on problematic data can produce nightmarish outputs. If an image generator’s core model has seen thousands of CSAM images, it can learn patterns to recreate or synthesize such content. That puts companies at legal and moral risk, even if they deploy post-processing filters. History shows those filters can miss things — and once an image (or text) is generated, the damage is done. Similarly, an LLM that was trained on detailed instructions for making weapons or on violent extremist forums might regurgitate that knowledge on demand. We’ve already seen early base models willingly spew disallowed content before alignment was added (cdn.openai.com). The lesson is clear: preventative safety (keeping the sludge out of the training data, and actively training the model to avoid these behaviors) is far more effective than trying to clean up outputs after the fact. No enterprise wants a headline that their AI system generated illegal content. The only way to truly avoid that is to ensure the model itself is conscientiously constrained.
Bias and Discrimination: Misaligned base models tend to reflect the biases in their training data — often with subtle but harmful effects. An external filter might catch outright slurs, but it won’t catch latent bias in, say, how the model rates a person’s competence or describes different demographic groups. Recent research has shown that even advanced LLMs exhibit covert racial biases. For example, one study found that a language model evaluating job candidates consistently gave significantly lower aptitude and employability scores to candidates who used African American Vernacular English (AAVE), calling them “stupid” or “lazy,” and ranking them for poorer-paying jobs (theguardian.com). Such bias, if left in the model, could translate into discriminatory outcomes in hiring tools, customer service bots, or anywhere the AI interacts with human attributes. This is not a surface-level problem — it stems from the model’s internal representations. Only by aligning the core model’s values (through careful training and fine-tuning on balanced, bias-checked data) can we hope to root out these prejudices. Enterprises should demand that models pass robust fairness tests and bias evaluations as a prerequisite for deployment.

In all these cases — jailbreaks, illicit content, bias — the common thread is that a misaligned core will eventually undercut any facade of safety. We can add as well the issues of misinformation and hallucinations of harmful advice: a model that isn’t trained to prioritize truth and safety internally might confidently output false or dangerous information if asked. The outer layers (like a moderation API) might catch the obviously flagged words, but they can’t rewind and unsay a harmful falsehood that slipped through. As the saying goes, “garbage in, garbage out” — if the model’s core has garbage, the outputs will have garbage, somewhere, somehow (theguardian.com).

‍

Demanding Enterprise-Grade Safety at the Model Level

‍

For enterprises and policymakers, the takeaway is straightforward: insist on safety at the model level, not just the system level. If you are adopting a generative AI solution, ask hard questions about how the model itself was built for safety:

Was the training data vetted and filtered for extreme content and bias? (If not, expect the model to have learned some undesirable things.)
Has the model undergone alignment tuning (like RLHF or similar) specifically for harmlessness and ethical behavior? If so, what scenarios were covered — and how will new threats be addressed as they emerge?
What do independent evaluations say about the model’s safety? Look for participation in safety benchmarks or “red team” exercises. For example, the LLM Safety Leaderboard provides risk scores (covering jailbreak susceptibility, toxicity, bias, etc.) for various models (enkryptai.com). If a model ranks poorly, or if (worse) the vendor cannot provide such evaluation data, that’s a red flag. An enterprise-grade model should be able to show low percentages on jailbreak tests (meaning most attacks fail) and low toxicity/bias in standard benchmarks. Remember that even 10–20% successful attack rate is problematic when scaled to millions of queries — so aim for models that are pushing those numbers down continually (enkryptai.com).

Policymakers, too, should frame regulations around core model safety. Recent policy moves are indeed trending this way. The U.S. White House’s 2023 Executive Order on safe AI, and the EU’s upcoming AI Act, both emphasize rigorous pre-deployment risk assessment and alignment for “high-risk AI systems” (huggingface.co). These policies implicitly recognize that slapping a filter on a dangerous model is not enough; the AI itself must meet certain safety standards before it ever sees real users. We may soon see requirements for companies to conduct adversarial testing on their models and share the results (or even meet specific thresholds) as part of compliance. Such steps are welcome, because they shift the focus to reinforcing the core rather than just trust in a fancy wrapper.

‍

From Hollow Shells to Hardened Cores: A Call to Action

‍

The message to the AI industry is urgent but constructive: it’s time to fortify our models from within. Imagine generative AI as a fortress. You can build high walls and dig moats (the external safeguards), but if the center is empty or the foundation is cracked, the fortress won’t hold. Truly responsible deployment of generative AI means reinforcing the core, not just wrapping it in armor.

‍

For AI developers: prioritize safety in model architecture and training as early as possible. It should be as fundamental as optimizing accuracy or latency. Explore architectures that inherently reduce risk (such as two-step “think then answer” pipelines that can be audited, or models with built-in content understanding that can self-moderate to a degree). Invest in cleaning and curating data — it’s tedious, yes, but it directly translates to a safer model. Continuously apply alignment techniques and test their limits; when new jailbreaks or biases are found, treat it as a bug in the model to be fixed, not merely a PR problem to be papered over.

‍

For enterprises: demand transparency and rigor from AI vendors about model safety. If a provider offers you a powerful generative model but cannot demonstrate what’s been done to align it with your ethical standards and policies, think twice. Pushing for “show me the safety work” will incentivize the market to compete on core safety, not just capability. It will also protect your organization from downstream scandals — for example, an AI assistant that suddenly spews out a defamatory or biased remark can cause real brand and legal repercussions. Enterprise-grade AI means safety-grade AI at the core.

‍

For policymakers and regulators: continue to hold companies accountable for the behavior of their models, not just their disclaimers. Consider establishing independent auditing bodies or certifications for model safety (much like security certifications), which evaluate models on a suite of red-team attacks and harmful content generation tests. This would create a valuable seal of approval for models that are robustly aligned — a market signal that goes beyond marketing hype.

‍

In conclusion, the era of treating AI safety as an “add-on” is over. As generative AI moves from research labs into the fabric of our society, we can no longer afford hollow-center models parading in knight’s armor. The only way to ensure these systems are worthy of our trust is to bake safety into their very core — through thoughtful design, diligent training, and relentless testing. A peanut in armor is a disaster waiting to happen; a strong core, in contrast, will stand firm even when the armor is tested or breached. It’s time to fortify the heart of our AI models. The promise of generative AI is immense — but to realize it responsibly, we must make safety intrinsic to the model. No more peanuts in armor; let’s build the bulletproof peanut instead.

‍

Sources:

Xu, C., & Bo Li. “An Introduction to AI Secure LLM Safety Leaderboard.” Hugging Face Blog, 2023. (huggingface.co)
Ahmed, S. “Introduction to The Dark Art of LLM Jailbreaking.” Medium, 2023. (medium.com)
Ahmed, S. “Introduction to The Dark Art of LLM Jailbreaking.” Medium, 2023. (medium.com)
O’Brien, M., & Hadero, H. “AI image-generators are being trained on explicit photos of children.” PBS NewsHour/Associated Press, Dec 20, 2023.(pbs.org)
Enkrypt LLM Safety Leaderboard — Explanation of risk scores (2025) (enkryptai.com)
Milmo, D. “As AI tools get smarter, they’re growing more covertly racist, experts find.” The Guardian, Mar 16, 2024. (theguardian.com)

‍

Meet the Writer

Prashanth H

Building Safer Generative AI from the Inside Out: Reinforce the Core, Not Just the Armor

Improving Model Safety Along with External Guardrails (API, LLM Guardrails) is also equally important

The “Peanut in Armor” Problem: Outer Safeguards Aren’t Enough

Why Safety Must Be Built In, From Architecture to Training

Lessons from Red-Teaming: When the Core Is Brittle, Things Go Wrong

Demanding Enterprise-Grade Safety at the Model Level

From Hollow Shells to Hardened Cores: A Call to Action

Sources:

More articles

Red Team Base and Instruct Models: Two Faces of the Same Threat

America’s AI Action Plan: Racing to Stay Ahead

Safeguarding User Privacy in AI Applications: PII Testing and Protection with Enkrypt AI