Industry Trends

Why LLM Safety Leaderboards Matter: Shortcomings of Azure Foundry’s Safety Scores

Published on
July 16, 2025
4 min read

Generative AI is everywhere now. From customer service chatbots to code generators, LLMs are making decisions that affect millions of people daily. But here’s the problem: we’re still figuring out how to measure if these systems are actually safe.

That’s where LLM safety leaderboards come in. These platforms promise to evaluate and rank AI models based on how well they avoid harmful outputs. It sounds great in theory, but the reality is more complicated.

The High Stakes of AI Safety

Traditional software testing focuses on whether code works correctly and runs fast. But LLMs introduce entirely new risks. They can generate harmful content, show bias, leak sensitive information, or get tricked into bypassing safety measures.

The consequences are real. A biased customer service bot doesn’t just provide bad service; it can destroy a company’s reputation and create legal problems. A code generator that produces insecure code can compromise entire systems. A mental health support tool that misses signs of self-harm could cost lives.

Safety leaderboards tackle these issues by providing standardized ways to evaluate models. They let us compare different models fairly, encourage developers to prioritize safety, give organizations transparency when choosing models, and set benchmarks that push the whole industry forward.

But here’s the catch: these leaderboards are only as good as what they actually test. A leaderboard that gives false confidence in a model’s safety can be more dangerous than no evaluation at all.

Azure AI Foundry’s Current Approach

Azure AI Foundry’s Model Safety Leaderboard is a solid step forward. It evaluates models across multiple areas, combining quality, safety, cost, and performance metrics to help organizations make smarter deployment decisions.

The safety evaluation covers three main areas:

HarmBench tests for harmful behaviors in three categories: standard harmful behaviors (cybercrime, illegal activities, general harm), contextually harmful behaviors (harassment, bullying), and copyright violations. It uses direct prompts without attack strategies and measures Attack Success Rate (ASR), where lower numbers mean safer models.

Toxigen checks if a model can detect toxic content using a dataset covering 13 minority groups. It calculates F1 scores for classification performance, with higher scores meaning better detection. Note that this measures detection ability, not generation resistance — a model that creates toxic content might still score well if it can identify toxicity in other text.

WMDP (Weapons of Mass Destruction Proxy) tests model knowledge in sensitive areas like biosecurity, cybersecurity, and chemical security. Higher accuracy scores indicate more dangerous knowledge, which is worse from a safety standpoint.
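To make these three headline numbers concrete, here is a minimal sketch of how they are typically computed from per-example results. The function names and input shapes are assumptions for illustration, not Azure AI Foundry’s actual evaluation pipeline.

```python
# Minimal sketch of the three headline metrics, assuming each benchmark
# reduces to simple per-example labels. Names and shapes are illustrative,
# not Azure AI Foundry's actual evaluation code.

def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """HarmBench-style ASR: fraction of prompts where the model produced
    the harmful behavior. Lower is safer."""
    return sum(attack_succeeded) / len(attack_succeeded)

def detection_f1(preds: list[int], labels: list[int]) -> float:
    """Toxigen-style F1 for toxicity *detection* (1 = toxic, 0 = benign).
    Higher means better classification, not lower toxic generation."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def wmdp_accuracy(correct: list[bool]) -> float:
    """WMDP-style multiple-choice accuracy: higher means the model retains
    more hazardous knowledge, so higher is *worse* from a safety standpoint."""
    return sum(correct) / len(correct)
```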

The platform does acknowledge important limitations. It notes that safety is complex and multidimensional, and no single benchmark can represent full system safety. Many benchmarks suffer from saturation or misalignment between design and risk definition, potentially leading to overestimating or underestimating real-world safety performance.

This represents meaningful progress, but a deeper look reveals substantial gaps that limit the leaderboard’s effectiveness.

Where the Leaderboard Falls Short

1. Major Coverage Gaps

The biggest issue is what the leaderboard doesn’t test. Several high-impact risk areas are completely missing, creating dangerous blind spots.

The most obvious gap is demographic bias and fairness. A model can get high safety scores while consistently reinforcing harmful stereotypes about race, gender, religion, or socioeconomic status. This creates both reputational and legal risks for organizations deploying customer-facing systems.

Self-harm and suicide scenarios are entirely absent. As AI systems increasingly serve as first contact points for users in distress, this gap could have life-threatening consequences, especially for mental health applications.

Other missing areas include regulated substances, weapons, criminal planning, and insecure code generation, among others. The Toxigen evaluation adds a paradox of its own: a model could score well on detecting toxicity while still generating it, creating false confidence.

2. Unrealistic Adversarial Testing

The leaderboard’s approach to adversarial testing doesn’t match how real attacks happen. All evaluations use direct prompts only, explicitly excluding jailbreak attempts, prompt injection, chain-of-thought leaks, and style-transfer evasions that bad actors actually use.

Single-turn evaluation misses the dynamic nature of real attacks, where adversaries use multi-turn conversations to gradually build trust or establish context. Inconsistent safety filter application (disabled for HarmBench/Toxigen but enabled for WMDP) creates non-comparable results and potential blind spots.
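To see what the single-turn setup leaves out, consider a minimal sketch of a multi-turn probe, where the attacker builds context over several turns before making the harmful request. Here `model` and `judge_is_harmful` are hypothetical callables standing in for a chat client and a harm classifier.

```python
# Hypothetical multi-turn probe. `model` takes a chat history and returns a
# reply; `judge_is_harmful` classifies a reply. Both are placeholders for
# whatever client and judge an evaluation harness would plug in.

def single_turn_probe(model, prompt: str, judge_is_harmful) -> bool:
    """The leaderboard-style setup: one direct prompt, one verdict."""
    return judge_is_harmful(model([{"role": "user", "content": prompt}]))

def multi_turn_probe(model, scripted_turns: list[str], judge_is_harmful) -> bool:
    """A scripted escalation: establish rapport and context first, then ask.
    The attack counts as successful if any reply along the way is harmful."""
    history = []
    for turn in scripted_turns:
        history.append({"role": "user", "content": turn})
        reply = model(history)
        history.append({"role": "assistant", "content": reply})
        if judge_is_harmful(reply):
            return True
    return False
```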

The evaluation completely ignores data poisoning, model poisoning, and improper output handling, despite these being highlighted in the OWASP LLM Top 10 as critical vulnerabilities.

3. Inadequate Metrics

The current measurement approach treats all safety violations equally through unweighted Attack Success Rate, failing to distinguish between mild policy violations and severe harmful instructions. Without severity tiers or failure explanations, organizations can’t understand the nature of model weaknesses or implement appropriate fixes.
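One straightforward refinement is to weight each failure by severity instead of counting every violation as one. The tiers and weights below are illustrative assumptions, not part of any published leaderboard.

```python
# Illustrative severity-weighted ASR. Tier names and weights are assumptions
# chosen for the example; a real scheme would be defined by policy.
SEVERITY_WEIGHTS = {"low": 0.2, "medium": 0.5, "high": 1.0}

def weighted_asr(results: list[dict]) -> float:
    """results: [{'failed': bool, 'severity': 'low'|'medium'|'high'}, ...]
    A handful of 'high' failures now moves the score more than many 'low' ones."""
    if not results:
        return 0.0
    penalty = sum(SEVERITY_WEIGHTS[r["severity"]] for r in results if r["failed"])
    return penalty / len(results)
```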

Static public datasets create artificial score inflation over time as models are optimized for these specific benchmarks, potentially hiding real-world fragility. The absence of confidence intervals means small ranking differences may be statistically meaningless noise.
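Error bars are cheap to add: a percentile bootstrap over per-prompt outcomes gives a confidence interval around each score, so heavily overlapping intervals flag ranking differences that are probably noise. A minimal sketch:

```python
# Sketch of a percentile-bootstrap confidence interval around an ASR (or any
# per-prompt pass/fail metric), so models can be compared with error bars.
import random

def bootstrap_ci(outcomes: list[bool], n_resamples: int = 2000, alpha: float = 0.05):
    """Resample per-prompt outcomes with replacement and return (low, high)
    bounds of the (1 - alpha) interval for the failure rate."""
    stats = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```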

4. Enterprise Adoption Barriers

The leaderboard lacks mapping to established frameworks like NIST AI RMF, ISO 42001, or OWASP LLM Top 10, forcing security teams to manually translate results. The “preview-only” disclaimer with no SLA undermines reliability for production decisions.

Inconsistent safety filter configurations mean organizations can’t use the leaderboard setup as a turnkey policy implementation, reducing standardization benefits.

Better Solutions Emerging

The limitations of current safety leaderboards haven’t gone unnoticed. Enkrypt AI has developed a safety leaderboard that directly addresses many of these shortcomings. Their evaluation framework covers five critical test categories that map directly to established risk frameworks.

Bias testing evaluates harmful stereotypes and discriminatory behavior across protected attributes including race, gender, religion, health, socioeconomic status, family structure, and literacy. Harmful content evaluation addresses the critical gaps in self-harm, suicide promotion, extremism, hate speech, sexual content, regulated substances, weapons, and criminal planning. Toxicity assessment goes beyond simple detection to measure nuanced categories including identity attacks, profanity, severe toxicity, sexually explicit content, insults, threats, and flirtation.

Most importantly, Enkrypt AI includes insecure code generation testing that checks for vulnerabilities like injection flaws, hardcoded secrets, and poor input sanitization. Their CBRN (Chemical, Biological, Radiological, Nuclear) evaluation specifically tests assistance with weapons of mass destruction synthesis and deployment, addressing real-world proliferation threats.
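As a rough idea of what checking for insecure generated code involves, here is a toy pattern-based scan for two of the issues mentioned above. Real code-security evaluations rely on proper static analysis; these regexes are illustrative only.

```python
# Toy scan of model-generated code for hardcoded secrets and naive SQL string
# building. Real evaluations use proper static analysis; these patterns are
# illustrative only.
import re

INSECURE_PATTERNS = {
    "hardcoded_secret": re.compile(
        r"(api_key|password|secret|token)\s*=\s*['\"][^'\"]{4,}['\"]", re.IGNORECASE
    ),
    "sql_string_building": re.compile(
        r"execute\(\s*(f['\"]|['\"].*\+)", re.IGNORECASE
    ),
}

def scan_generated_code(code: str) -> list[str]:
    """Return the names of insecure patterns detected in the generated code."""
    return [name for name, pattern in INSECURE_PATTERNS.items() if pattern.search(code)]

# Example: flags a hardcoded credential in a generated snippet.
print(scan_generated_code('api_key = "sk-123456"'))  # ['hardcoded_secret']
```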

Each test category explicitly maps to both NIST AI Risk Management Framework categories and OWASP LLM Top 10 vulnerabilities, providing the governance alignment that enterprise organizations need. This comprehensive approach shows that more thorough safety evaluation is not only possible but increasingly necessary as AI systems become more capable and widespread.
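In practice, that alignment can be as simple as a lookup table that travels with the test results. The entries below are placeholders showing the shape of such a mapping, not Enkrypt AI’s published control mappings.

```python
# Placeholder category-to-framework mapping showing the shape of the data;
# the specific pairings are illustrative, not Enkrypt AI's published mapping.
CONTROL_MAP: dict[str, dict[str, list[str]]] = {
    "insecure_code_generation": {
        "nist_ai_rmf": ["<relevant MEASURE subcategory>"],
        "owasp_llm_top10": ["<relevant Top 10 item>"],
    },
    "harmful_content": {
        "nist_ai_rmf": ["<relevant MANAGE subcategory>"],
        "owasp_llm_top10": ["<relevant Top 10 item>"],
    },
}

def frameworks_for(category: str) -> dict[str, list[str]]:
    """Let a governance team pull the framework hooks for a test category."""
    return CONTROL_MAP.get(category, {})
```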

The Path Forward

These shortcomings don’t mean we should give up on standardized safety evaluation. Instead, they highlight the need for more comprehensive approaches. Effective safety leaderboards must address the full spectrum of AI risks, incorporate realistic adversarial testing, provide sophisticated metrics with severity weighting, and align with established governance frameworks.

Most importantly, they must be transparent about limitations and position themselves as part of a broader safety ecosystem rather than complete solutions. Organizations need to understand where additional testing and validation are required.

The Bottom Line

Azure AI Foundry’s Model Safety Leaderboard represents important progress in AI safety evaluation. It shows the industry’s growing recognition that safety must be measured systematically. However, current safety leaderboards provide only a thin slice of the comprehensive evaluation needed for safe AI deployment.

For organizations considering LLM deployment, these leaderboards offer valuable insights but can’t serve as the sole basis for safety decisions. They provide a useful starting point that must be supplemented with comprehensive bias audits, adversarial red-team exercises, code security assessments, and explicit policy mappings.

The future of AI safety depends on building evaluation frameworks that match the sophistication of the systems we’re assessing. While current leaderboards fall short of this goal, they represent crucial steps toward a more transparent and accountable AI ecosystem. The challenge now is to build on these foundations while addressing their fundamental limitations, ensuring that safety evaluation keeps pace with rapid AI advancement.

The goal shouldn’t be perfect safety evaluation, but rather comprehensive, honest, and continuously improving assessment that gives organizations the information they need to make informed decisions about AI deployment. Only through such rigorous evaluation can we hope to realize AI’s benefits while minimizing its risks.

About Enkrypt AI

Enkrypt AI helps companies build and deploy generative AI securely and responsibly. Our platform automatically detects, removes, and monitors risks like hallucinations, privacy leaks, and misuse across every stage of AI development. With tools like industry-specific red teaming, real-time guardrails, and continuous monitoring, Enkrypt AI makes it easier for businesses to adopt AI without worrying about compliance or safety issues. Backed by global standards like OWASP, NIST, and MITRE, we’re trusted by teams in finance, healthcare, tech, and insurance. Simply put, Enkrypt AI gives you the confidence to scale AI safely and stay in control.

Meet the Writer
Nitin Birur