
PHARE Benchmark Analysis: Disparities in LLM Safety and Robustness

New data from the PHARE LLM benchmark reveals significant variance in how large language models handle security challenges. This analysis highlights the efficacy of different alignment strategies and provides organizations with performance metrics to guide secure model selection.

Triage Security Media Team
3 min read

Recent data from Giskard’s second Potential Harm Assessment & Risk Evaluation (PHARE) benchmark report offers a detailed view of the current security environment for Large Language Models (LLMs). While the industry continues to advance model capabilities and revenue potential, the data suggests that improvements in safety and security controls are not keeping pace across the board.

The PHARE report evaluates models from major providers, including OpenAI, Anthropic, xAI, Meta, and Google, on critical safety metrics such as resistance to jailbreaks, hallucination rates, and bias mitigation. The findings indicate that while some providers are raising the bar, widespread progress remains limited.

Susceptibility to Adversarial Prompting

Security researchers have consistently demonstrated that chatbots can be manipulated into bypassing their safety guardrails. The PHARE data confirms that many models remain vulnerable not only to novel techniques but also to well-known adversarial prompts.

Researchers tested a wide range of models, including recent versions of GPT, Claude, Gemini, Deepseek, Llama, Qwen, Mistral, and Grok, against disclosed jailbreak methodologies. The results showed distinct tiers of performance:

  • GPT models blocked these attempts between 66% and 75% of the time.

  • Gemini models (excluding the 3.0 Pro variant) typically defended against approximately 40% of attempts.

  • Deepseek and Grok showed significantly lower resistance, raising concerns about their suitability for environments requiring strict safety controls.
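
The figures above are block rates: the share of adversarial prompts a model refuses to act on. As a rough illustration only, the sketch below shows how such a rate could be computed; the model-calling function and the keyword-based refusal heuristic are hypothetical placeholders and do not reflect PHARE's actual evaluation harness.

```python
# Illustrative sketch only: a minimal block-rate calculation of the kind
# reported above. `query_model` and the refusal heuristic are hypothetical
# placeholders, not the PHARE methodology.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: treat responses containing refusal phrasing as blocked."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def jailbreak_block_rate(query_model: Callable[[str], str],
                         adversarial_prompts: Iterable[str]) -> float:
    """Fraction of adversarial prompts the model refuses to act on."""
    prompts = list(adversarial_prompts)
    blocked = sum(1 for p in prompts if looks_like_refusal(query_model(p)))
    return blocked / len(prompts)
```

A production harness would use far more robust judging than keyword matching, but the headline number has the same shape: blocked attempts divided by total attempts.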

An analysis of model architecture reveals a counterintuitive trend: increased model size and sophistication do not correlate with improved robustness against jailbreaks. In some instances, smaller models successfully blocked prompts that compromised larger ones.

This phenomenon likely stems from the advanced reasoning capabilities of larger models. While they possess more knowledge, they are also better equipped to parse complex, multi-step role-playing scenarios or encoding schemes used in adversarial prompts. Smaller models, lacking the capacity to interpret these complex instructions, may default to a refusal state more often, inadvertently resulting in a safer outcome.

Matteo Dora, Chief Technology Officer at Giskard, notes that capability often brings complexity. "It's not directly proportional, but clearly with more capabilities you have more risks, because the [exposure] surface is much larger and the things that can go wrong increase," says Dora. He adds that while decoding prompts is one risk, capable models can also be more effective at misdirection or concealing objectives, increasing the burden on security teams to monitor outputs.

Performance Across Safety Metrics

Beyond jailbreaks, the benchmark evaluated resistance to prompt injection and the generation of misinformation.

  • Prompt Injection: Gemini (excluding 3.0 Pro) and Deepseek demonstrated resistance rates around 40% to 50%. Fifth-generation GPT models performed better, scoring above 80%.

  • Misinformation: GPT models maintained a passing standard, while other model families struggled to consistently separate fact from hallucination.

One positive trend emerged across the industry: the refusal to generate overtly harmful content. Most models tested in the PHARE benchmark successfully declined to provide instructions for dangerous activities or criminal behavior. This suggests that basic safety filters regarding high-risk topics are functioning effectively across most platforms, particularly in newer reasoning models.

Evaluating Alignment Methodologies

The data identifies one model family that consistently outperforms peers across safety metrics: Anthropic’s Claude.

Claude 4.1 and 4.5 models resisted jailbreak attempts 75% to 80% of the time. The benchmark also recorded high performance for Claude in preventing harmful content generation, reducing hallucinations, and mitigating bias. The performance gap is significant enough that Anthropic’s data points visibly skew the industry average.

When plotted over time, the industry-wide trend line (average progress) appears to show steady improvement. However, closer inspection reveals that Anthropic's models are the primary drivers of this upward trend. If Anthropic's data is removed, the industry trajectory for safety improvements flattens significantly, suggesting that most other providers are maintaining the status quo rather than advancing security standards.
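
To make the flattening effect concrete, the sketch below fits a simple least-squares trend line to invented, purely hypothetical scores (not PHARE data), with and without one provider's points. The provider labels and numbers are placeholders chosen only to illustrate the statistical point.

```python
# Hypothetical illustration only: made-up scores, not PHARE data. Shows how
# removing one provider's points from a least-squares trend line can flatten
# the apparent industry-wide improvement.
import numpy as np

# (release_quarter_index, safety_score, provider) -- invented values
points = [
    (0, 55, "other"), (1, 56, "other"), (2, 55, "other"), (3, 57, "other"),
    (0, 70, "anthropic"), (1, 75, "anthropic"), (2, 79, "anthropic"), (3, 83, "anthropic"),
]

def trend_slope(pts):
    x = np.array([p[0] for p in pts], dtype=float)
    y = np.array([p[1] for p in pts], dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    return slope

print("slope, all providers:      ", round(trend_slope(points), 2))
print("slope, excluding anthropic:", round(trend_slope([p for p in points if p[2] != "anthropic"]), 2))
```

With these invented numbers, the fitted slope drops sharply once the "anthropic" rows are excluded, mirroring the flattening the report describes.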

The divergence in performance may be attributed to different engineering philosophies. Dora suggests that the difference lies in when safety is introduced in the development lifecycle.

"Anthropic has what they call the 'alignment engineers', people in charge of tuning, let's say, the personality and also the safety parts of the model's behavior. They embed it in all the training phases. They consider it part of the intrinsic quality of the model," Dora explains.

In contrast, other major providers, such as OpenAI, have historically treated alignment as a refinement step applied to the raw product at the end of the pipeline. "Performance gets bundled in the pipeline, and then there's this last step to refine the behavior," Dora notes. "So some people are saying these two different schools really lead to different results."

For organizations integrating LLMs into their infrastructure, understanding these architectural differences is vital. Models built with safety as an intrinsic component of training may offer a more resilient foundation for secure applications.
