Organizations integrating AI models into their software dependency workflows should evaluate how these tools source and verify their upgrade recommendations.
Recent research from Sonatype evaluated the performance of "frontier" models (the most advanced AI models available) when tasked with providing upgrade and patching guidance for software dependencies. The data shows that while these tools offer productivity benefits, they frequently generate fabricated or inaccurate recommendations, complicating vulnerability management and potentially increasing technical debt.
To measure this, Sonatype’s research team analyzed 36,870 unique dependency upgrade recommendations across Maven Central, npm, PyPI, and NuGet between June and August 2025. The study encompassed a total of 258,000 recommendations generated by seven AI models from Anthropic, OpenAI, and Google.
The initial phase of the study, published in February 2026 as part of the State of the Software Supply Chain report, focused on OpenAI's GPT-5. The analysis found that the model often recommended software versions, upgrade paths, or security fixes that did not exist, with nearly 28% of the recommended dependency upgrades classified as hallucinations.
A second phase of the study evaluated newer models equipped with enhanced reasoning capabilities, including GPT-5.2, Anthropic's Claude Sonnet 3.7 and 4.5, Claude Opus 4.6, and Google's Gemini 2.5 Pro and 3 Pro. While these models showed measurable improvements, they continued to generate a significant volume of fabricated recommendations. According to the report, these failures can lead to wasted AI spend, diverted developer time, unresolved vulnerability exposure, and increased technical debt before code reaches production.
Evaluating recommendation accuracy
The research indicates that the primary limitation is not the reasoning capabilities of the models, which have advanced consistently. Instead, the models lack "ecosystem intelligence": the real-time dependency, vulnerability, compatibility, and enterprise policy context necessary to make safe remediation decisions.
Even the highest-performing models in the study fabricated approximately one out of every 16 dependency recommendations. To reduce hallucinations, frontier models often defaulted to a "no change" recommendation for about a third of the software components. However, this cautious approach resulted in the models failing to flag existing vulnerabilities. As a result, between 800 and 900 critical and high-severity vulnerabilities were left unaddressed in production code during the evaluation.
In other instances, the models recommended software versions that contained known vulnerabilities. The report noted that this occasionally put the AI stack itself at elevated risk, as the libraries used to train, fine-tune, orchestrate, and serve the models were updated to vulnerable versions based on the models' own guidance.
Sonatype co-founder and CTO Brian Fox noted that inaccurate guidance from AI models creates a subtle accumulation of technical debt. While organizations generally expect AI models to make occasional errors, the research indicates that flaws in dependency recommendations are becoming quietly integrated into standard development workflows.
"The most dangerous version of this problem isn't when the model gives you something obviously broken," Fox said. "It's when it gives you something plausible that preserves risk, misses the better upgrade path, and looks close enough to ship."
Grounding AI with real-time intelligence
The data provides a clear path forward for organizations using AI-assisted development. The study demonstrated that "grounding" AI models with live intelligence and context produces significantly safer outcomes. When comparing the ungrounded frontier models to a hybrid approach that applied real-time intelligence at inference time, the hybrid method yielded a nearly 70% reduction in critical and high-severity risk.
To test this methodology, researchers equipped GPT-5 Nano, the smallest model in the GPT-5 family, with a single function-calling tool backed by a version recommendation API. Supplying the model with ranked upgrade candidates, vulnerability counts, and developer trust scores led to a marked reduction in vulnerabilities compared to the ungrounded frontier models.
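As a rough illustration of that setup, the sketch below shows what a single function-calling tool backed by a version recommendation service could look like. The tool schema, field names, and data here are hypothetical assumptions for illustration only; Sonatype's actual API and the study's exact interface are not published in this article.

```python
# Hypothetical sketch of grounding an LLM with one function-calling tool
# backed by a version recommendation API. All names, fields, and data are
# illustrative assumptions, not Sonatype's actual interface.

# Tool schema the model would be given (OpenAI-style function calling).
TOOL_SCHEMA = {
    "name": "recommend_version",
    "description": "Return ranked upgrade candidates for a dependency, "
                   "with known-vulnerability counts and trust scores.",
    "parameters": {
        "type": "object",
        "properties": {
            "ecosystem": {"type": "string",
                          "enum": ["maven", "npm", "pypi", "nuget"]},
            "package": {"type": "string"},
            "current_version": {"type": "string"},
        },
        "required": ["ecosystem", "package", "current_version"],
    },
}

# Stubbed registry data standing in for a live intelligence feed.
_CANDIDATES = {
    ("npm", "lodash"): [
        {"version": "4.17.19", "critical_vulns": 0, "high_vulns": 1, "trust": 0.90},
        {"version": "4.17.21", "critical_vulns": 0, "high_vulns": 0, "trust": 0.97},
    ],
}

def recommend_version(ecosystem: str, package: str, current_version: str):
    """Tool handler: return candidates ranked safest-first, so the model
    can only choose among versions that actually exist in the registry."""
    candidates = _CANDIDATES.get((ecosystem, package), [])
    return sorted(
        candidates,
        key=lambda c: (c["critical_vulns"], c["high_vulns"], -c["trust"]),
    )

# When the model emits a tool call, the host executes it and feeds the
# grounded result back; the final answer is constrained to real data.
ranked = recommend_version("npm", "lodash", "4.17.15")
best = ranked[0]["version"] if ranked else None  # empty list => "no change"
```

The key design point is that the model never free-generates a version string: it can only select from candidates the tool returned, which is what eliminates fabricated versions and steers it toward the least-vulnerable option when no perfect upgrade path exists.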
The report found that grounding not only prevents hallucinations but also successfully steers the model toward component versions with fewer known vulnerabilities when a perfect upgrade path is unavailable.
Without live registry data, vulnerability intelligence, or compatibility context, AI models will continue to output errors that require engineering time to correct. Simply adding a human review step to the process is unlikely to prevent these issues if the reviewer is relying on the same incomplete data. As Fox explained, humans should set policies and constraints, but the systems providing recommendations must remain grounded in real-time software intelligence to support safe, effective decision-making.