AI security red-teaming competitions – in which participants compete to develop new attacks against AI models and defenses – provide a unique way to assess how secure today’s AI systems are in the face of adversarial pressure. CAISI recently partnered with Gray Swan, the UK AI Security Institute (UK AISI), and several frontier AI labs to publish a new research paper based on data from a large-scale public AI agent red-teaming competition, revealing several insights into the robustness of current leading AI models.
Extract authors, key findings, references, and an executive summary using AI.
Large Language Model (LLM) agents are increasingly autonomous and integrated into high-stakes environments, creating new security vectors. This paper reports on a major public red-teaming competition focused on 'indirect prompt injection' attacks, where adversaries hide malicious instructions in external data processed by agents. A crucial requirement for these attacks is concealment—the ability to perform unauthorized actions while appearing to function normally in the final user-facing response. The competition involved 464 participants who submitted over 271,000 attacks against 13 frontier models, resulting in 8,648 successful compromises. The researchers identified universal attack templates that generalize across different models and behaviors, suggesting that current instruction-following training processes possess fundamental vulnerabilities. Overall, models in the Claude and GPT families displayed higher robustness, while models like Gemini 2.5 Pro exhibited higher vulnerability in complex settings such as computer use. Statistical analysis reveals that robustness is tied more closely to model family and training recipes than to general capabilities. The findings underline that current defense mechanisms are inadequate, as even robust models are not immune to these attacks. The researchers provide a new, recurrent benchmark and call for a shift toward system-level and architectural defenses that can isolate untrusted inputs from agent control flows.
LLM based agents are increasingly deployed in high stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent’s final response, an attack can conceal its existence by presenting no clue of compromise in the final user facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large scale public red teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272,000 attack attempts against 13 frontier models, yielding 8,648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction following architectures. Capability and robustness showed weak correlation, with Gemini 2.5 Pro exhibiting both high capability and high vulnerability. To address benchmark saturation and obsoleteness, we will endeavor to deliver quarterly updates through continued red teaming competitions. We open source the competition environment for use in evaluations, along with 95 successful attacks against Qwen that did not transfer to any closed source model. We share model-specific attack data with respective frontier labs, and the full dataset with the UK AISI and US CAISI to support robustness research.
1.All 13 evaluated frontier models proved vulnerable to indirect prompt injection attacks.
2.Attack success rates (ASR) ranged from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro).
3.A total of 8,648 successful attacks were recorded across 41 distinct scenarios.
The authors discuss the susceptibility of all evaluated models to indirect prompt injection, emphasizing that even the most robust models remain vulnerable. They highlight the need for system-level and architectural defenses rather than solely relying on model-level robustness training. Future directions include exploring the effects of 'thinking' mode and chain-of-thought monitoring on robustness, improving the realism of evaluation scenarios (especially for tool-use), and developing multi-shot evaluation processes to ensure stability.