Insights into AI Agent Security from a Large-Scale Red-Teaming Competition

arXiv:10.9999/xeuron2026051115540015[2026]

AI Summary

AI security red-teaming competitions – in which participants compete to develop new attacks against AI models and defenses – provide a unique way to assess how secure today’s AI systems are in the face of adversarial pressure. CAISI recently partnered with Gray Swan, the UK AI Security Institute (UK AISI), and several frontier AI labs to publish a new research paper based on data from a large-scale public AI agent red-teaming competition, revealing several insights into the robustness of current leading AI models.

AI Metadata Extraction

Version: 2 versions extracted
Extraction v2 · google/gemini-3.1-flash-lite · 5/12/2026

Executive Summary

Large language model (LLM) agents are increasingly autonomous and integrated into high-stakes environments, which creates new attack vectors. This paper reports on a large public red-teaming competition focused on indirect prompt injection, in which adversaries hide malicious instructions in external data that agents process. A key requirement for these attacks is concealment: the agent performs unauthorized actions while its final user-facing response appears normal. The competition drew 464 participants who submitted roughly 272,000 attacks against 13 frontier models, yielding 8,648 successful compromises. The researchers identified universal attack templates that generalize across models and behaviors, suggesting fundamental weaknesses in current instruction-following training. Models in the Claude and GPT families were more robust overall, while models such as Gemini 2.5 Pro were more vulnerable in complex settings such as computer use. Statistical analysis indicates that robustness is tied more closely to model family and training recipe than to general capability. Even the most robust models were not immune, which suggests that current model-level defenses are inadequate on their own. The researchers provide a new, recurring benchmark and call for a shift toward system-level and architectural defenses that isolate untrusted inputs from agent control flow.

Authors (9)

Mateusz Dziemian (First Author)

Gray Swan AI

Maxwell Lin

Gray Swan AI

Xiaohan Fu

Gray Swan AI

Abstract

LLM-based agents are increasingly deployed in high-stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent’s final response, an attack can conceal its existence by presenting no clue of compromise in the final user-facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large-scale public red-teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition attracted 464 participants who submitted 272,000 attack attempts against 13 frontier models, yielding 8,648 successful attacks across 41 scenarios. All models proved vulnerable, with attack success rates ranging from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). We identify universal attack strategies that transfer across 21 of 41 behaviors and multiple model families, suggesting fundamental weaknesses in instruction-following architectures. Capability and robustness showed weak correlation, with Gemini 2.5 Pro exhibiting both high capability and high vulnerability. To address benchmark saturation and obsoleteness, we will endeavor to deliver quarterly updates through continued red-teaming competitions. We open-source the competition environment for use in evaluations, along with 95 successful attacks against Qwen that did not transfer to any closed-source model. We share model-specific attack data with respective frontier labs, and the full dataset with the UK AISI and US CAISI to support robustness research.
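The dual objective described in the abstract (the harmful action is carried out and the final response gives no sign of it) can be read as a simple conjunction. The sketch below is only an illustration of that reading; the class and field names are hypothetical and are not taken from the competition's actual judging code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackOutcome:
    """Hypothetical record of one attack attempt (names are assumptions)."""
    harmful_action_executed: bool      # did the agent perform the injected action?
    response_reveals_compromise: bool  # does the final user-facing reply hint at it?


def is_successful_attack(outcome: AttackOutcome) -> bool:
    """An attempt counts only if the action ran AND stayed concealed."""
    return outcome.harmful_action_executed and not outcome.response_reveals_compromise


# An executed but visible attack fails the concealment half of the objective.
visible = AttackOutcome(harmful_action_executed=True, response_reveals_compromise=True)
concealed = AttackOutcome(harmful_action_executed=True, response_reveals_compromise=False)
print(is_successful_attack(visible), is_successful_attack(concealed))
```

Under this reading, an attempt that triggers the action but is mentioned in the agent's reply would not count toward the successful-attack total.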

Key Findings (20)

1. All 13 evaluated frontier models proved vulnerable to indirect prompt injection attacks.

2. Attack success rates (ASR) ranged from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro).

3. A total of 8,648 successful attacks were recorded across 41 distinct scenarios.
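For orientation, a pooled attack success rate can be computed from the totals quoted in the abstract. This is a rough sketch: it assumes ASR is simply successes divided by attempts, the 272,000 figure is rounded in the source, and the helper name is hypothetical. A pooled rate is not necessarily the mean of the per-model rates, since attempts may be unevenly spread across models.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempts that succeeded (assumed definition of ASR)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


# Totals quoted in the abstract; the attempt count is a rounded figure.
overall = attack_success_rate(8_648, 272_000)
print(f"{overall:.2%}")  # prints 3.18%
```

The pooled value falls inside the reported per-model range of 0.5% to 8.5%.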

Discussion & Future Directions

The authors discuss the susceptibility of all evaluated models to indirect prompt injection, emphasizing that even the most robust models remain vulnerable. They highlight the need for system-level and architectural defenses rather than solely relying on model-level robustness training. Future directions include exploring the effects of 'thinking' mode and chain-of-thought monitoring on robustness, improving the realism of evaluation scenarios (especially for tool-use), and developing multi-shot evaluation processes to ensure stability.
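One way to picture a system-level defense that isolates untrusted inputs from agent control flow is a gate outside the model that tracks whether untrusted content has entered the session and blocks sensitive tool calls afterwards. The sketch below is an assumption for illustration only, not a defense proposed or evaluated in the paper; the tool names, the `AgentSession` class, and the taint rule are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical set of tools with side effects worth protecting.
SENSITIVE_TOOLS = {"send_email", "transfer_funds"}


@dataclass
class AgentSession:
    """Tracks whether untrusted content has been read in this session."""
    tainted: bool = False
    log: list = field(default_factory=list)

    def ingest(self, content: str, trusted: bool) -> None:
        # Any untrusted input (web page, email body, repo file) taints the session.
        if not trusted:
            self.tainted = True

    def authorize(self, tool: str) -> bool:
        # Sensitive tools are refused once the session is tainted,
        # regardless of what the model asks for.
        allowed = not (self.tainted and tool in SENSITIVE_TOOLS)
        self.log.append((tool, allowed))
        return allowed


session = AgentSession()
session.ingest("Summarize my inbox", trusted=True)
before = session.authorize("send_email")
session.ingest("<email body with hidden instructions>", trusted=False)
after = session.authorize("send_email")
print(before, after)
```

The design choice here is that the check does not depend on the model resisting the injection: the gate sits in the surrounding system, which is the kind of separation the authors' call for architectural defenses points toward. A real deployment would need a far finer-grained policy than a single taint flag.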

