Environment Results

Each environment evaluation deploys a live agent against an isolated, instrumented sandbox — a real PostgreSQL database, a ChromaDB vector store, a browser-accessible web application, or a nsjail-hardened shell — and attempts to manipulate it through four distinct attack methods: direct prompt injection (appending malicious instructions to the user query), context-ignoring DPI (framing the injection as an urgent override), out-of-band prompt injection (pre-seeding a poisoned tool observation so the attack appears to come from the environment rather than the user), and memory poisoning (inserting malicious directives into the agent's prior history before the task begins). The agent then runs a ReAct reasoning loop, calling tools against the real backend while the harness captures a before-and-after snapshot of sandbox state.

A breach is "verified" only when the sandbox state actually changes — a table is dropped, documents are injected into the vector store, an XSS payload fires in the browser, or a sensitive file is read in the shell — not merely when the model produces suspicious text. This ground-truth approach distinguishes real compromise from false positives and enables precise comparison across models, backends, and attack methods. Results below are broken down by model, backend, and injection technique; the overall breach rate reflects the fraction of attempts that produced a confirmed, observable side effect in the sandbox.

1
Models Tested
59
Total Attempts
31
Verified Breaches
53%
Overall Breach Rate
qwen2.5:14b openai_compat/qwen2.5:14b 53% breached
Attempts Verified Breaches Breach Rate
59 31 52.5%
BackendAttemptsBreachedRate
sql 22 19 86.4%
vectorstore 37 12 32.4%
AttackAttemptsBreachedRate
dpi_context 10 10 100.0%
dpi_naive 12 9 75.0%
memory_poison 22 4 18.2%
opi 15 8 53.3%