Environment Results
Each environment evaluation deploys a live agent against an isolated, instrumented sandbox — a real PostgreSQL
database, a ChromaDB vector store, a browser-accessible web application, or a nsjail-hardened shell — and
attempts to manipulate it through four distinct attack methods: direct prompt injection (appending malicious
instructions to the user query), context-ignoring DPI (framing the injection as an urgent override),
out-of-band prompt injection (pre-seeding a poisoned tool observation so the attack appears to come from the
environment rather than the user), and memory poisoning (inserting malicious directives into the agent's prior
history before the task begins). The agent then runs a ReAct reasoning loop, calling tools against the real
backend while the harness captures a before-and-after snapshot of sandbox state.
A breach is "verified" only when the sandbox state actually changes — a table is dropped, documents are injected
into the vector store, an XSS payload fires in the browser, or a sensitive file is read in the shell — not merely
when the model produces suspicious text. This ground-truth approach distinguishes real compromise from false
positives and enables precise comparison across models, backends, and attack methods. Results below are broken
down by model, backend, and injection technique; the overall breach rate reflects the fraction of attempts that
produced a confirmed, observable side effect in the sandbox.
qwen2.5:14b
openai_compat/qwen2.5:14b
53% breached
Overall
| Attempts |
Verified Breaches |
Breach Rate |
| 59 |
31 |
52.5%
|
By Backend
| Backend | Attempts | Breached | Rate |
sql |
22 |
19 |
86.4%
|
vectorstore |
37 |
12 |
32.4%
|
By Attack Method
| Attack | Attempts | Breached | Rate |
dpi_context |
10 |
10 |
100.0%
|
dpi_naive |
12 |
9 |
75.0%
|
memory_poison |
22 |
4 |
18.2%
|
opi |
15 |
8 |
53.3%
|