Agent security needs reproducible numbers, not anecdotes.
A tool-using agent fails differently from a chatbot. Most published evaluations treat the two the same — accuracy on a benchmark, a few hand-picked exploits, no shared apparatus. The result is that "this defense works" is a vibe, not a measurement.
Read any agent-security threadnought from the past year. The pattern repeats: a researcher demonstrates a clever prompt-injection trick, a vendor adds a filter, the next blog post shows the filter being bypassed by an obfuscation a junior would think of in five minutes. Nothing is comparable. No scenario is reproducible. Two defenses can't be ranked because there's no shared apparatus to run them through. The field is full of evidence that something happens, and almost no evidence about how often, against what, and with what variance.
AEGIS exists to make defense comparisons boring. Same scenarios. Same payloads. Same scoring. Different guard stack — and a number you can defend in a code review. The harness is deterministic on purpose: every run produces the same trace from the same config, so a delta between two runs is a defense delta, not noise. Variance comes from the experiment design, never from the apparatus.
The first version uses a deterministic demo runner instead of a live LLM. That is a feature, not a shortcut: it removes model variance from the experiment, exposes the guard logic on its own terms, and keeps the host machine safe while exploring tool-misuse scenarios. A live-LLM integration sits behind the same provider-agnostic interface and ships once the deterministic apparatus has earned trust.
Audience is narrow on purpose: security engineers evaluating agent platforms, ML safety researchers running comparative studies, and product teams who want to argue with their own numbers before shipping. The output isn't a verdict. It's evidence — typed, timestamped, JSONL.