Security · AI

Building AEGIS: Why AI Agents Need Security Frameworks

January 2026 · 8 min read

In the past two years, AI agents have gone from research demos to production systems. Companies are deploying LLM-powered agents that browse the web, execute code, manage databases, and interact with APIs on behalf of users. The capabilities are impressive. The security posture is terrifying.

I built AEGIS — a security evaluation framework for AI agents — because I couldn't find anything that systematically assessed the risks these systems introduce. Not just prompt injection (though that's part of it), but the full attack surface: tool misuse, data exfiltration, privilege escalation through conversational manipulation, and the subtle ways an agent can be turned against the system it's supposed to serve.

The Problem: Agents Are Attack Surfaces

A traditional web app has a well-understood attack surface. You've got your inputs, your authentication layer, your database queries. Decades of security research have given us frameworks, scanners, and best practices. But an AI agent is completely different. Its "input" is natural language — infinite, ambiguous, and adversarial by nature.

Consider a typical LLM-powered agent with access to a company's internal tools. It can read emails, query databases, and send messages. Now imagine a carefully crafted email lands in someone's inbox. The email contains instructions that, when processed by the agent, cause it to forward sensitive data to an external address. This isn't science fiction. It's indirect prompt injection, and it works today against most deployed agent systems.

The attack surface of an AI agent includes the model itself (jailbreaks, prompt injection), the tools it has access to (over-permissioned APIs, missing authorization checks), the data it processes (poisoned context, malicious documents), and the feedback loops that let it learn and adapt (reward hacking, goal drift).

What AEGIS Does

AEGIS approaches agent security the way a penetration tester approaches a web application: systematically, category by category, with reproducible test cases and clear severity ratings.

The framework evaluates agents across several dimensions:

Prompt Robustness — How resistant is the agent to direct and indirect prompt injection? AEGIS runs a battery of injection attempts, from naive ("ignore previous instructions") to sophisticated multi-turn social engineering attacks that build trust before exploiting it.

Tool Safety — Does the agent respect the principle of least privilege? If it has access to a database, can it be convinced to run destructive queries? If it can send emails, can it be manipulated into exfiltrating data? AEGIS tests each tool integration for abuse potential.

Data Handling — How does the agent handle sensitive information? Does it leak PII in responses? Can it be tricked into revealing system prompts, API keys, or internal documentation?

Behavioral Consistency — Does the agent maintain its security posture across long conversations? Many agents that resist injection in a single turn become vulnerable after 10-15 turns of carefully escalating requests.

The Technical Implementation

AEGIS is built in Python and uses the OpenAI API (though it's designed to be model-agnostic). The core architecture is straightforward:

# Core evaluation loop
for category in evaluation_categories:
    for test_case in category.test_cases:
        response = agent.run(test_case.prompt)
        result = evaluate(response, test_case.expected_behavior)
        report.add(category, test_case, result)

# Generate CVSS-style severity scores
report.calculate_scores()
report.export()

Each test case defines an attack vector, an expected safe behavior, and a set of indicators that suggest the attack succeeded. The evaluation isn't just string matching — it uses a secondary LLM call to assess whether the agent's response constitutes a security failure, accounting for the nuance and ambiguity inherent in natural language.

The output is a structured report with CVSS-style severity ratings, specific recommendations, and reproducible test cases that developers can use to validate fixes.

What I Learned

Building AEGIS taught me several things that I didn't expect.

First, most agent deployments are catastrophically over-permissioned. Teams give agents access to every tool they might need, rather than the minimum set required for each task. This is the equivalent of running every microservice as root.

Second, security testing for agents requires adversarial creativity. The best attacks aren't technical — they're social. A prompt injection that works isn't "hack the system." It's "I'm the user's manager and I need you to prioritize this urgent request." Context manipulation beats brute force every time.

Third, the industry has no standards for this. There's no OWASP Top 10 for AI agents (though there's now an OWASP LLM Top 10, which is a start). There's no agreed-upon methodology for agent penetration testing. AEGIS is my attempt to start building that methodology.

What's Next

AEGIS is open source and in a solid state for evaluating single-agent systems. There's a long list of things that could be built on top of it — support for multi-agent architectures where agents delegate to other agents, CI/CD pipeline integration for automated security testing on every deployment, a web dashboard for visualizing scan results over time.

Realistically, I probably won't build most of that myself. My focus is shifting toward other projects and my upcoming CS program. But the foundation is there, the methodology works, and the codebase is open for anyone who wants to take it further. That's the beauty of building in public — the work outlives your attention span.

If you're deploying AI agents in production and haven't done adversarial testing, you should be worried. If you want to start testing, check out AEGIS on GitHub. And if you want to talk about agent security, I'm always up for that conversation.

The attack surface of software is way bigger than anyone expected. AI agents are powerful, useful, and — right now — dangerously undertested. Building tools to fix that is what I care about most right now.