Behavioral Firewalls for AI Agents: Compiling Tool-Call Telemetry into a Finite Automaton
Ciprian · 7 min read

What the paper studies
LLM agents invoke external services through tool-call protocols like MCP. Today’s firewalls intercept these calls, validate schemas, and scan signatures; each call is judged alone. The paper challenges that assumption. An adversary injecting instructions into a database row, web page, or tool output can direct an agent through individually valid calls (read a record, format a summary, send an email) that together exfiltrate data or escalate privilege. Praetor compiles verified benign call sequences into a parameterized finite automaton and blocks any call whose context or parameters fall outside the learned bounds.
Methodology
Praetor splits enforcement into an offline profiling stage and an online runtime stage.
During profiling, the system processes a corpus of verified benign execution traces. Each trace is a sequence of tool calls with their parameters. The profiler uses the previous w tool names as state (w=3 by default), prunes states seen fewer than θ times (θ=3), and records the allowed parameter ranges for each transition. Numeric parameters are stored as observed ranges plus a slack factor (ε_num=0.05). String parameters are grouped by semantic similarity, then each group’s bounds are widened by ε_str=0.05. This parameterized approach, adapted from the symbolic finite automata formalism, represents infinite parameter spaces as rules rather than enumerating each variant, which avoids state explosion.
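To make the profiling stage concrete, here is a minimal sketch of how a profiler like this could be written. It is not the paper's implementation: function names, the trace format, and the simplification of applying one numeric slack factor ε to all parameters are assumptions for illustration.

```python
from collections import Counter, defaultdict

W, THETA, EPS = 3, 3, 0.05  # defaults reported in the paper

def compile_profile(traces, w=W, theta=THETA, eps=EPS):
    """traces: list of traces; each trace is a list of (tool_name, {param: value}).

    Returns {(state, tool): {param: (lo, hi)}} where state is the tuple of
    the previous w tool names, transitions with support < theta are pruned,
    and numeric bounds are widened by the slack factor eps.
    """
    support = Counter()
    bounds = defaultdict(dict)  # (state, tool) -> {param: (lo, hi)}
    for trace in traces:
        history = ("<start>",) * w
        for tool, params in trace:
            key = (history, tool)
            support[key] += 1
            for p, v in params.items():
                lo, hi = bounds[key].get(p, (v, v))
                bounds[key][p] = (min(lo, v), max(hi, v))
            history = history[1:] + (tool,)  # slide the w-wide window
    profile = {}
    for key, n in support.items():
        if n < theta:
            continue  # prune rarely seen transitions
        profile[key] = {
            p: (lo - eps * abs(lo), hi + eps * abs(hi))  # slack-widened range
            for p, (lo, hi) in bounds[key].items()
        }
    return profile
```

String clustering by embedding similarity is omitted here; the sketch treats all parameters as numeric.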
At runtime, a sidecar intercepts tool calls between the agent and the environment. The gateway looks up the current state, checks for a matching tool transition, and verifies parameters against the bounds. Calls without a valid transition are halted and logged. The lookup runs in constant time regardless of the number of automaton states, trace length, or tool vocabulary size.
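The runtime check can be sketched in a few lines, assuming the compiled profile is a hash map keyed on (state, tool) with per-parameter bounds; that shape, and the function below, are illustrative rather than the paper's API. A single dictionary lookup is what makes enforcement constant-time regardless of automaton size.

```python
def check_call(profile, history, tool, params):
    """Return (allowed, reason) for one intercepted tool call.

    profile: {(state, tool): {param: (lo, hi)}}; history: last w tool names.
    """
    key = (tuple(history), tool)
    allowed = profile.get(key)  # O(1) regardless of state count
    if allowed is None:
        return False, "no transition from this state"
    for p, v in params.items():
        if p not in allowed:
            return False, f"unknown parameter {p}"
        lo, hi = allowed[p]
        if not (lo <= v <= hi):
            return False, f"{p}={v} outside [{lo}, {hi}]"
    return True, "ok"
```

A gateway would halt and log the call whenever the first element of the result is False.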
Evaluation uses Agent Security Bench across five scenarios, three structured workflows. The baseline is Aegis, a published stateless scanner combining PII redaction, regex checks, and anomaly detection. The authors generate 1,000 spliced exfiltration payloads, synthetic chains combining benign call fragments with malicious goals, to test how many chains pass both structural and parameter checks.
Two threat models are evaluated. In the black-box model the adversary knows generic firewall tactics but not the specific automaton or historical traces. In the gray-box model the adversary has partial access to benign traces and attempts to reconstruct constraints. Telemetry poisoning before profile compilation, DOS attacks on the gateway, and API calls bypassing the SDK layer are out of scope.
Findings
On the five-scenario ASB suite, Praetor holds macro-averaged attack success rate (ASR) to 5.6%. On structured workflows alone, ASR falls to 2.2%, versus 12.8% for Aegis.
On multi-step and context-sequential injection attacks, Praetor reduces attack success to 0% in structured workflows. Stateless firewalls allow these same attacks at rates up to 75%, because no single call in the sequence breaks schema or signature rules.
Against 1,000 spliced exfiltration payloads, 14 (1.4%) matched a valid structural path. All 14 failed the string parameter guards, for a 0% end-to-end bypass rate (95% CI: [0%, 23.2%]). Synonym-substitution attacks against continuous bounds still succeed at 18%, which means exact-match whitelists on sensitive fields are needed.
Praetor adds 2.2 ms per call versus 8.1 ms for Aegis, a 3.7× speedup. The gateway performs no inference; runtime cost is hash lookups and embedding-distance checks against precomputed centroids.
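The embedding-distance part of that runtime cost can be sketched as a nearest-centroid membership test. This is an assumption about the mechanics (the paper states only that checks run against precomputed centroids); the function and the embedding source are hypothetical.

```python
import math

def within_cluster(vec, centroids):
    """Accept a string's embedding vector if it falls inside any learned cluster.

    centroids: list of (center_vector, radius), both precomputed offline,
    so the runtime cost is a handful of distance computations, no inference.
    """
    for center, radius in centroids:
        if math.dist(vec, center) <= radius:
            return True
    return False
```

In practice `vec` would come from an embedding model run at profile-compilation time for known values and at call time for new ones; the distance check itself stays cheap.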
At default settings (w=3, θ=3, ε=0.05), the benign task failure rate (BTFR) is 2.0%. Most failures are legitimate calls whose parameters fall outside the slack-widened bounds. A lower θ or higher ε reduces BTFR but admits more variant traces.
Limitations
The authors are explicit about scope. The cold-start problem remains: the profiling corpus must be verified clean, and an adversary who poisons telemetry before compilation defeats the system. DOS attacks on the gateway and API calls bypassing the SDK layer are also out of scope.
The system targets narrow-task agents with stable tools. Assistants with open-ended tasks, large tool catalogs, varied user inputs, or complex branching logic produce more diverse call patterns (higher behavioral entropy). At fixed θ, that increases the automaton state space and raises benign task failure.
The synonym-substitution result is the most serious weakness in scope. Continuous embedding bounds admit paraphrase attacks at an 18% success rate, meaning sensitive string parameters (addresses, paths, SQL, URLs) need exact-match whitelists, not clustering alone.
The evaluation covers one benchmark (ASB) against one baseline (Aegis), so generalization to other stacks is unclear. Whenever tools or task templates change, the automaton must be recompiled; the paper doesn’t quantify the effort to update profiles when managing many agents in production. The audit-log soundness claim depends on host integrity assumptions that are stated but not verified.
Real-world application
For teams running tool-using agents in any environment with sensitive APIs, the findings translate into specific moves.
Move enforcement to the tool-call boundary. Schema validators and prompt filters can’t see sequential context. For structured workflows with fixed call patterns (read, summarize, route, write), encode the pattern explicitly and enforce at dispatch. MCP sidecars are the natural place. Stateless firewalls allow context-sequential attacks at up to 75% because they judge calls in isolation.
Capture clean-room telemetry before launch. The automaton is per-deployment. Run production agents through a sandboxed staging period to collect benign traces under known-good inputs. That corpus becomes the security model. Treat profile generation as a release artifact like the model checkpoint.
Set window width and pruning per agent. The defaults (w=3, θ=3, ε=0.05) yield 2.0% BTFR on ASB structured workflows. Agents with longer chains or higher branching need wider windows; agents with sparse traces need lower θ to preserve legitimate paths. Tune against measured BTFR on held-out benign workloads, not intuition.
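Tuning against measured BTFR rather than intuition can be as simple as a grid sweep over θ and ε, recompiling the profile and validating held-out benign traces at each setting. A minimal sketch, where `compile_fn` and `check_fn` stand in for your own profiler and gateway check (both hypothetical names):

```python
def btfr(check_fn, profile, heldout_traces):
    """Fraction of held-out benign traces containing at least one blocked call."""
    failed = sum(
        1 for trace in heldout_traces
        if not all(check_fn(profile, state, tool, params)
                   for state, tool, params in trace)
    )
    return failed / len(heldout_traces)

def tune(compile_fn, check_fn, train, heldout,
         thetas=(2, 3, 5), epss=(0.02, 0.05, 0.1)):
    """Sweep (theta, eps) candidates and return the best setting.

    Picks the lowest BTFR; ties break toward the strictest profile
    (highest theta, then lowest eps).
    """
    results = {}
    for theta in thetas:
        for eps in epss:
            profile = compile_fn(train, theta=theta, eps=eps)
            results[(theta, eps)] = btfr(check_fn, profile, heldout)
    return min(results, key=lambda k: (results[k], -k[0], k[1]))
```

The held-out workload should be benign traffic the profiler never saw, or the sweep will understate BTFR.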
Whitelist sensitive string parameters explicitly. When learning parameter bounds, the paper groups similar strings together, so admin@example.com and admin2@example.com end up in the same cluster. At runtime, any string in the cluster passes: an attacker who changes the value slightly stays inside the cluster, and the check allows it. The paper found these synonym-substitution attacks succeed at 18%. For parameters controlling access (S3 bucket names, API endpoints, database table names, repository names, allowed git branches), hardcode the exact acceptable values instead of learning bounds from benign traces. Structural validation catches malformed calls; hardcoded whitelists block these synonym attacks.
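The whitelist layer is a few lines of exact-match checking on top of whatever learned bounds you keep. The parameter names and values below are illustrative, not from the paper:

```python
# Exact acceptable values for access-controlling parameters, maintained by
# hand as a release artifact rather than learned from traces.
SENSITIVE_WHITELIST = {
    ("s3_put", "bucket"): {"prod-reports", "staging-reports"},
    ("git_push", "branch"): {"main", "release"},
}

def check_sensitive(tool, params):
    """Reject any call whose sensitive parameter is not an exact whitelist match."""
    for p, v in params.items():
        allowed = SENSITIVE_WHITELIST.get((tool, p))
        if allowed is not None and v not in allowed:
            return False  # near-miss values like "prod-reports2" are rejected
    return True  # parameters not listed fall through to the learned bounds
```

Because the match is exact, an embedding-cluster near-miss that would pass a learned bound fails here.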
Track BTFR alongside ASR. A firewall that blocks 50% of legitimate work is worse than none, ops will disable it. Report both numbers in production. Alert on BTFR drift; failure rises when the agent’s call patterns diverge from the profile, exactly when you need profile updates.
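A BTFR drift alert can be a rolling window over benign-call outcomes compared against the measured baseline. A sketch, with the window size, baseline, and margin as deployment-specific assumptions:

```python
from collections import deque

class BtfrMonitor:
    """Alert when the rolling benign-block rate drifts above baseline + margin."""

    def __init__(self, window=1000, baseline=0.02, margin=0.02):
        self.outcomes = deque(maxlen=window)  # True = benign call was blocked
        self.baseline = baseline              # e.g. the 2.0% measured at rollout
        self.margin = margin

    def record(self, blocked: bool):
        self.outcomes.append(blocked)

    def drifting(self) -> bool:
        if not self.outcomes:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.baseline + self.margin
```

A firing alert is the signal that the agent's call patterns have diverged from the profile and a recompile is due.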
Plan an incremental update path. New tools create unknown transitions. Build an offline profiling pipeline that injects incremental telemetry batches without full recompilation. Do this before you ship. Code-generation agents add new tool wrappers every release; the security profile must keep pace without blocking deployment.
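One shape an incremental pipeline can take: profile a new telemetry batch on its own, then merge the resulting delta into the live profile by unioning transitions and widening parameter bounds, rather than recompiling the full corpus. The merge below assumes the profile format from earlier sketches ({(state, tool): {param: (lo, hi)}}) and is illustrative, not the paper's update mechanism:

```python
def merge_profiles(base, delta):
    """Merge a delta profile into a base profile without mutating either.

    New transitions are added; shared transitions keep the union of bounds.
    """
    merged = {k: dict(v) for k, v in base.items()}
    for key, params in delta.items():
        if key not in merged:
            merged[key] = dict(params)  # new tool transition
            continue
        for p, (lo, hi) in params.items():
            cur = merged[key].get(p)
            merged[key][p] = (lo, hi) if cur is None else (
                min(cur[0], lo), max(cur[1], hi))  # widen existing bounds
    return merged
```

Note that merging only ever loosens the profile; removing a retired tool still requires a recompile, and the delta batch needs the same verified-clean guarantee as the original corpus.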
Treat the audit log as evidence, not output. Halted calls go into a cryptographic log. Integrate that with your incident response pipeline and budget for retention.
For platform vendors. Firewall teams should publish per-call latency, BTFR, and attack categories caught. A fair comparison requires all three numbers, not just overall ASR.
For agent-driven code-generation and review. Tools that run long sequences (clone, branch, edit, test, PR, comment) are Praetor’s target. Profile the benign topology, lock down sensitive strings (allowed branches for force-push, registries for publish, hosts for fetch), and enforce at the tool layer, not the model.
References
- Paper: https://arxiv.org/abs/2604.26274v1
- Agent Security Bench (ASB): the benchmark used for all reported ASR and BTFR numbers.
- Aegis stateless tool-call firewall (cited as the published baseline).
- Forrest et al., short-sequence syscall intrusion detection, the host-IDS lineage Praetor adapts to LLM tool calling.
- Symbolic Finite Automata (SFA) formalism, used to represent the pDFA with continuous parameter domains.
- Model Context Protocol (MCP), the dispatch layer at which the gateway is deployed.