Agent Execution Protocol v1.1 - A microkernel runtime for LLM agents with watchdog timers and ACID transactions
Problem
Current LLM agent frameworks treat the chat history as the single source of truth for state. This is architecturally equivalent to a kernel persisting its state only through stdin/stdout logs. It works temporarily, but predictably fails under load.
Three measurable failure modes:
- Undetected execution loops - no watchdog. The agent re-runs `write_file('config.json', data)` because the confirmation fell out of context. Tokens burn until `max_iterations`.
- Silent state corruption - LLM emits invalid JSON for a tool call. Some frameworks swallow it and proceed with `null`. Others abort. None roll back the file system. A half-written file persists.
- Quadratic token cost - context grows every iteration (O(n²) attention). No budgeting, no signal before truncation.
These aren't bugs. They are architectural consequences of treating a probabilistic system (LLM) as a general-purpose deterministic machine. With documented tool-call hallucination rates of 2-5% (ToolAlpaca, API-Bank), relying on the model to self-manage state is untenable past ~50 tool calls.
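To make two of the figures above concrete, here is a back-of-the-envelope sketch, not a measurement. It assumes each iteration appends a fixed number of tokens and resends the whole history (so cumulative cost grows quadratically), and that tool-call hallucinations are independent events at the quoted 2-5% rate. The function names and the 1,000 tokens-per-step figure are illustrative assumptions.

```python
def cumulative_tokens(iterations: int, tokens_per_step: int) -> int:
    """Total tokens sent when iteration i resends all i steps of history."""
    # Sum of tokens_per_step * i for i = 1..iterations (arithmetic series).
    return tokens_per_step * iterations * (iterations + 1) // 2


def p_any_hallucination(rate: float, calls: int) -> float:
    """Chance of at least one bad tool call, assuming independent calls."""
    return 1.0 - (1.0 - rate) ** calls


if __name__ == "__main__":
    # Cumulative cost for a few run lengths (hypothetical 1,000 tokens per step).
    for n in (10, 50, 90):
        print(n, cumulative_tokens(n, tokens_per_step=1_000))
    # Probability of at least one hallucinated call in a 50-call run.
    for rate in (0.02, 0.05):
        print(rate, round(p_any_hallucination(rate, calls=50), 2))
```

Under these assumptions, a 50-call run has roughly a 64% (at 2%) to 92% (at 5%) chance of containing at least one malformed call, which is the intuition behind the ~50-call ceiling.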
The AEP Approach
Instead of state-in-context, we define a deterministic sandbox operated by a microkernel runtime with an 8-register address space (R0-R7):
| Register | Function |
|---|---|
| R0 | Program counter |
| R1 | Watchdog timer (deadline, loop counter, state hash window) |
| R2 | Context budget (tokens used, remaining) |
| R3 | Sandbox state (content hash) |
| R4 | Error register (structured stderr: code + payload) |
| R5 | Schema registry (last tool + params) |
| R6 | Transaction buffer (write-ahead log for rollback) |
| R7 | Executive metadata (task_id, depth, cumulative tokens) |
These registers do not live in the LLM context. They live in the runtime - Python, Rust, Go, whatever. The LLM only interacts with them through tool calls routed by the microkernel, never through chat messages. This decouples context (compressible) from operational state (exact).
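As a rough illustration of registers living in the runtime instead of the context, the sketch below holds R0-R7 in a plain Python dataclass and routes tool calls through a tiny dispatcher. It is a minimal sketch and not the AEP reference runtime: `Registers`, `Microkernel`, `dispatch`, the field names, and the default budget are all hypothetical, and only R0 and R5 are updated here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Registers:
    """Hypothetical in-memory layout for R0-R7; field names are illustrative."""
    r0_pc: int = 0
    r1_watchdog: Dict[str, Any] = field(
        default_factory=lambda: {"deadline": None, "loop_count": 0, "hash_window": []}
    )
    r2_budget: Dict[str, int] = field(
        default_factory=lambda: {"used": 0, "remaining": 100_000}  # assumed limit
    )
    r3_state_hash: str = ""
    r4_error: Optional[Dict[str, Any]] = None
    r5_last_call: Optional[Dict[str, Any]] = None
    r6_wal: List[Dict[str, Any]] = field(default_factory=list)
    r7_meta: Dict[str, Any] = field(
        default_factory=lambda: {"task_id": None, "depth": 0, "cumulative_tokens": 0}
    )


class Microkernel:
    """Routes tool calls; the model only ever sees the returned tool result."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self.regs = Registers()
        self.tools = tools

    def dispatch(self, tool: str, params: Dict[str, Any]) -> Any:
        # Record the call in R5, run the tool, then advance the program counter.
        self.regs.r5_last_call = {"tool": tool, "params": params}
        result = self.tools[tool](**params)
        self.regs.r0_pc += 1
        return result


kernel = Microkernel({"echo": lambda text: text})
reply = kernel.dispatch("echo", {"text": "hello"})
```

The point of the design is that `kernel.regs` is never serialized into a chat message, so pruning or compressing the context cannot change it.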
Resilience Mechanics (AEP-0008)
Watchdog (R1): After every tool execution, the runtime hashes the sandbox. If hash == previous_hash, it increments a loop counter. When counter >= threshold (default 3), the task is ejected with WATCHDOG_LOOP. This catches cycles without state progress, not just call count.
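The loop check described above can be sketched as follows. This is a minimal sketch under stated assumptions: the sandbox is modelled as a JSON-serializable dict, SHA-256 is an arbitrary choice of hash, and resetting the counter when the state changes is an assumption the text does not spell out. `Watchdog`, `hash_sandbox`, and `WatchdogLoop` are hypothetical names, and the sketch compares only against the previous hash, not a wider hash window.

```python
import hashlib
import json
from typing import Optional


class WatchdogLoop(Exception):
    """Raised when the sandbox hash repeats `threshold` times in a row."""


def hash_sandbox(sandbox: dict) -> str:
    # Canonical JSON (sorted keys) so equal states always hash equally.
    payload = json.dumps(sandbox, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Watchdog:
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.previous_hash: Optional[str] = None
        self.loop_count = 0

    def check(self, sandbox: dict) -> None:
        """Call after every tool execution."""
        current = hash_sandbox(sandbox)
        if current == self.previous_hash:
            self.loop_count += 1
        else:
            self.loop_count = 0  # assumption: any state progress resets the counter
        self.previous_hash = current
        if self.loop_count >= self.threshold:
            raise WatchdogLoop("WATCHDOG_LOOP")
```

With the default threshold, the first call records a baseline and the third consecutive call that leaves the state unchanged raises `WatchdogLoop`, regardless of how many tool calls that made progress came before it.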
ACID transactions (R4, R6): Every mutation goes through schema validation. On violation:
- Rollback via WAL replay (R6 restores previous sandbox)
- Structured error injected in R4: `{code, expected schema, received payload, recovery hint}`
- Runtime returns R4 as tool result - model parses and self-corrects
- Three consecutive rollbacks on the same tool → watchdog abort
Net effect: invalid JSON never touches the filesystem. Corrupted state is reverted before any external process reads it.
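A simplified sketch of the validate, rollback, and report cycle follows. It is not the AEP runtime: the sandbox is an in-memory dict, the "WAL" is a single pre-image snapshot instead of a replayable log, and the schema is a bare name-to-type mapping. `TransactionalSandbox`, `execute`, and the error codes are hypothetical names chosen for illustration.

```python
import copy
from typing import Any, Callable, Dict, List, Optional

MAX_ROLLBACKS = 3  # from the text: three consecutive rollbacks on the same tool


class SchemaViolation(Exception):
    pass


class WatchdogAbort(Exception):
    pass


class TransactionalSandbox:
    def __init__(self) -> None:
        self.files: Dict[str, str] = {}
        self.wal: List[Dict[str, str]] = []            # stand-in for R6
        self.error: Optional[Dict[str, Any]] = None    # stand-in for R4
        self.rollbacks: Dict[str, int] = {}

    def execute(self, tool: str, params: Dict[str, Any],
                schema: Dict[str, type], mutate: Callable[..., None]) -> Dict[str, Any]:
        self.wal.append(copy.deepcopy(self.files))     # log pre-image before applying
        try:
            bad = [k for k, t in schema.items() if not isinstance(params.get(k), t)]
            if bad:
                raise SchemaViolation(f"invalid or missing fields: {bad}")
            mutate(self.files, **params)
        except Exception as exc:
            self.files = self.wal.pop()                # rollback to the pre-image
            self.rollbacks[tool] = self.rollbacks.get(tool, 0) + 1
            violated = isinstance(exc, SchemaViolation)
            self.error = {
                "code": "SCHEMA_VIOLATION" if violated else "APPLY_FAILED",
                "expected": {k: t.__name__ for k, t in schema.items()},
                "received": params,
                "hint": str(exc),
            }
            if self.rollbacks[tool] >= MAX_ROLLBACKS:
                raise WatchdogAbort(tool) from exc
            return self.error                          # returned to the model as the tool result
        self.wal.pop()                                 # commit: discard the pre-image
        self.rollbacks[tool] = 0
        self.error = None
        return {"ok": True}


def write_file(files: Dict[str, str], path: str, content: str) -> None:
    files[path] = content


box = TransactionalSandbox()
schema = {"path": str, "content": str}
ok = box.execute("write_file", {"path": "config.json", "content": "{}"}, schema, write_file)
err = box.execute("write_file", {"path": "config.json", "content": None}, schema, write_file)
```

After the second call, `box.files` still holds the value written by the first one, and `err` carries the structured error the model would receive.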
Benchmark - Controlled Methodology
Pipeline: agent transforms 20 CSV spreadsheets (diverse schemas, mixed encoding, up to 15 columns) from natural language instructions. Baseline: same agent + same model (Claude 3.5 Sonnet, max_iterations=90) without AEP runtime. n=50 per arm, shuffled, temp=0.
| Metric | Baseline | AEP Runtime | Delta |
|---|---|---|---|
| Tokens consumed (mean) | 312,450 | 62,890 | -79.9% |
| Schema accuracy (first call) | 64% | 86% | +22pp |
| Loop rate (>=3 cycles, no progress) | 18% | 0% | -18pp |
| Post-execution file corruption | 2 cases | 0 cases | -100% |
| Wall time (mean) | 8m42s | 2m13s | -74.5% |
Methodology notes (read before citing):
- 95% CI for tokens: ±4.2% baseline, ±3.1% AEP
- Only spreadsheet pipeline tested - no code gen, web scraping, or pure CoT data yet
- Schema accuracy measures payload-passing validation, not semantic output correctness
- Full fixture set + run script at `benchmark/fixtures/` and `benchmark/run_benchmark.sh`
The ~80% token reduction breaks down, in percentage points of baseline tokens, as roughly 55 from context compression (tool messages pruned after WAL confirm), 20 from loop elimination, and 5 from fast rollback (1-2 iterations vs 5-8).
What Exists Today
Repo: https://github.com/ferreiratechnology2025-max/CogniX
The core spec (AEP-0001 through AEP-0012) is frozen at v1.1.0.
The Compliance Kit (compliance/) has 11 YAML tests:
- Watchdog ejection on N cycles with no state delta
- Rollback restores sandbox after invalid schema
- R4 captures structured error with code
- WAL persists before `apply()`
- Task isolation in concurrent execution
- Context budget ejection at limit
- Watchdog bypass for idempotent tools
- Rollback does not affect unrelated sandbox state
- Forced hash collision behavior
- WAL lock contention timeout
- Independent benchmark reproducibility
What We're Asking the Community
- Audit the spec (AEP-0001 through 0012 in `spec/`). If R6 transaction semantics don't match your use case, open an issue describing the gap.
- Port the runtime to Rust or Go. The Python runtime is a POC; the spec is language-agnostic.
- Run the benchmark independently. `run_benchmark.sh` takes ~3 minutes on commodity hardware.
The protocol is published. The tests are available. The engineering speaks for itself.