DEV Community 1h ago

Agent Execution Protocol v1.1 - A microkernel runtime for LLM agents with watchdog timers and ACID transactions

Problem

Current LLM agent frameworks treat the chat history as the single source of truth for state. This is architecturally equivalent to a kernel persisting its state only through stdin/stdout logs. It works temporarily, but predictably fails under load.

Three measurable failure modes:

Undetected execution loops - no watchdog. The agent re-runs write_file('config.json', data) because the confirmation fell out of context. Tokens burn until max_iterations.
Silent state corruption - LLM emits invalid JSON for a tool call. Some frameworks swallow it and proceed with null. Others abort. None roll back the file system. A half-written file persists.
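Quadratic token cost - the full history is resent on every iteration, so cumulative prompt tokens grow as O(n²) in the number of iterations (on top of the O(n²) cost of attention over context length). No budgeting, no signal before truncation.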
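A rough way to see the quadratic growth, assuming (hypothetically) that every iteration appends a fixed number of tokens and that the whole history is resent on each call. The function name and the 500-token step size are illustrative, not taken from the benchmark:

```python
def cumulative_prompt_tokens(iterations: int, tokens_per_step: int, base_prompt: int = 0) -> int:
    """Total prompt tokens sent when the full history is resent on every call."""
    total = 0
    context = base_prompt
    for _ in range(iterations):
        context += tokens_per_step  # the history grows linearly...
        total += context            # ...and the whole history is sent again
    return total


if __name__ == "__main__":
    for n in (10, 20, 90):
        print(n, cumulative_prompt_tokens(n, tokens_per_step=500))
```

With no base prompt the total is tokens_per_step * n * (n + 1) / 2, so doubling the number of iterations roughly quadruples the bill.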

These aren't bugs. They are architectural consequences of treating a probabilistic system (LLM) as a general-purpose deterministic machine. With reported tool-call hallucination rates of 2-5% (ToolAlpaca, API-Bank), and assuming errors are independent, the chance of at least one bad call in 50 is roughly 64-92% (1 - 0.98^50 to 1 - 0.95^50), so relying on the model to self-manage state is untenable past ~50 tool calls.

The AEP Approach

Instead of state-in-context, we define a deterministic sandbox operated by a microkernel runtime with an 8-register address space (R0-R7):

Register	Function
R0	Program counter
R1	Watchdog timer (deadline, loop counter, state hash window)
R2	Context budget (tokens used, remaining)
R3	Sandbox state (content hash)
R4	Error register (structured stderr: code + payload)
R5	Schema registry (last tool + params)
R6	Transaction buffer (write-ahead log for rollback)
R7	Executive metadata (task_id, depth, cumulative tokens)

These registers do not live in the LLM context. They live in the runtime - Python, Rust, Go, whatever. The LLM only interacts with them through tool calls routed by the microkernel, never through chat messages. This decouples context (compressible) from operational state (exact).
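A minimal sketch of what "registers in the runtime" could look like in Python. The field names, the Registers and Microkernel classes, and the dispatch method are all assumptions made for illustration; the spec defines the register roles, not this layout:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Registers:
    """Hypothetical in-memory layout for R0-R7; field names are illustrative."""
    r0_program_counter: int = 0
    r1_watchdog: dict = field(
        default_factory=lambda: {"deadline": None, "loop_count": 0, "hash_window": []}
    )
    r2_budget: dict = field(default_factory=lambda: {"used": 0, "remaining": 0})
    r3_sandbox_hash: str = ""
    r4_error: Optional[dict] = None
    r5_last_call: Optional[dict] = None
    r6_wal: list = field(default_factory=list)
    r7_meta: dict = field(
        default_factory=lambda: {"task_id": "", "depth": 0, "cumulative_tokens": 0}
    )


class Microkernel:
    """Routes tool calls and keeps the registers outside the model's context."""

    def __init__(self, tools: dict[str, Callable[..., Any]]):
        self.tools = tools
        self.regs = Registers()

    def dispatch(self, tool: str, params: dict) -> Any:
        self.regs.r0_program_counter += 1
        self.regs.r5_last_call = {"tool": tool, "params": params}
        # Only the tool result goes back to the model; the registers stay here.
        return self.tools[tool](**params)
```

In this sketch the model only ever sees the value returned by dispatch, so the chat history can be pruned or summarized without touching the register state.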

Resilience Mechanics (AEP-0008)

Watchdog (R1): After every tool execution, the runtime hashes the sandbox. If hash == previous_hash, it increments a loop counter. When counter >= threshold (default 3), the task is ejected with WATCHDOG_LOOP. This catches cycles without state progress, not just call count.
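The watchdog rule can be sketched in a few lines. This is an illustration under stated assumptions: the sandbox is modeled as a dict of path to bytes, SHA-256 is the hash, and the counter resets whenever the state changes (the text above does not specify the reset behavior). The names hash_sandbox, Watchdog, and WatchdogLoop are hypothetical:

```python
import hashlib


class WatchdogLoop(Exception):
    """Raised when the sandbox hash has not changed for `threshold` checks in a row."""


def hash_sandbox(files: dict[str, bytes]) -> str:
    """Content hash over all files, independent of dict ordering."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


class Watchdog:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.previous_hash = None
        self.loop_count = 0

    def observe(self, files: dict[str, bytes]) -> None:
        """Call after every tool execution."""
        current = hash_sandbox(files)
        if current == self.previous_hash:
            self.loop_count += 1
        else:
            self.loop_count = 0  # assumption: any state change counts as progress
        self.previous_hash = current
        if self.loop_count >= self.threshold:
            raise WatchdogLoop("WATCHDOG_LOOP")
```

Because the check is on the state hash rather than the call count, an agent that rewrites config.json with identical contents trips the watchdog, while one that makes many distinct edits does not.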

ACID transactions (R4, R6): Every mutation passes schema validation. On violation:

Rollback via WAL replay (R6 restores previous sandbox)
Structured error injected in R4: {code, expected schema, received payload, recovery hint}
Runtime returns R4 as tool result - model parses and self-corrects
Three consecutive rollbacks on the same tool → watchdog abort

Net effect: invalid JSON never touches the filesystem. Corrupted state is reverted before any external process reads it.
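The rollback path above can be sketched as follows. This is a simplified illustration, not the reference runtime: the sandbox is an in-memory dict, the WAL holds undo records in memory rather than on disk, and WRITE_FILE_SCHEMA, apply_batch, and the SCHEMA_VIOLATION code are hypothetical names:

```python
import json
from typing import Any

_MISSING = object()  # marks "path did not exist before this write"
WRITE_FILE_SCHEMA = {"path": str, "content": str}  # hypothetical schema


class Transaction:
    def __init__(self, sandbox: dict[str, str]):
        self.sandbox = sandbox
        self.wal: list[tuple[str, Any]] = []  # undo records: (path, previous content)

    def write(self, path: str, content: str) -> None:
        # Log the previous value before applying the mutation.
        self.wal.append((path, self.sandbox.get(path, _MISSING)))
        self.sandbox[path] = content

    def rollback(self) -> None:
        # Replay the log in reverse to restore the previous sandbox.
        while self.wal:
            path, previous = self.wal.pop()
            if previous is _MISSING:
                self.sandbox.pop(path, None)
            else:
                self.sandbox[path] = previous


def parse_call(raw: str) -> dict:
    payload = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    for key, expected in WRITE_FILE_SCHEMA.items():
        if not isinstance(payload.get(key), expected):
            raise ValueError(f"field {key!r} must be {expected.__name__}")
    return payload


def apply_batch(sandbox: dict[str, str], raw_calls: list[str]) -> dict:
    """Apply every call or none; return a structured error on the first violation."""
    tx = Transaction(sandbox)
    for raw in raw_calls:
        try:
            payload = parse_call(raw)
        except ValueError as exc:
            tx.rollback()
            return {
                "code": "SCHEMA_VIOLATION",
                "expected": {k: t.__name__ for k, t in WRITE_FILE_SCHEMA.items()},
                "received": raw,
                "hint": str(exc),
            }
        tx.write(payload["path"], payload["content"])
    return {"code": "OK"}
```

The returned dict plays the role of R4: it is handed back as the tool result so the model can see what was expected and what it actually sent. A durable implementation would also persist each WAL record before applying the write.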

Benchmark - Controlled Methodology

Pipeline: agent transforms 20 CSV spreadsheets (diverse schemas, mixed encoding, up to 15 columns) from natural language instructions. Baseline: same agent + same model (Claude 3.5 Sonnet, max_iterations=90) without AEP runtime. n=50 per arm, shuffled, temp=0.

Metric	Baseline	AEP Runtime	Delta
Tokens consumed (mean)	312,450	62,890	-79.9%
Schema accuracy (first call)	64%	86%	+22pp
Loop rate (>=3 cycles, no progress)	18%	0%	-18pp
Post-execution file corruption	2 cases	0 cases	-100%
Wall time (mean)	8m42s	2m13s	-74.5%

Methodology notes (read before citing):

95% CI for tokens: ±4.2% baseline, ±3.1% AEP
Only spreadsheet pipeline tested - no code gen, web scraping, or pure CoT data yet
Schema accuracy measures payload-passing validation, not semantic output correctness
Full fixture set + run script at benchmark/fixtures/ and benchmark/run_benchmark.sh

The ~80% token reduction (79.9% measured) breaks down, in percentage points of baseline tokens, as roughly 55 from context compression (tool messages pruned after WAL confirm), 20 from loop elimination, and 5 from fast rollback (1-2 iterations vs 5-8).
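For anyone citing the table, the deltas can be rechecked from the reported means. The snippet below only redoes the arithmetic on the published figures; pct_change is a helper name introduced here:

```python
def pct_change(baseline: float, treated: float) -> float:
    """Relative change from baseline, in percent."""
    return (treated - baseline) / baseline * 100


token_delta = pct_change(312_450, 62_890)          # mean tokens consumed
wall_delta = pct_change(8 * 60 + 42, 2 * 60 + 13)  # mean wall time, in seconds

# Attribution of the token reduction, in percentage points of baseline tokens.
breakdown = {"context compression": 55, "loop elimination": 20, "fast rollback": 5}

if __name__ == "__main__":
    print(f"tokens: {token_delta:.1f}%")
    print(f"wall time: {wall_delta:.1f}%")
    print(f"breakdown total: {sum(breakdown.values())} pp")
```

The three components add up to 80 percentage points, which matches the measured 79.9% reduction after rounding.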

What Exists Today

Repo: https://github.com/ferreiratechnology2025-max/CogniX

The core spec (AEP-0001 through AEP-0012) is frozen at v1.1.0.

The Compliance Kit (compliance/) has 11 YAML tests:

Watchdog ejection on N cycles with no state delta
Rollback restores sandbox after invalid schema
R4 captures structured error with code
WAL persists before apply()
Task isolation in concurrent execution
Context budget ejection at limit
Watchdog bypass for idempotent tools
Rollback does not affect unrelated sandbox state
Forced hash collision behavior
WAL lock contention timeout
Independent benchmark reproducibility

What We're Asking the Community

Audit the spec (AEP-0001 through 0012 in spec/). If R6 transaction semantics don't match your use case, open an issue describing the gap.
Port the runtime to Rust or Go. The Python runtime is a POC; the spec is language-agnostic.
Run the benchmark independently. run_benchmark.sh takes ~3 minutes on commodity hardware.

The protocol is published. The tests are available. The engineering speaks for itself.
