I tested the 'deterministic agent loop' claims with four experiments. They all failed - including my own fix.
Experiment 1: Lexical Overlap ≠ Semantic Understanding
Mid-loop on turn 5 the user interjects: "actually, change it to X." Is this an addendum to the old task, or a brand-new task? The proposed fix: compute a "lexical overlap" score with two fixed thresholds - ≥0.24 means same task, ≤0.08 means new task, with the middle sent to the LLM. The claim is "80% decided by code, instantly."
Sounds engineering-grade. But lexical overlap reads characters, not meaning. I built 30 labeled pairs, applied its thresholds, ran three tokenizers. Result: 50% hard misclassification.
The worst cases:
- Current task: "continue writing the loop-engine article" / User interjects: "delete the loop-engine article" - Overlap 0.615 → judged same task. The user said delete; the engine decides "same as writing," and keeps writing. A reverse operation is treated as a continuation. This is incident-grade.
- Current task: "fix the checkout bug" / User interjects: "the payment page is throwing, can you look" - Overlap 0.000 → judged new task. Any human sees one task. Jaccard gives 0.
- Paraphrase fails entirely - 6/6 wrong.
- Cross-lingual is worse: 6 same-task EN/ZH pairs all score 0.000, all judged new. In any bilingual shop this mechanism collapses on contact.
A defender might say: "code makes a call in 90% of cases, above the 80% we promised." That's a bait-and-switch. The implicit promise of "80% decided by code" is "80% decided correctly." The reality: code issues a verdict in 27 cases and gets 12 right - 44% accuracy. Treating "decided" as "decided correctly" is the most dangerous rhetorical move in the whole design.
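The gap between "decided" and "decided correctly" is plain arithmetic on the counts above. A minimal sketch using only the numbers reported in this experiment (the variable names are mine, not the article's):

```python
# Counts reported in Experiment 1.
total_pairs = 30        # labeled pairs in the test set
code_verdicts = 27      # pairs where the thresholds issued a verdict without the LLM
correct_verdicts = 12   # verdicts that matched the label

coverage = code_verdicts / total_pairs                            # "decided by code"
accuracy = correct_verdicts / code_verdicts                       # "decided correctly"
misclassified = (code_verdicts - correct_verdicts) / total_pairs  # wrong verdicts over all pairs

print(f"coverage={coverage:.0%} accuracy={accuracy:.0%} misclassified={misclassified:.0%}")
# prints: coverage=90% accuracy=44% misclassified=50%
```

Coverage is the number the defender quotes. Accuracy is the number the reader assumed they were being promised.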
The thresholds only work on easy samples (high-overlap same-task, low-overlap new-task): 12/12 correct. The three "common but hard" categories - paraphrase, cross-lingual, antonym - go 0/16. Strongly suggests the thresholds were tuned on the easy set. Any non-trivial sample distribution breaks them immediately.
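For readers who want to see the mechanism fail in a few lines, here is a minimal sketch of a two-threshold overlap classifier. The 0.24/0.08 thresholds are the ones quoted above. Everything else is my assumption: the word-level tokenizer, the tiny stopword list, and the function names are illustrative, so the scores will not match the 0.615 reported above, which came from a different tokenizer.

```python
import re

# Thresholds quoted in the article under test.
SAME_TASK_MIN = 0.24
NEW_TASK_MAX = 0.08

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "can", "you"}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word tokens, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def classify(current_task: str, interjection: str) -> str:
    """Apply the two fixed thresholds; the middle band is deferred to the LLM."""
    score = jaccard(current_task, interjection)
    if score >= SAME_TASK_MIN:
        return "same_task"
    if score <= NEW_TASK_MAX:
        return "new_task"
    return "ask_llm"


# A reversal is read as a continuation (overlap 0.5 with this tokenizer).
print(classify("continue writing the loop-engine article",
               "delete the loop-engine article"))        # same_task
# One task, phrased with different words, is read as a new task (overlap 0.0).
print(classify("fix the checkout bug",
               "the payment page is throwing, can you look"))  # new_task
```

Swapping the tokenizer moves the scores around, but not the failure: "delete" and "continue writing" differ by one or two tokens however you slice them.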
Experiment 2: Temperature 0 ≠ Determinism
The article sets the evaluator to temperature 0.0, "output almost entirely determined," because "for the same input, the evaluation should be as consistent as possible." This is testable in one sentence: same prompt, temperature 0, run it 20 times, check consistency.
I ran three prompt categories on GLM-5.2, 20 runs each. Result: open-ended output is only 70% consistent; 30% diverges.
| Prompt type | Exact-match rate | Distinct versions |
|---|---|---|
| Math (most stable) | 100% | 1 |
| Structured listing | 95% | 2 |
| Open-ended creative | 70% | 5 |
The open-ended row is the killer - same prompt, temperature 0, 20 runs, 5 different versions, lowest pairwise similarity 0.198:
- "Always head Northbound for your daily cup of exceptional coffee."
- "Premium coffee for the journey ahead."
The two share almost nothing beyond the word "coffee." And the LLM-as-Judge evaluator outputs exactly this kind of open text - done / phase_done / reason / evidence. The article says "the evaluator isn't creative writing, it's judgment, so temperature must be 0." But the evaluator's reason and evidence fields are inherently open; measured divergence is on the same order as creative prompts.
Even "structured listing" is unstable: five adjectives in a different order. If evidence is a list and the order changes, downstream JSON changes, the decision changes. The only 100%-deterministic case is "17×23=391." Which proves the rule: temperature-0 determinism holds only when the answer space is razor-thin. The moment the output has any openness, determinism breaks. Treating a narrow special case as a universal property is overgeneralization.
Evaluator reproducibility is the foundation of the entire loop engine. Unstable evaluation → unstable done signal to the phase gate → unstable decision state machine. The foundation shakes, and ten layers of "deterministic constraints" stacked on top are standing on a shaking base.
(Only tested one provider, GLM-5.2. But the article's claim is universal, so single-provider falsification suffices. OpenAI's temp-0 non-determinism is documented and independently confirmed; more providers would only strengthen this.)
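The measurement itself is simple enough to rerun against any provider. A minimal sketch, under assumptions: `generate` is a hypothetical callable that wraps your own client with temperature set to 0, "exact-match rate" is taken to mean the share of runs equal to the most common output, and similarity is computed with Python's `difflib.SequenceMatcher`, which may not be the metric behind the 0.198 figure above.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def consistency_report(generate: Callable[[str], str], prompt: str, runs: int = 20) -> dict:
    """Call generate(prompt) `runs` times and summarize how much the outputs agree."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outputs = [generate(prompt) for _ in range(runs)]
    counts = Counter(outputs)
    modal_count = counts.most_common(1)[0][1]
    # Lowest character-level similarity between any two distinct outputs;
    # defaults to 1.0 when every run produced the same text.
    lowest = min(
        (SequenceMatcher(None, a, b).ratio() for a, b in combinations(counts, 2)),
        default=1.0,
    )
    return {
        "exact_match_rate": modal_count / runs,
        "distinct_versions": len(counts),
        "lowest_pairwise_similarity": lowest,
    }
```

Run it once per prompt category. If the evaluator's reason and evidence fields come back with more than one distinct version, the "same input, same evaluation" assumption is already gone.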
Experiment 3: Phase Gate ≠ Task Completion
The most confident line in the genre: "task completion, transformed from an LLM's self-claim into a verifiable objective fact." The phase gate checks four things: did the script exit 0, does the file exist, is the file count met, is there a user-confirmation record. All in code, all checking "objective facts."
The problem - these checks verify that an action happened, not that the result is correct. I implemented the phase gate per the article's description and built 8 scenarios: 4 with correct content, 4 with garbage content that still satisfies the gate. Result: all 8 pass the gate (100% pass rate), only 4 have correct content (50%), so half of the gate's "complete" verdicts are false positives.
The four false positives, in their own words:
| Task | Actual output | Gate verdict |
|---|---|---|
| Write a research brief | "I am a little duck, quack quack." | ✅ pass → "complete" |
| Draft covering ≥3 mechanisms | "." (a single period) | ✅ pass → "complete" |
| Generate 3 chapter files | 3 files containing "TODO" | ✅ pass → "complete" |
| Run the tests | 0 passed (no tests collected), exit 0 | ✅ pass → "complete" |
A duck, a period, TODO, zero test cases - the phase gate waves all of them through. It has zero discrimination on content correctness. This isn't an implementation bug: the four checks the article describes never read content by construction, so any faithful implementation has the same blind spot.
Exit 0 means the process didn't crash, not that the result is right. File-exists means the path is there, not that the content meets the requirement. Packaging "file exists / script ran" as "task complete" is an over-extension of the claim.
The truth: the phase gate turns "an action happened" into an objective fact. It does not turn "the task is done" into an objective fact. Between those two lies a semantic gap it cannot cross. That gap is called content quality - which is exactly what production users care most about.
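The blind spot is easy to reproduce. Below is a minimal sketch of a gate built from the four checks described above. The function signature, parameter names, and file name are my own assumptions, not the article's implementation.

```python
import tempfile
from pathlib import Path


def phase_gate(exit_code: int, required_files: list[Path], output_dir: Path,
               min_file_count: int, user_confirmed: bool) -> bool:
    """Four 'objective fact' checks. None of them opens a file."""
    script_ok = exit_code == 0
    files_exist = all(p.exists() for p in required_files)
    count_ok = output_dir.is_dir() and (
        sum(1 for p in output_dir.iterdir() if p.is_file()) >= min_file_count
    )
    return script_ok and files_exist and count_ok and user_confirmed


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    brief = out / "research_brief.md"  # hypothetical deliverable name
    brief.write_text("I am a little duck, quack quack.")
    passed = phase_gate(exit_code=0, required_files=[brief], output_dir=out,
                        min_file_count=1, user_confirmed=True)

print(passed)  # True: the gate never reads the duck
```

Any content check you bolt on (minimum length, required headings, a non-zero collected-test count) narrows the gap but is still a proxy. It verifies that the output has a shape, not that it is right.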
Three Pillars, All Cracked
The genre's thesis sentence: "stack deterministic constraints on top of the LLM's uncertainty." Now all three "determinisms" are punched through by measurement:
| Pillar | Article claim | Measured | Status |
|---|---|---|---|
| Lexical overlap = semantics | "80% decided by code" | 50% misclassified, 44% accuracy | ❌ |
| Temperature 0 = determinism | "almost entirely determined" | Open output 70% consistent | ❌ |
| Phase gate = task completion | "verifiable objective fact" | 50% false positives | ❌ |
All three foundation layers leak. The ten layers of constraints above stand on a leaking base. The 7000 lines of Rust are probably real. But they guard the symbolic layer - string matching, file paths, exit codes. The semantic layer (intent, content, quality) is still running naked.
Why This Genre Goes Viral
It lands precisely on the anxiety of readers who've built a demo but never hit production. To someone who hasn't run an LLM system in production, the mechanism pile feels heavyweight and authoritative - they haven't seen these practices, and don't know they fail at the semantic layer.
Anyone who has run production reads it and thinks "the names are nicer than the contents":
- Pre-AL gate is prompt-injected state
- Temperature-0 LLM-as-Judge is evaluator hygiene
- "Determinism-first" is try/catch plus string matching
- Phase gate is validation logic
- Ten priority levels are an if-else chain
Every mechanism is correct and worth doing - but naming each one with a proprietary term to manufacture the impression of "an original framework" is rebranding, not innovation.
The harder wound: these articles open with "not pseudocode, not a concept diagram," then deliver zero lines of real code - only function names, constants, parameter values. Those are identifiers, not code. The promise isn't kept. And the thing repeatedly cited as evidence of "production-grade" - "7000+ lines" - appears three times. Line count is the worst proxy for quality. A system that actually runs in production should produce SLO data, postmortems, load-test curves - not line counts.
Fourth Cut: I Lied Too
The first three cuts target the genre's three pillars of "determinism." Data speaks; all three break. But I have to be honest here: I had a "constructive upgrade" ready behind those three cuts - embedding to upgrade lexical overlap, multi-vote to patch temperature 0, a second LLM to backstop the phase gate. I thought it would lift the article from "criticism" to "construction."
I was wrong. That proposal has the same disease as the articles it criticizes: using complicated engineering to fake a semantic solution.
I ran an experiment to convince myself. Not on the target - on my own proposal. I used Qwen3-embedding:0.6b (a real neural embedding model, 1024 dimensions) on the exact same synonymy-vs-antonymy separation test. Result:
| Category | Mean | Min | Max |
|---|---|---|---|
| Synonyms (should be high) | 0.766 | 0.490 | 0.977 |
| Antonyms (should be mid-low) | 0.739 | 0.582 | 0.881 |
| Unrelated (should be low) | 0.326 | 0.237 | 0.404 |
Synonyms (0.766) and antonyms (0.739) differ by 0.026 - too close to separate. "optimize code performance" vs "don't optimize code performance" - cosine 0.881, higher than 10 of the 12 synonym pairs. "build a login-registration feature" vs "add the account-auth piece" (these are synonyms) - cosine 0.490, lower than nearly every antonym pair.
The only separation a neural embedding can do is "related vs unrelated" - synonyms/antonyms both sit around 0.75, unrelated drops to 0.326. But the moment the topic is the same and the direction is opposite, embedding fails exactly like Jaccard.
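One way to see why no cutoff can help is to compare the score ranges directly: a single threshold separates two categories only if every score in one sits above every score in the other. A small sketch using the min/max values from the table above (the function and dictionary names are mine):

```python
# Min/max cosine similarities reported in the Experiment 4 table.
RANGES = {
    "synonyms": (0.490, 0.977),
    "antonyms": (0.582, 0.881),
    "unrelated": (0.237, 0.404),
}


def threshold_separates(high: str, low: str) -> bool:
    """True if one cutoff puts every `high` score above every `low` score."""
    high_min, _ = RANGES[high]
    _, low_max = RANGES[low]
    return high_min > low_max


print(threshold_separates("synonyms", "antonyms"))   # False
print(threshold_separates("synonyms", "unrelated"))  # True
print(threshold_separates("antonyms", "unrelated"))  # True
```

The antonym range sits entirely inside the synonym range, so any threshold that accepts the synonyms accepts the antonyms with them.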
So the entire separation chain - characters to statistics to neural vectors - fails by measurement:
- Jaccard (Exp 1): 50% misclassified. Cannot separate.
- TF-IDF char 2-gram: synonyms 0.072, antonyms 0.222 - direction reversed. Fails.
- Qwen3-embedding (Exp 4): synonyms 0.766, antonyms 0.739, diff 0.026. Fails.
My "embedding upgrade" doesn't survive this data. I'm deleting it and replacing it with the honest version.
Honest Conclusion: Under the Current Stack, This Problem Has No Engineering Solution
The genre's three "determinism" pillars all collapse. My attempt to patch them with embedding, multi-vote, and a second LLM also fails:
- Embedding cannot separate synonymy from antonymy - same topic, opposite direction produces near-identical vectors.
- A second LLM doesn't fix the first one's unreliability - the inspector itself hallucinates; it just shifts the problem up one layer.
So: when a user interjects something directionally ambiguous (new task or addendum? same direction or opposite?) into the current topic, engineering should not let an algorithm decide unilaterally. Detect topic overlap, then ask the human. Don't auto-adjudicate.
This isn't cowardice. It's an honest choice of objective function: correctness outranks autonomy. If you want an unattended autonomous agent - neither the genre's design nor mine gets you there today. If you must guarantee no misclassification - human confirmation is the only known strategy.
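The "detect topic overlap, then ask the human" policy could be sketched as below. This is an illustration under assumptions, not a tested design: `ask_user` is a hypothetical callback into your UI, the `Decision` values are my own labels, and the 0.45 cutoff is simply a value between the highest unrelated score (0.404) and the lowest synonym score (0.490) in my small Experiment 4 sample, so re-measure it on your own data.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ADDENDUM = "addendum"   # refine the current task
    NEW_TASK = "new_task"   # start something else
    REVERSAL = "reversal"   # undo or cancel the current task


# Hypothetical cutoff for "same topic"; see the caveat in the lead-in.
RELATED_CUTOFF = 0.45


def resolve_interjection(topic_similarity: float, interjection: str,
                         ask_user: Callable[[str], Decision]) -> Decision:
    """Use similarity only to detect 'same topic'; never to decide direction."""
    if topic_similarity >= RELATED_CUTOFF:
        # Same topic: addendum, reversal, or replacement are indistinguishable
        # to the score, so the human decides.
        return ask_user(interjection)
    # Off-topic interjections are treated as new tasks (an assumption that
    # leans on the related-vs-unrelated separation measured in Experiment 4).
    return Decision.NEW_TASK
```

The similarity score is demoted from judge to tripwire: it only decides whether a question gets asked.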
"LLM does symbolic-layer work; humans override on semantic judgment" isn't sexy. But it doesn't lie.
The Question to Ask Before Implementing
If you read one of these articles and are about to build a similar system, ask yourself first: Can your task's output be objectively verified for correctness - not just existence?
If "no" (most content-generation, analysis, and conversational tasks are no), most of the genre's design doesn't apply to you. You need strong human review, cross-model verification, and user-feedback loops - not file-existence checks.
If "yes," still re-tune the parameters yourself, redesign the acceptance criteria, and reserve plenty of human-fallback channels. Don't copy 0.24/0.08. Don't trust temperature 0 to give you determinism. Don't assume a passed phase gate means the task is done. Don't assume swapping in an embedding model buys you semantics.
Each of those four "don'ts" has measured data behind it.
Reproducible Scripts
All four scripts are public, one-click runnable, no cherry-picking. Swap in your own business data and rerun.
Repo: github.com/zxpmail/blog → agent-determinism-illusions/scripts:
- Exp 1 (local, no API): lexical-overlap-test.py - 30 labeled pairs against the