Logic-Graph Verification System
Logic-Graph Verification System
Status: Draft v1 Β· 2026-07-02
Author: design session (Claude Fable 5)
Implementer: any capable coding model, per SYSTEM-DESIGN.md
1. One-liner
Convert the cheap token throughput of open-weight thinking models into verified conclusions, by externalizing their reasoning as a graph, sampling it N independent times, and letting mechanical graph checks - not model vibes - decide what survives.
2. Problem
Self-correction by prompting fails. Re-feeding a model its own chain-of-thought ("let's revisit...") makes it anchor to its own tokens and rubber-stamp itself. Known result (Huang et al., "LLMs Cannot Self-Correct Reasoning Yet").
tokens/sec is a garbage metric. A cheap model emitting 1M tokens of confident nonsense is worth less than nothing. The metric that matters is correct conclusions per dollar.
Verification-by-LLM inherits the problem. Asking a model "is this true?" just moves the vibes one level up. We need checks that are mechanical - computable, deterministic, not model-judged.
Cheap models confabulate from pretraining. A system whose claims must trace to imported sources cannot tolerate a model that silently answers from its weights.
3. Core insights (the product thesis)
Shape is checkable even when truth isn't. An argument externalized as a directed graph admits mechanical checks: unsupported premises (orphans), circular reasoning (cycles), disconnected conclusions (reachability), refuted-claims-still-load-bearing. "Truthiness" becomes topology.
Confidence in a conclusion β how many node-disjoint chains connect evidence to it (Menger's theorem). One road to Rome = fragile; three independent roads = an adversary must break three separate claims.
Independence is sampled, not requested. Asking one context for "3 independent arguments" yields the same argument paraphrased 3 times. Asking N independent inference calls (fresh context, temperature > 0) yields genuinely decorrelated attempts - the detective's technique: separate interrogations, then cross-examine. Contradictions between runs are the interrogation slip-ups; they are invisible inside any single transcript.
Direct attention, don't diffuse it. "Revisit your reasoning" makes a model skim everything shallowly. "Edge 7 is load-bearing and unverified" makes it spend its whole budget on the right spot. Min-cut/betweenness computes where that spot is.
The formalization tax is a feature. A vague step becomes visibly vague when it must be written as a typed node with typed edges.
4. Users & use cases
Primary user: a developer/researcher running multi-step reasoning or research-over-documents workloads on cheap open-weight models (via OpenRouter or local inference), who needs reliability without frontier-model prices.
Use cases:
- U1: Multi-hop question answering over provided documents, with an auditable argument graph as output.
- U2: Claim verification - "here is a conclusion; how much would have to be wrong for it to fail?"
- U3: As an MCP tool available to any agent, giving it external working memory for arguments plus structural self-checks.
5. Goals & success metrics
| # | Goal | Metric | Target |
|---|---|---|---|
| G1 | Screening identifies grounded models | screen produces grounded% / memorized% / abstain% per model | screen runs end-to-end < $1 |
| G2 | System beats naive self-correction | accuracy on eval set vs "revisit" and verbatim re-injection arms | β₯ +5pp at equal or lower cost |
| G3 | Cheap ensemble beats one big call | correct-per-dollar vs one frontier-class call | β₯ 2Γ correct-per-dollar |
| G4 | Confidence is calibrated by structure | conclusions with support width β₯ 2 are correct more often than width 1 | monotone relationship observed |
| G5 | Auditable output | every conclusion ships with its graph: surviving claims, width, disputed nodes | 100% of runs |
6. Non-goals
- Not a truth oracle. The graph verifies argument shape; every edge is semantically vouched for only by a model. Validly-connected garbage passes structure checks - by design, the structural layer only decides where semantic effort (re-interrogation) goes.
- No fine-tuning, no inference-loop surgery. Everything works through standard chat-completions + tool calls. No custom sampling, no template hacking, no stop-at
</think>interception. - No persistence. Graphs are session-scoped and ephemeral (in-memory). No database.
- Not a general theorem prover. Natural-language claims, heuristic dedup - not formal logic.
System Design
1. Architecture
Two deployment modes for the graph server, same codebase:
- Library mode (default for orchestrator): orchestrator imports
graph_server.storedirectly, no MCP transport. Simpler, faster, use for P3/P4. - MCP mode:
server.pyexposes the same functions as MCP tools over stdio, so any MCP-capable agent (Claude Code, etc.) can use the graph as external working memory. Thin wrapper only - zero logic in the transport layer.
2. Component A - Grounding screen (P1, exists)
Already implemented at repo root: screen.py, items.jsonl (50 items), models.json. Contract: temperature 0, 1 call/item/model, regex-graded. See README.md. Output feeds one decision: which model id the orchestrator defaults to. Do not modify except to fix bugs.
3. Component B - Graph server (P2)
3.1 Data model
In-memory only. One GraphStore holds many independent graphs keyed by graph_id (string, caller-supplied).
Node:
id: str - caller-supplied, unique within graph, e.g. "r2:n4" (run 2, node 4)claim: str - natural-language claim, one sentencetype: str - "given" | "inference" | "assumption" | "conclusion"confidence: float - 0.0-1.0, model-stated, default 0.8run_ids: set[str] - provenance: which interrogation runs asserted this claimrefuted: bool - default False, set only viamark_refutedrefute_reason: str | Nonealiases: list[str] - claims of nodes merged into this one (dedup provenance)
Edge:
src: str,dst: strrelation: str - "supports" | "attacks" | "assumes"confidence: float - 0.0-1.0, default 0.8run_ids: set[str]
Backed by one networkx.MultiDiGraph per graph_id (multi because supports and attacks may both exist between the same pair). Node attrs hold the Node fields; edge attrs hold relation/confidence/run_ids.
Validation on ingest (reject with error message, don't crash): unknown type/relation, edge endpoints not in graph after this ingest, confidence outside [0,1], duplicate node id with different claim text (same id + same claim = idempotent, merge run_ids).
3.2 Public functions / MCP tools
Eight functions. Signatures below are the library API; MCP tool schemas mirror them 1:1 with graph_id always the first param.
| # | Function | Params | Returns |
|---|---|---|---|
| 1 | assert_graph |
graph_id, nodes[], edges[] | {accepted_nodes, accepted_edges, rejected: [{item, reason}], auto_merged: [[kept_id, merged_id], ...]} - runs dedup (Β§3.3) on the new nodes against existing ones automatically |
| 2 | merge_duplicates |
graph_id, jaccard_threshold=0.7, ratio_threshold=0.85 | {merges: [[kept_id, merged_id]], contradictions_created: [[id_a, id_b]]} - full-graph re-pass |
| 3 | check_structure |
graph_id, conclusion_id | {orphans: [id], cycles: [[id,...]], unreachable_conclusion: bool, refuted_but_feeding: [id]} |
| 4 | critical_links |
graph_id, conclusion_id | {min_cut_nodes: [id], bridge_edges: [[src,dst]], ranked: [{edge, betweenness, min_confidence_on_edge}]} sorted least-confident-first |
| 5 | support_width |
graph_id, conclusion_id | {disjoint_paths: int, paths: [[id,...]], max_flow: float} |
| 6 | surviving_claims |
graph_id | {in: [id], out: [id], undecided: [id], surviving: [id]} (Β§3.5) |
| 7 | mark_refuted |
graph_id, node_id, reason | {ok, width_before, width_after} for the graph's conclusion node if one exists |
| 8 | disputed_nodes |
graph_id, conclusion_id | {contradiction_pairs: [[id_a, id_b]], isolated_load_bearing: [{id, run_count, on_path}]} (Β§3.6) |
Every function MUST be pure-read except 1, 2, 7. Every function MUST return JSON-serializable dicts. Errors return {error: "..."}, never raise across the MCP boundary.
3.3 Dedup / merge (THE load-bearing component)
Purpose: identical claims from different runs MUST collapse to one node, otherwise (a) support width is inflated by paraphrase and (b) contradictions never meet each other.
Normalization norm(claim):
- lowercase; unicode NFC
- strip punctuation except
.inside numbers and% - canonicalize numbers: remove thousands separators (
84,200β84200) - collapse whitespace
- tokenize on whitespace; drop English stopwords (small fixed list: a, an, the, is, are, was, were, of, in, on, at, to, that, this, it, and)
Negation guard (checked BEFORE similarity): let NEG = {not, no, never, cannot, "n't", without, false}. If tokens(a) - NEG == tokens(b) - NEG but the two differ in negation-token count parity β DO NOT merge; instead create mutual attacks edges between them and report in contradictions_created.
Numeric-conflict guard: if token sets are equal except for differing numeric tokens (e.g. "trellium melts at 412" vs "trellium melts at 350") β do not merge; create mutual attacks edges and report.
Similarity (only if guards pass): merge iff jaccard(tokens_a, tokens_b) >= 0.7 OR difflib.SequenceMatcher(None, norm_a, norm_b).ratio() >= 0.85. Both thresholds tunable via config. (v1 upgrade path, out of scope: cosine over a small sentence-embedding model; keep the interface identical.)
Merge policy: keep the node with the earliest assertion (stable order: first run_id, then node id). Union run_ids and aliases; keep max confidence; re-point all edges to the kept node; collapse resulting parallel edges of the same relation (max confidence, union run_ids). given type wins over inference wins over assumption if merged nodes disagree on type. Never merge a refuted=True node into a live one - merge live into refuted keeps refuted.
Clustering: union-find over pairwise matches (O(nΒ²) pairwise is fine; graphs are hundreds of nodes, not millions).
3.4 Structural checks (networkx recipes)
Work on the supports+assumes subgraph (attacks edges excluded) unless stated. Build D = nx.DiGraph view collapsing the multigraph, keeping max confidence per (src,dst).
- orphans: nodes with in-degree 0 that are not
type=givenand nottype=assumption(assumptions are declared floats; still reported separately asassumptionsin the payload - implementer: includekey assumptions: [id]). - cycles:
list(nx.simple_cycles(D)), cap output at first 10. - unreachable_conclusion: conclusion not in
descendants(D, g)for any giveng. Compute once:reachable = union of nx.descendants(D, g) | {g}for all givens; unreachable iff conclusion not in reachable. - refuted_but_feeding: nodes with
refuted=Truethat still have a path to the conclusion in D. - critical_links:
nx.minimum_node_cut(D_aug, s, t)and betweenness vianx.betweenness_centrality_subset(D, sources=givens, targets=[conclusion]). For s-t computations with multiple sources add a virtual super-source__S__with edges to every given (confidence 1.0), and use conclusion as sink; NEVER report__S__in results. - Bridge edges: edges whose removal makes conclusion unreachable (test each edge on the paths; graphs are small, brute force is fine).
3.5 support_width & surviving_claims
support_width (Menger): on D_aug (super-source __S__ β all non-refuted givens): disjoint_paths = len(list(nx.node_disjoint_paths(D_aug, "__S__", conclusion))) - strip __S__ from reported paths. Handle NetworkXNoPath β 0.
max_flow: capacities = edge confidence; node capacities = node confidence (implement via standard node-splitting: v β v_in, v_out with capacity = confidence). nx.maximum_flow_value on the split graph. Givens and __S__ edges get capacity β (use 1e9).
surviving_claims (grounded semantics over attacks, then reachability):
- A = subgraph with attacks edges only (all nodes, attack edges)
- label all nodes UNDEC
- repeat until no change:
- node -> IN if every attacker of it is OUT (vacuously true if no attackers)
- node -> OUT if some attacker of it is IN
- refuted nodes are forced OUT before the loop starts
surviving = { n : label(n) != OUT and (type(n) == "given" or n reachable from some IN/UNDEC given via supports/assumes edges through only non-OUT nodes) }- Return the full labelling plus surviving.
3.6 disputed_nodes
- contradiction_pairs: all pairs connected by mutual attacks created by the guards in Β§3.3, plus any pair the caller asserted with explicit mutual attacks.
- isolated_load_bearing: nodes with
len(run_ids) == 1AND (betweenness_subset > 0 from givens to conclusion OR node lies on any givenβconclusion simple path). These are the prime confabulation suspects; the orchestrator re-interrogates them first.
4. Component C - Orchestrator (P3)
CLI: python -m orchestrator.run --task task.json --model <id> --n 6 --k 2 --budget-calls 20 --temp 0.8
task.json: {"question": str, "documents": [str], "expected_answer": str|null} (expected_answer only used by the eval harness).
4.1 Control loop
FAN OUT: N parallel interrogation calls (fresh context each, temperature=temp). Each call gets: system prompt (Β§4.2) + documents + question. Parse each response's JSON graph (Β§4.3).
assert_graphwithrun_id="r{i}", node ids prefixed"r{i}:".MERGE:
merge_duplicates(graph_id).ASSESS:
check_structure,support_width(conclusion),disputed_nodes. The conclusion node: each run marks exactly one nodetype=conclusion; after merge there should be 1..N conclusion nodes. If >1 cluster of conclusions survive dedup, the runs DISAGREE on the answer itself -> treat each cluster as a candidate; the report ranks candidates by support width.RE-INTERROGATE (while budget remains and disputed set nonempty): For each disputed node (contradiction pairs first, then isolated load-bearing, max 3 per round): send M=3 fresh verification calls (Β§4.2 template B: the claim + documents ONLY - no graph, no prior transcript). Majority verdict:
- refuted by >=2 of 3 ->
mark_refuted(node, reason) - confirmed by >=2 of 3 -> bump confidence to
max(old, 0.9), add run_id"v*"
- refuted by >=2 of 3 ->
STOP when: (a) no disputed nodes remain, or (b) all conclusion candidates stable AND top candidate width >= k, or (c) budget-calls exhausted.
Comments
No comments yet. Start the discussion.