DEV Community 4h ago

Logic-Graph Verification System

Status: Draft v1 · 2026-07-02
Author: design session (Claude Fable 5)
Implementer: any capable coding model, per SYSTEM-DESIGN.md

1. One-liner

Convert the cheap token throughput of open-weight thinking models into verified conclusions, by externalizing their reasoning as a graph, sampling it N independent times, and letting mechanical graph checks - not model vibes - decide what survives.

2. Problem

Self-correction by prompting fails. Re-feeding a model its own chain-of-thought ("let's revisit...") makes it anchor to its own tokens and rubber-stamp itself. Known result (Huang et al., "LLMs Cannot Self-Correct Reasoning Yet").

tokens/sec is a garbage metric. A cheap model emitting 1M tokens of confident nonsense is worth less than nothing. The metric that matters is correct conclusions per dollar.

Verification-by-LLM inherits the problem. Asking a model "is this true?" just moves the vibes one level up. We need checks that are mechanical - computable, deterministic, not model-judged.

Cheap models confabulate from pretraining. A system whose claims must trace to imported sources cannot tolerate a model that silently answers from its weights.

3. Core insights (the product thesis)

Shape is checkable even when truth isn't. An argument externalized as a directed graph admits mechanical checks: unsupported premises (orphans), circular reasoning (cycles), disconnected conclusions (reachability), refuted-claims-still-load-bearing. "Truthiness" becomes topology.

Confidence in a conclusion ∝ how many node-disjoint chains connect evidence to it (Menger's theorem). One road to Rome = fragile; three independent roads = an adversary must break three separate claims.

Independence is sampled, not requested. Asking one context for "3 independent arguments" yields the same argument paraphrased 3 times. Asking N independent inference calls (fresh context, temperature > 0) yields genuinely decorrelated attempts - the detective's technique: separate interrogations, then cross-examine. Contradictions between runs are the interrogation slip-ups; they are invisible inside any single transcript.

Direct attention, don't diffuse it. "Revisit your reasoning" makes a model skim everything shallowly. "Edge 7 is load-bearing and unverified" makes it spend its whole budget on the right spot. Min-cut/betweenness computes where that spot is.

The formalization tax is a feature. A vague step becomes visibly vague when it must be written as a typed node with typed edges.

4. Users & use cases

Primary user: a developer/researcher running multi-step reasoning or research-over-documents workloads on cheap open-weight models (via OpenRouter or local inference), who needs reliability without frontier-model prices.

Use cases:

U1: Multi-hop question answering over provided documents, with an auditable argument graph as output.
U2: Claim verification - "here is a conclusion; how much would have to be wrong for it to fail?"
U3: As an MCP tool available to any agent, giving it external working memory for arguments plus structural self-checks.

5. Goals & success metrics

#	Goal	Metric	Target
G1	Screening identifies grounded models	screen produces grounded% / memorized% / abstain% per model	screen runs end-to-end < $1
G2	System beats naive self-correction	accuracy on eval set vs "revisit" and verbatim re-injection arms	≥ +5pp at equal or lower cost
G3	Cheap ensemble beats one big call	correct-per-dollar vs one frontier-class call	≥ 2× correct-per-dollar
G4	Confidence is calibrated by structure	conclusions with support width ≥ 2 are correct more often than width 1	monotone relationship observed
G5	Auditable output	every conclusion ships with its graph: surviving claims, width, disputed nodes	100% of runs

6. Non-goals

Not a truth oracle. The graph verifies argument shape; every edge is semantically vouched for only by a model. Validly-connected garbage passes structure checks - by design, the structural layer only decides where semantic effort (re-interrogation) goes.
No fine-tuning, no inference-loop surgery. Everything works through standard chat-completions + tool calls. No custom sampling, no template hacking, no stop-at </think> interception.
No persistence. Graphs are session-scoped and ephemeral (in-memory). No database.
Not a general theorem prover. Natural-language claims, heuristic dedup - not formal logic.

System Design

1. Architecture

Two deployment modes for the graph server, same codebase:

Library mode (default for orchestrator): orchestrator imports graph_server.store directly, no MCP transport. Simpler, faster, use for P3/P4.
MCP mode: server.py exposes the same functions as MCP tools over stdio, so any MCP-capable agent (Claude Code, etc.) can use the graph as external working memory. Thin wrapper only - zero logic in the transport layer.

2. Component A - Grounding screen (P1, exists)

Already implemented at repo root: screen.py, items.jsonl (50 items), models.json. Contract: temperature 0, 1 call/item/model, regex-graded. See README.md. Output feeds one decision: which model id the orchestrator defaults to. Do not modify except to fix bugs.

3. Component B - Graph server (P2)

3.1 Data model

In-memory only. One GraphStore holds many independent graphs keyed by graph_id (string, caller-supplied).

Node:

id: str - caller-supplied, unique within graph, e.g. "r2:n4" (run 2, node 4)
claim: str - natural-language claim, one sentence
type: str - "given" | "inference" | "assumption" | "conclusion"
confidence: float - 0.0-1.0, model-stated, default 0.8
run_ids: set[str] - provenance: which interrogation runs asserted this claim
refuted: bool - default False, set only via mark_refuted
refute_reason: str | None
aliases: list[str] - claims of nodes merged into this one (dedup provenance)

Edge:

src: str, dst: str
relation: str - "supports" | "attacks" | "assumes"
confidence: float - 0.0-1.0, default 0.8
run_ids: set[str]

Backed by one networkx.MultiDiGraph per graph_id (multi because supports and attacks may both exist between the same pair). Node attrs hold the Node fields; edge attrs hold relation/confidence/run_ids.

Validation on ingest (reject with error message, don't crash): unknown type/relation, edge endpoints not in graph after this ingest, confidence outside [0,1], duplicate node id with different claim text (same id + same claim = idempotent, merge run_ids).

3.2 Public functions / MCP tools

Eight functions. Signatures below are the library API; MCP tool schemas mirror them 1:1 with graph_id always the first param.

#	Function	Params	Returns
1	`assert_graph`	graph_id, nodes[], edges[]	`{accepted_nodes, accepted_edges, rejected: [{item, reason}], auto_merged: [[kept_id, merged_id], ...]}` - runs dedup (§3.3) on the new nodes against existing ones automatically
2	`merge_duplicates`	graph_id, jaccard_threshold=0.7, ratio_threshold=0.85	`{merges: [[kept_id, merged_id]], contradictions_created: [[id_a, id_b]]}` - full-graph re-pass
3	`check_structure`	graph_id, conclusion_id	`{orphans: [id], cycles: [[id,...]], unreachable_conclusion: bool, refuted_but_feeding: [id]}`
4	`critical_links`	graph_id, conclusion_id	`{min_cut_nodes: [id], bridge_edges: [[src,dst]], ranked: [{edge, betweenness, min_confidence_on_edge}]}` sorted least-confident-first
5	`support_width`	graph_id, conclusion_id	`{disjoint_paths: int, paths: [[id,...]], max_flow: float}`
6	`surviving_claims`	graph_id	`{in: [id], out: [id], undecided: [id], surviving: [id]}` (§3.5)
7	`mark_refuted`	graph_id, node_id, reason	`{ok, width_before, width_after}` for the graph's conclusion node if one exists
8	`disputed_nodes`	graph_id, conclusion_id	`{contradiction_pairs: [[id_a, id_b]], isolated_load_bearing: [{id, run_count, on_path}]}` (§3.6)

Every function MUST be pure-read except 1, 2, 7. Every function MUST return JSON-serializable dicts. Errors return {error: "..."}, never raise across the MCP boundary.

3.3 Dedup / merge (THE load-bearing component)

Purpose: identical claims from different runs MUST collapse to one node, otherwise (a) support width is inflated by paraphrase and (b) contradictions never meet each other.

Normalization norm(claim):

lowercase; unicode NFC
strip punctuation except . inside numbers and %
canonicalize numbers: remove thousands separators (84,200 → 84200)
collapse whitespace
tokenize on whitespace; drop English stopwords (small fixed list: a, an, the, is, are, was, were, of, in, on, at, to, that, this, it, and)

Negation guard (checked BEFORE similarity): let NEG = {not, no, never, cannot, "n't", without, false}. If tokens(a) - NEG == tokens(b) - NEG but the two differ in negation-token count parity → DO NOT merge; instead create mutual attacks edges between them and report in contradictions_created.

Numeric-conflict guard: if token sets are equal except for differing numeric tokens (e.g. "trellium melts at 412" vs "trellium melts at 350") → do not merge; create mutual attacks edges and report.

Similarity (only if guards pass): merge iff jaccard(tokens_a, tokens_b) >= 0.7 OR difflib.SequenceMatcher(None, norm_a, norm_b).ratio() >= 0.85. Both thresholds tunable via config. (v1 upgrade path, out of scope: cosine over a small sentence-embedding model; keep the interface identical.)

Merge policy: keep the node with the earliest assertion (stable order: first run_id, then node id). Union run_ids and aliases; keep max confidence; re-point all edges to the kept node; collapse resulting parallel edges of the same relation (max confidence, union run_ids). given type wins over inference wins over assumption if merged nodes disagree on type. Never merge a refuted=True node into a live one - merge live into refuted keeps refuted.

Clustering: union-find over pairwise matches (O(n²) pairwise is fine; graphs are hundreds of nodes, not millions).

3.4 Structural checks (networkx recipes)

Work on the supports+assumes subgraph (attacks edges excluded) unless stated. Build D = nx.DiGraph view collapsing the multigraph, keeping max confidence per (src,dst).

orphans: nodes with in-degree 0 that are not type=given and not type=assumption (assumptions are declared floats; still reported separately as assumptions in the payload - implementer: include key assumptions: [id]).
cycles: list(nx.simple_cycles(D)), cap output at first 10.
unreachable_conclusion: conclusion not in descendants(D, g) for any given g. Compute once: reachable = union of nx.descendants(D, g) | {g} for all givens; unreachable iff conclusion not in reachable.
refuted_but_feeding: nodes with refuted=True that still have a path to the conclusion in D.
critical_links: nx.minimum_node_cut(D_aug, s, t) and betweenness via nx.betweenness_centrality_subset(D, sources=givens, targets=[conclusion]). For s-t computations with multiple sources add a virtual super-source __S__ with edges to every given (confidence 1.0), and use conclusion as sink; NEVER report __S__ in results.
Bridge edges: edges whose removal makes conclusion unreachable (test each edge on the paths; graphs are small, brute force is fine).

3.5 support_width & surviving_claims

support_width (Menger): on D_aug (super-source __S__ → all non-refuted givens): disjoint_paths = len(list(nx.node_disjoint_paths(D_aug, "__S__", conclusion))) - strip __S__ from reported paths. Handle NetworkXNoPath → 0.

max_flow: capacities = edge confidence; node capacities = node confidence (implement via standard node-splitting: v → v_in, v_out with capacity = confidence). nx.maximum_flow_value on the split graph. Givens and __S__ edges get capacity ∞ (use 1e9).

surviving_claims (grounded semantics over attacks, then reachability):

A = subgraph with attacks edges only (all nodes, attack edges)
label all nodes UNDEC
repeat until no change:
- node -> IN if every attacker of it is OUT (vacuously true if no attackers)
- node -> OUT if some attacker of it is IN
refuted nodes are forced OUT before the loop starts
surviving = { n : label(n) != OUT and (type(n) == "given" or n reachable from some IN/UNDEC given via supports/assumes edges through only non-OUT nodes) }
Return the full labelling plus surviving.

3.6 disputed_nodes

contradiction_pairs: all pairs connected by mutual attacks created by the guards in §3.3, plus any pair the caller asserted with explicit mutual attacks.
isolated_load_bearing: nodes with len(run_ids) == 1 AND (betweenness_subset > 0 from givens to conclusion OR node lies on any given→conclusion simple path). These are the prime confabulation suspects; the orchestrator re-interrogates them first.

4. Component C - Orchestrator (P3)

CLI: python -m orchestrator.run --task task.json --model <id> --n 6 --k 2 --budget-calls 20 --temp 0.8

task.json: {"question": str, "documents": [str], "expected_answer": str|null} (expected_answer only used by the eval harness).

4.1 Control loop

FAN OUT: N parallel interrogation calls (fresh context each, temperature=temp). Each call gets: system prompt (§4.2) + documents + question. Parse each response's JSON graph (§4.3). assert_graph with run_id="r{i}", node ids prefixed "r{i}:".
MERGE: merge_duplicates(graph_id).
ASSESS: check_structure, support_width(conclusion), disputed_nodes. The conclusion node: each run marks exactly one node type=conclusion; after merge there should be 1..N conclusion nodes. If >1 cluster of conclusions survive dedup, the runs DISAGREE on the answer itself -> treat each cluster as a candidate; the report ranks candidates by support width.
RE-INTERROGATE (while budget remains and disputed set nonempty): For each disputed node (contradiction pairs first, then isolated load-bearing, max 3 per round): send M=3 fresh verification calls (§4.2 template B: the claim + documents ONLY - no graph, no prior transcript). Majority verdict:
- refuted by >=2 of 3 -> mark_refuted(node, reason)
- confirmed by >=2 of 3 -> bump confidence to max(old, 0.9), add run_id "v*"
STOP when: (a) no disputed nodes remain, or (b) all conclusion candidates stable AND top candidate width >= k, or (c) budget-calls exhausted.

Read on DEV Community ↗ ← Back to News

Logic-Graph Verification System