Logic-Graph Verification System
DEV Community

Logic-Graph Verification System

Logic-Graph Verification System

Status: Draft v1 Β· 2026-07-02
Author: design session (Claude Fable 5)
Implementer: any capable coding model, per SYSTEM-DESIGN.md

1. One-liner

Convert the cheap token throughput of open-weight thinking models into verified conclusions, by externalizing their reasoning as a graph, sampling it N independent times, and letting mechanical graph checks - not model vibes - decide what survives.

2. Problem

Self-correction by prompting fails. Re-feeding a model its own chain-of-thought ("let's revisit...") makes it anchor to its own tokens and rubber-stamp itself. Known result (Huang et al., "LLMs Cannot Self-Correct Reasoning Yet").

tokens/sec is a garbage metric. A cheap model emitting 1M tokens of confident nonsense is worth less than nothing. The metric that matters is correct conclusions per dollar.

Verification-by-LLM inherits the problem. Asking a model "is this true?" just moves the vibes one level up. We need checks that are mechanical - computable, deterministic, not model-judged.

Cheap models confabulate from pretraining. A system whose claims must trace to imported sources cannot tolerate a model that silently answers from its weights.

3. Core insights (the product thesis)

Shape is checkable even when truth isn't. An argument externalized as a directed graph admits mechanical checks: unsupported premises (orphans), circular reasoning (cycles), disconnected conclusions (reachability), refuted-claims-still-load-bearing. "Truthiness" becomes topology.

Confidence in a conclusion ∝ how many node-disjoint chains connect evidence to it (Menger's theorem). One road to Rome = fragile; three independent roads = an adversary must break three separate claims.

Independence is sampled, not requested. Asking one context for "3 independent arguments" yields the same argument paraphrased 3 times. Asking N independent inference calls (fresh context, temperature > 0) yields genuinely decorrelated attempts - the detective's technique: separate interrogations, then cross-examine. Contradictions between runs are the interrogation slip-ups; they are invisible inside any single transcript.

Direct attention, don't diffuse it. "Revisit your reasoning" makes a model skim everything shallowly. "Edge 7 is load-bearing and unverified" makes it spend its whole budget on the right spot. Min-cut/betweenness computes where that spot is.

The formalization tax is a feature. A vague step becomes visibly vague when it must be written as a typed node with typed edges.

4. Users & use cases

Primary user: a developer/researcher running multi-step reasoning or research-over-documents workloads on cheap open-weight models (via OpenRouter or local inference), who needs reliability without frontier-model prices.

Use cases:

  • U1: Multi-hop question answering over provided documents, with an auditable argument graph as output.
  • U2: Claim verification - "here is a conclusion; how much would have to be wrong for it to fail?"
  • U3: As an MCP tool available to any agent, giving it external working memory for arguments plus structural self-checks.

5. Goals & success metrics

# Goal Metric Target
G1 Screening identifies grounded models screen produces grounded% / memorized% / abstain% per model screen runs end-to-end < $1
G2 System beats naive self-correction accuracy on eval set vs "revisit" and verbatim re-injection arms β‰₯ +5pp at equal or lower cost
G3 Cheap ensemble beats one big call correct-per-dollar vs one frontier-class call β‰₯ 2Γ— correct-per-dollar
G4 Confidence is calibrated by structure conclusions with support width β‰₯ 2 are correct more often than width 1 monotone relationship observed
G5 Auditable output every conclusion ships with its graph: surviving claims, width, disputed nodes 100% of runs

6. Non-goals

  • Not a truth oracle. The graph verifies argument shape; every edge is semantically vouched for only by a model. Validly-connected garbage passes structure checks - by design, the structural layer only decides where semantic effort (re-interrogation) goes.
  • No fine-tuning, no inference-loop surgery. Everything works through standard chat-completions + tool calls. No custom sampling, no template hacking, no stop-at </think> interception.
  • No persistence. Graphs are session-scoped and ephemeral (in-memory). No database.
  • Not a general theorem prover. Natural-language claims, heuristic dedup - not formal logic.

System Design

1. Architecture

Two deployment modes for the graph server, same codebase:

  • Library mode (default for orchestrator): orchestrator imports graph_server.store directly, no MCP transport. Simpler, faster, use for P3/P4.
  • MCP mode: server.py exposes the same functions as MCP tools over stdio, so any MCP-capable agent (Claude Code, etc.) can use the graph as external working memory. Thin wrapper only - zero logic in the transport layer.

2. Component A - Grounding screen (P1, exists)

Already implemented at repo root: screen.py, items.jsonl (50 items), models.json. Contract: temperature 0, 1 call/item/model, regex-graded. See README.md. Output feeds one decision: which model id the orchestrator defaults to. Do not modify except to fix bugs.

3. Component B - Graph server (P2)

3.1 Data model

In-memory only. One GraphStore holds many independent graphs keyed by graph_id (string, caller-supplied).

Node:

  • id: str - caller-supplied, unique within graph, e.g. "r2:n4" (run 2, node 4)
  • claim: str - natural-language claim, one sentence
  • type: str - "given" | "inference" | "assumption" | "conclusion"
  • confidence: float - 0.0-1.0, model-stated, default 0.8
  • run_ids: set[str] - provenance: which interrogation runs asserted this claim
  • refuted: bool - default False, set only via mark_refuted
  • refute_reason: str | None
  • aliases: list[str] - claims of nodes merged into this one (dedup provenance)

Edge:

  • src: str, dst: str
  • relation: str - "supports" | "attacks" | "assumes"
  • confidence: float - 0.0-1.0, default 0.8
  • run_ids: set[str]

Backed by one networkx.MultiDiGraph per graph_id (multi because supports and attacks may both exist between the same pair). Node attrs hold the Node fields; edge attrs hold relation/confidence/run_ids.

Validation on ingest (reject with error message, don't crash): unknown type/relation, edge endpoints not in graph after this ingest, confidence outside [0,1], duplicate node id with different claim text (same id + same claim = idempotent, merge run_ids).

3.2 Public functions / MCP tools

Eight functions. Signatures below are the library API; MCP tool schemas mirror them 1:1 with graph_id always the first param.

# Function Params Returns
1 assert_graph graph_id, nodes[], edges[] {accepted_nodes, accepted_edges, rejected: [{item, reason}], auto_merged: [[kept_id, merged_id], ...]} - runs dedup (Β§3.3) on the new nodes against existing ones automatically
2 merge_duplicates graph_id, jaccard_threshold=0.7, ratio_threshold=0.85 {merges: [[kept_id, merged_id]], contradictions_created: [[id_a, id_b]]} - full-graph re-pass
3 check_structure graph_id, conclusion_id {orphans: [id], cycles: [[id,...]], unreachable_conclusion: bool, refuted_but_feeding: [id]}
4 critical_links graph_id, conclusion_id {min_cut_nodes: [id], bridge_edges: [[src,dst]], ranked: [{edge, betweenness, min_confidence_on_edge}]} sorted least-confident-first
5 support_width graph_id, conclusion_id {disjoint_paths: int, paths: [[id,...]], max_flow: float}
6 surviving_claims graph_id {in: [id], out: [id], undecided: [id], surviving: [id]} (Β§3.5)
7 mark_refuted graph_id, node_id, reason {ok, width_before, width_after} for the graph's conclusion node if one exists
8 disputed_nodes graph_id, conclusion_id {contradiction_pairs: [[id_a, id_b]], isolated_load_bearing: [{id, run_count, on_path}]} (Β§3.6)

Every function MUST be pure-read except 1, 2, 7. Every function MUST return JSON-serializable dicts. Errors return {error: "..."}, never raise across the MCP boundary.

3.3 Dedup / merge (THE load-bearing component)

Purpose: identical claims from different runs MUST collapse to one node, otherwise (a) support width is inflated by paraphrase and (b) contradictions never meet each other.

Normalization norm(claim):

  • lowercase; unicode NFC
  • strip punctuation except . inside numbers and %
  • canonicalize numbers: remove thousands separators (84,200 β†’ 84200)
  • collapse whitespace
  • tokenize on whitespace; drop English stopwords (small fixed list: a, an, the, is, are, was, were, of, in, on, at, to, that, this, it, and)

Negation guard (checked BEFORE similarity): let NEG = {not, no, never, cannot, "n't", without, false}. If tokens(a) - NEG == tokens(b) - NEG but the two differ in negation-token count parity β†’ DO NOT merge; instead create mutual attacks edges between them and report in contradictions_created.

Numeric-conflict guard: if token sets are equal except for differing numeric tokens (e.g. "trellium melts at 412" vs "trellium melts at 350") β†’ do not merge; create mutual attacks edges and report.

Similarity (only if guards pass): merge iff jaccard(tokens_a, tokens_b) >= 0.7 OR difflib.SequenceMatcher(None, norm_a, norm_b).ratio() >= 0.85. Both thresholds tunable via config. (v1 upgrade path, out of scope: cosine over a small sentence-embedding model; keep the interface identical.)

Merge policy: keep the node with the earliest assertion (stable order: first run_id, then node id). Union run_ids and aliases; keep max confidence; re-point all edges to the kept node; collapse resulting parallel edges of the same relation (max confidence, union run_ids). given type wins over inference wins over assumption if merged nodes disagree on type. Never merge a refuted=True node into a live one - merge live into refuted keeps refuted.

Clustering: union-find over pairwise matches (O(nΒ²) pairwise is fine; graphs are hundreds of nodes, not millions).

3.4 Structural checks (networkx recipes)

Work on the supports+assumes subgraph (attacks edges excluded) unless stated. Build D = nx.DiGraph view collapsing the multigraph, keeping max confidence per (src,dst).

  • orphans: nodes with in-degree 0 that are not type=given and not type=assumption (assumptions are declared floats; still reported separately as assumptions in the payload - implementer: include key assumptions: [id]).
  • cycles: list(nx.simple_cycles(D)), cap output at first 10.
  • unreachable_conclusion: conclusion not in descendants(D, g) for any given g. Compute once: reachable = union of nx.descendants(D, g) | {g} for all givens; unreachable iff conclusion not in reachable.
  • refuted_but_feeding: nodes with refuted=True that still have a path to the conclusion in D.
  • critical_links: nx.minimum_node_cut(D_aug, s, t) and betweenness via nx.betweenness_centrality_subset(D, sources=givens, targets=[conclusion]). For s-t computations with multiple sources add a virtual super-source __S__ with edges to every given (confidence 1.0), and use conclusion as sink; NEVER report __S__ in results.
  • Bridge edges: edges whose removal makes conclusion unreachable (test each edge on the paths; graphs are small, brute force is fine).
3.5 support_width & surviving_claims

support_width (Menger): on D_aug (super-source __S__ β†’ all non-refuted givens): disjoint_paths = len(list(nx.node_disjoint_paths(D_aug, "__S__", conclusion))) - strip __S__ from reported paths. Handle NetworkXNoPath β†’ 0.

max_flow: capacities = edge confidence; node capacities = node confidence (implement via standard node-splitting: v β†’ v_in, v_out with capacity = confidence). nx.maximum_flow_value on the split graph. Givens and __S__ edges get capacity ∞ (use 1e9).

surviving_claims (grounded semantics over attacks, then reachability):

  • A = subgraph with attacks edges only (all nodes, attack edges)
  • label all nodes UNDEC
  • repeat until no change:
    • node -> IN if every attacker of it is OUT (vacuously true if no attackers)
    • node -> OUT if some attacker of it is IN
  • refuted nodes are forced OUT before the loop starts
  • surviving = { n : label(n) != OUT and (type(n) == "given" or n reachable from some IN/UNDEC given via supports/assumes edges through only non-OUT nodes) }
  • Return the full labelling plus surviving.
3.6 disputed_nodes
  • contradiction_pairs: all pairs connected by mutual attacks created by the guards in Β§3.3, plus any pair the caller asserted with explicit mutual attacks.
  • isolated_load_bearing: nodes with len(run_ids) == 1 AND (betweenness_subset > 0 from givens to conclusion OR node lies on any givenβ†’conclusion simple path). These are the prime confabulation suspects; the orchestrator re-interrogates them first.

4. Component C - Orchestrator (P3)

CLI: python -m orchestrator.run --task task.json --model <id> --n 6 --k 2 --budget-calls 20 --temp 0.8

task.json: {"question": str, "documents": [str], "expected_answer": str|null} (expected_answer only used by the eval harness).

4.1 Control loop
  1. FAN OUT: N parallel interrogation calls (fresh context each, temperature=temp). Each call gets: system prompt (Β§4.2) + documents + question. Parse each response's JSON graph (Β§4.3). assert_graph with run_id="r{i}", node ids prefixed "r{i}:".

  2. MERGE: merge_duplicates(graph_id).

  3. ASSESS: check_structure, support_width(conclusion), disputed_nodes. The conclusion node: each run marks exactly one node type=conclusion; after merge there should be 1..N conclusion nodes. If >1 cluster of conclusions survive dedup, the runs DISAGREE on the answer itself -> treat each cluster as a candidate; the report ranks candidates by support width.

  4. RE-INTERROGATE (while budget remains and disputed set nonempty): For each disputed node (contradiction pairs first, then isolated load-bearing, max 3 per round): send M=3 fresh verification calls (Β§4.2 template B: the claim + documents ONLY - no graph, no prior transcript). Majority verdict:

    • refuted by >=2 of 3 -> mark_refuted(node, reason)
    • confirmed by >=2 of 3 -> bump confidence to max(old, 0.9), add run_id "v*"
  5. STOP when: (a) no disputed nodes remain, or (b) all conclusion candidates stable AND top candidate width >= k, or (c) budget-calls exhausted.

Comments

No comments yet. Start the discussion.