How Lume Works: The Retrieval Primitives
How Lume Works: The Retrieval Primitives
Comments
Lume is a Rust hybrid search engine that Steve Harris and I have been building in the open at github.com/DeepBlueDynamics/lume. Itâs a small CLI plus an MCP server, BSD-3 licensed, and built around a stubborn idea: when an agent asks a question, every step from query to evidence should be inspectable.
Lume indexes Markdown, source code, and PDFs (via a small Python extractor) and ranks over them with three independent primitives - field-aware BM25, dense GTR-T5 vectors via Shivvr, and a significance-scored entity graph. The lexical core and the graph run entirely on your machine; only the dense vectors call out, and that endpoint defaults to localhost. There is no opaque âsearch box that returns a rankingâ - every score has a name, a file, and a knob.
This post walks Lumeâs retrieval core end to end, with line-level references to the current tree. If youâre building agentic systems and tired of treating retrieval as a magic step, this is for you.
A few principles up front, because they explain the design:
- Local-first. Lexical search and the entity graph run entirely on your machine. Dense vectors are fetched from Shivvr through
SHIVVR_BASE_URL, which defaults to a local endpoint. - Layered, not monolithic. BM25, semantic, and graph are independent signals with their own scores. The blend is one line; each input is replaceable.
- Auditable. The engine prints what it pruned, what it ranked, and why it rejected the rest.
0. The unit of retrieval: a Section
Lume indexes Markdown, cut into sections at # headers (parse_markdown in src/bm25.rs:211). A Section (src/bm25.rs:106) is the atom everything ranks over:
pub struct Section {
pub title: String,
pub body: String,
pub line_number: usize,
pub filename: Option<String>,
pub entities: Vec<String>, // resolved named entities, for the graph
}
Title and body are separate fields with separate statistics - that distinction shows up immediately in scoring. The whole index lives in memory as a Bm25Index (src/bm25.rs:147): per-field term-frequency maps, document frequencies, field lengths, roaring-bitmap posting lists, prime/GĂśdel signature filters, and the entity posting lists that feed the graph.
1. Primitive: field-aware BM25
The lexical core is a field-aware BM25 with three selectable variants. The tuning defaults (Bm25Params in src/bm25.rs:125) are deliberately classic:
Self {
k1: 1.2,
b: 0.75,
delta: 1.0,
title_weight: 2.0,
body_weight: 1.0,
}
k1 controls term-frequency saturation; b controls length normalization. The one opinionated choice is title_weight: 2.0: a title hit contributes twice as much as a body hit before the coordination factor is applied. That is useful, but it can overweight chapter titles when a query token is broad. Treat it as a knob, not a law.
IDF is the standard smoothed form, floored at zero, and each termâs contribution is computed per field then summed with the field weights (calculate_bm25_term_score in src/bm25.rs:728):
let len_normalization = 1.0 - b + b * (doc_len / avgdl);
match variant {
SearchVariant::Classic => idf * (tf * (k1 + 1.0)) / (tf + k1 * len_normalization),
SearchVariant::Plus => idf * ((tf*(k1+1.0))/(tf + k1*len_normalization) + params.delta),
SearchVariant::L => {
let s = tf / len_normalization;
idf * (s*(k1+1.0))/(s + k1)
},
}
// total_score += title_weight * title_score + body_weight * body_score;
// (src/bm25.rs:635)
- Classic is textbook BM25.
- Plus adds a delta floor so a matched term never contributes nothing, countering BM25âs over-penalty of long documents.
- L moves length normalization inside the saturation, smoothing very long docs.
Lume runs Classic by default (src/main.rs:1430).
2. Two-stage pruning: roaring union, then GĂśdel signatures
You donât want to BM25-score all 1,926 sections of a book for every query. Lumeâs search (src/bm25.rs:445) is two-stage.
Stage 1 - candidate gather. Union the roaring-bitmap posting lists of the query terms. This is a handful of bitset ORs and instantly narrows the corpus to sections that contain any query term:
// src/bm25.rs:460
let mut candidate_set = MiniRoaring::new();
let mut first = true;
for q_tok in &query_tokens {
if let Some(list) = self.posting_lists.get(&q_tok.bytes) {
if first {
candidate_set = list.clone();
first = false;
} else {
candidate_set = candidate_set.union(list);
}
}
}
Stage 1b - GĂśdel tag-signature pruning. If the query tagger recognizes entities, each candidate section is verified against a prime-factored signature filter (PrimeFilter::test_tag_prime in src/fast_retrieval.rs:449, evaluated in src/bm25.rs:538). Each known tag output maps to a prime; a sectionâs tag signature is the product of its tag primes, so inclusion is checked by divisibility. Unknown query tags deliberately receive a dummy prime and fail closed. Candidates that fail are dropped as TagSignatureMismatch before heavier scoring.
Stage 2 - heavy scoring runs only on survivors. And the engine tells you the shape of the funnel on stderr (src/bm25.rs:557):
[Two-Stage Pruning] Pruned candidate space from 1926 to 302 (roaring generated: 609) sections in 54.70Âľs
Candidates: 609 Ranked: 302 Rejected: TagSignatureMismatch: 307
...
That accounting is not decoration - itâs the first thing you read when a query returns the wrong thing.
3. Query hygiene: stopwords and coordination
Two small primitives have outsized effects on quality.
Query-side stopword filtering (filter_query_stopwords in src/bm25.rs:98). Function and question words are stripped from the query only, never the index. Without it, âhow does Dantès know MercĂŠdèsâ is dominated by how/does/know, which match unrelated sections (a chapter literally titled âHow a GardenerâŚâ). The safety net: if every token is a stopword (âhow are youâ), the originals are kept so you still get results.
The coordination factor (src/bm25.rs:638). A document matching more of the distinct query terms should beat one that repeats a single common term. Lume multiplies the score by a coverage-based factor:
let coverage = matched_terms.len() as f64 / num_distinct as f64;
let coord = COORD_FLOOR + (1.0 - COORD_FLOOR) * coverage; // COORD_FLOOR = 0.5
total_score *= coord;
So a section matching all three query terms keeps 100% of its score; one matching a single term out of three keeps ~â
. For single-term queries coverage == 1.0, so ordinary lookups are untouched. Itâs a gentle nudge, not a hard AND.
4. Primitive: dense vectors (local GTR-T5)
Lexical search canât bridge a vocabulary gap - âstarved to deathâ vs âgastroenteritis.â Thatâs the semantic primitiveâs job.
Lume embeds text to 768-dimensional GTR-T5 vectors via Shivvr. The default base URL is http://localhost:8085 (src/hybrid.rs:777), and the request still requires a service token (src/hybrid.rs:784). Thereâs no dedicated embed endpoint, so embed_text (src/hybrid.rs:43) ingests into a throwaway scratch store and reads the vector straight off the response:
// 768-d GTR-T5 ("organize") vector, asserted on the way out:
if emb.len() != 768 {
return Err(format!("Expected 768-d GTR-T5 vector, got {}", emb.len()));
}
At index time, sections are pushed into a semantic session (ensure_semantic_session in src/hybrid.rs:581). The clever part is incrementality: each section gets a content hash (section_hash in src/hybrid.rs:407) used as the remote chunk id, with line numbers deliberately excluded - so moving a section doesnât force a re-embed. Re-indexing diffs by hash and tops up only what changed; a matching corpus fingerprint is a no-op. Query results are cached too (.lume-semantic-cache.json).
At query time, query_semantic_search (src/hybrid.rs:637) asks Shivvr for n=60 neighbors. If the index was built without semantic vectors, or the token is missing, search degrades cleanly to lexical BM25 and says so: an alpha > 0 request against a lexical-only index is told it got lexical-only (src/main.rs:1389).
5. Primitive: the Semantic Knowledge Graph (significance, not co-counts)
The third signal is structural. Lume builds an entity co-occurrence graph from pairwise roaring-bitset intersections - âcounting the counts of thingsâ (src/graph_search.rs:1). On its own thatâs a write-only export; graph_search turns it into a query-time ranking signal, fully local.
The subtle, important bit is how edges are weighted. Raw co-occurrence and even Jaccard reward promiscuous hubs - an entity that appears everywhere co-occurs with everything. Lume defaults to a significance score instead (cooccurrence_relatedness in src/semantic_mesh.rs:558): of n docs, A appears in a, B in b, both in k; compare observed k against the independence expectation E = a¡b/n as a z-score, log-compress it to preserve dynamic range, then squash with tanh:
let expected = a * b / n;
let variance = expected * (1.0 - a/n) * (1.0 - b/n);
let z = (k - expected) / variance.sqrt();
let compressed = z.signum() * (1.0 + z.abs()).ln(); // keep z=10 vs z=100 distinct
(compressed / 3.0).tanh() // -> [-1, 1]; negative = avoidance
The result lives in [-1, 1]: positive for real association, negative for avoidance (entities that co-occur below chance). This is Trey Graingerâs foreground-vs-background SKG relatedness, reduced to the pairwise case. A regression test pins the behavior: a true pairing (edmondâdantès) outranks a promiscuous hub even when the hub has perfect Jaccard overlap (src/graph_search.rs:305). You can still pick legacy Jaccard with --scoring jaccard; the edge_weight() (src/semantic_mesh.rs:532) selects, and clamps negative significance to 0 so avoidance never boosts.
The walk itself (compute_skg_scores in src/graph_search.rs:154):
- Resolve the queryâs entities by sliding n-grams longest-first (so âedmond dantesâ matches the stored âedmond dantèsâ),
max_ngram = 4. - Seed them at weight
1.0; walk one hop to the top-k (8) strongest neighbors, each carryingdecay * weight(decay = 0.5), taking the max across seeds - so a shared hub neighbor canât be summed into dominance. - Accumulate per-section mass from the entity posting lists and normalize to
[0, 1].
6. The blend: one multiplicative line
Three signals - lexical, semantic, graph - fuse in blend_hybrid_scores (src/hybrid.rs:673). The default is multiplicative, so the lexical match leads and the other two lift it:
hybrid = bm25 * (1.0 + alpha * semantic + beta * skg);
This keeps strong lexical hits on top while letting semantic and graph signals lift them. Recall expansion comes from two paths: semantic-only hits are admitted with bm25_score = 0, and SKG-only sections are admitted only when their normalized graph mass reaches SKG_EXPAND_MIN = 0.5 (src/graph_search.rs:22). In the lexical-only path those SKG-only sections are scored beta*skg, below comparable lexical matches (apply_skg_boost in src/graph_search.rs:218). Set beta = 0 and the graph term is removed.
Thereâs an alternate mode for when you want semantic/graph to be able to overtake lexical, gated behind LUME_BLEND_NORM=1, which puts all three on a comparable [0,1] scale (src/hybrid.rs:745):
// src/hybrid.rs:745
let hybrid_score = if normalize {
(bm25_score / bm25_max) + alpha * sem_score + beta * skg_score
} else if bm25_score > 0.0 {
bm25_score * (1.0 + alpha * sem_score + beta * skg_score)
} else {
sem_score + beta * skg_score // fallback for semantic-only recall expansion
};
7. The tuning surface: what each knob actually does
Everything above is exposed on lume search (handle_search in src/main.rs:1295). Hereâs the practical guide:
| Knob | Flag / env | Default | What it changes |
|---|---|---|---|
| Lexical â semantic | -a, --alpha / ALPHA |
0.5 in lume search |
0.0 = pure BM25; higher values increase the GTR-T5 term. Raise it when the answer uses different words than the query. |
| Graph boost | -g, --graph / GRAPH_ALPHA |
0.4 |
Weight of the SKG term. Raise to pull in entity-related sections; 0 disables the graph (pure lexical+semantic). |
| Edge scoring | --scoring |
relatedness |
relatedness (significance, hub-resistant) vs jaccard (raw overlap). Use significance unless youâre reproducing old numbers. |
| Spell correction | -c, --spell-check |
off | Corrects each query word against the index vocabulary (correct_query in src/main.rs:1273) before searching. |
| BM25 length / saturation | Bm25Params |
k1=1.2, b=0.75 |
Lower b for corpora with wildly varying section lengths; raise k1 to reward repeated terms more. |
| Field weight | title_weight |
2.0 |
How much a title hit beats a body hit. Lower it if chapter/section titles are hijacking results. |
| Query inversion (debug) | LUME_QUERY_INVERSION=1 |
off | Round-trips the queryâs GTR-T5 vector back to text (invert_vector in src/inversion.rs:30) so you can inspect the embedding. Costs an extra embed plus invert request and does not affect ranking. |
A worked example of the inversion trick: if semantic search keeps missing, set LUME_QUERY_INVERSION=1 and read what your query embeds back to. If âhow does the prisoner escapeâ inverts to something about gardens, your embedding isnât anchored on âescapeâ - reword, or lean lexical with a lower alpha.
8. Case study: the retrieval bug that confused the agent
To understand how these primitives interact, consider a real query run during testing over the full text of The Count of Monte Cristo: âHow does Edmond Dantèsâs father die?â
On Lumeâs initial run, the answering agent returned: âThe provided passages do not contain information regarding how Edmond Dantèsâs father died.â However, the death is fully detailed in the corpus. We traced this failure to three compounding issues at the query and retrieval parameter boundary:
- Proper-noun bias. The query planner generated keyword queries like
"Dantès father death". In Chapter 26, the father is referred to as âthe old manâ or âthe elder Dantèsâ, and his death is described without the word âdeathâ (âdied of starvationâ). Meanwhile, the literal chapter title âFather and Sonâ and broad entity matches onDantèspulled retrieval toward passages about the father while alive, pushing the target passage down the ranking.
Comments
No comments yet. Start the discussion.