Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)

Every RAG app and agent eventually hits the same wall: you have more stuff than fits in the model's context window (a system prompt, chat history, retrieved documents, tool output) and a fixed token budget.

The usual "fix" is to truncate the whole blob at the end. That means you chop off whatever happened to be last: sometimes a doc, sometimes half your system prompt. You drop the wrong things.

I got tired of rewriting that logic in every project, so I built contextcram, a tiny, zero-dependency library that treats this as a prioritized packing problem.

## The idea

Give each piece of context a priority and a strategy for what should happen if it doesn't fit. Set a token budget. contextcram assembles the largest in-budget context that keeps the important parts.

```bash
pip install contextcram
```

```python
from contextcram import Packer

# system_prompt, chat_history, retrieved_docs and tool_output are your own data
ctx = (
    Packer(budget=8000)
    .add(system_prompt, priority="required")                   # never dropped
    .add(chat_history, priority="high", strategy="trim")       # drop oldest turns
    .add(retrieved_docs, priority="medium", strategy="drop")   # all-or-nothing
    .add(tool_output, priority="low", strategy="truncate")     # cut to fit
    .fit()
)

print(ctx.text)           # the assembled, in-budget context
print(ctx.used_tokens)    # e.g. 7840
print(ctx.dropped_names)  # what didn't make the cut
```

## Strategies

When an optional item doesn't fully fit, its strategy decides what happens:

| Strategy | Behavior |
|---|---|
| `drop` | Include it whole, or not at all |
| `truncate` | Cut from the end, keep the head (default) |
| `truncate_head` | Cut from the start, keep the tail |
| `trim` | For lists (e.g. messages): drop oldest first |

`required` items are always kept; if they alone blow the budget, you get a clear `BudgetExceeded` error instead of a silently mangled prompt.
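To make the strategy table concrete, here is a rough sketch of how priority-ordered packing could work. This is not contextcram's actual code: `Item`, `pack`, `fit_item`, `OverBudget`, and the 4-characters-per-token estimate are all names and assumptions I made up for illustration, and it ignores the few tokens taken by the separators between pieces.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical stand-ins for illustration only, not contextcram internals.
PRIORITY_ORDER = {"required": 0, "high": 1, "medium": 2, "low": 3}


class OverBudget(Exception):
    """Raised when a required item cannot fit in the remaining room."""


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return (len(text) + 3) // 4


@dataclass
class Item:
    name: str
    content: str | list[str]
    priority: str
    strategy: str = "truncate"


def fit_item(item: Item, room: int) -> str | None:
    """Return the part of `item` that fits in `room` tokens, or None if dropped."""
    parts = list(item.content) if isinstance(item.content, list) else None
    text = "\n".join(parts) if parts is not None else item.content
    if estimate_tokens(text) <= room:
        return text
    if item.priority == "required":
        raise OverBudget(f"required item {item.name!r} does not fit")
    if room <= 0 or item.strategy == "drop":
        return None
    if item.strategy == "trim" and parts is not None:
        # Drop the oldest entries until the rest fits.
        while parts and estimate_tokens("\n".join(parts)) > room:
            parts.pop(0)
        return "\n".join(parts) if parts else None
    max_chars = room * 4
    if item.strategy == "truncate_head":
        return text[-max_chars:]  # cut from the start, keep the tail
    return text[:max_chars]       # cut from the end, keep the head


def pack(items: list[Item], budget: int) -> tuple[str, int, list[str]]:
    """Fill the budget in priority order, then emit pieces in their original order."""
    kept: dict[str, str] = {}
    dropped: list[str] = []
    room = budget
    # sorted() is stable, so items with equal priority keep insertion order.
    for item in sorted(items, key=lambda it: PRIORITY_ORDER[it.priority]):
        piece = fit_item(item, room)
        if piece is None:
            dropped.append(item.name)
        else:
            kept[item.name] = piece
            room -= estimate_tokens(piece)
    text = "\n\n".join(kept[it.name] for it in items if it.name in kept)
    return text, budget - room, dropped


items = [
    Item("system", "You are a helpful assistant.", "required"),
    Item("history", ["user: hi", "assistant: hello", "user: summarize the doc"], "high", "trim"),
    Item("docs", "x" * 400, "medium", "drop"),
    Item("tool", "y" * 400, "low", "truncate"),
]
text, used, dropped = pack(items, budget=60)
print(used)     # 60
print(dropped)  # ['docs']
```

In this toy run the docs are too big to fit whole, so they are dropped, and the lower-priority tool output is cut down to fill the room that is left. Whether the real library backfills like this is something I'd check against its source rather than assume.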
## The part I actually wanted: model-aware budgets + room to answer

Two recurring annoyances solved in one line:

```python
from contextcram import Packer

# Budget pulled from the model; hold back 2k tokens for the reply
packer = Packer(model="gpt-4o", reserve=2000)

print(packer.full_budget)  # 128000
print(packer.budget)       # 126000  <- what you actually pack into
```

`reserve=` kills the classic "the prompt fit, but there's no room left for the model to answer" bug. Tie it to your `max_tokens` so the room you hold back always matches the reply you ask for.

## Real-world: with LangChain

```python
from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# retriever, memory, question and SYSTEM_PROMPT come from your own app
llm = ChatOpenAI(model="gpt-4o")

docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add(history, priority="high", strategy="trim")
    .add("\n\n".join(docs), priority="medium", strategy="drop")
    .fit()
)

response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])
```

Need exact token counts? Pass `tokenizer=tiktoken_tokenizer("gpt-4o")`, or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line `CallableTokenizer`. The default is a fast characters-per-token heuristic, so there are no required dependencies.

## How is this different from Priompt / Prompt Poet?

Honest answer: the concept isn't new. Priompt (and its Python port) and Character.AI's Prompt Poet do priority-based context assembly too, and they're more powerful (component models, cache-aware truncation, templating).

contextcram deliberately trades features for simplicity and zero dependencies:

- Pure stdlib: no Jinja2, no YAML, no heavy SDK.
- A 3-line API: `Packer(...).add(...).fit()`.
- Framework-agnostic: LangChain, LlamaIndex, raw SDKs, or nothing.
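On the "pure stdlib" point: a characters-per-token estimate and the reserve arithmetic need nothing beyond built-ins. A minimal sketch, not the library's implementation; `approx_tokens`, `packing_budget`, the `CONTEXT_WINDOWS` table, and the 4-characters-per-token ratio are assumptions for illustration (only the 128000 figure for gpt-4o comes from the example earlier in the post).

```python
import math

# Assumed lookup table; a real one would cover more models.
CONTEXT_WINDOWS = {"gpt-4o": 128_000}


def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from length alone; fast but only approximate."""
    return math.ceil(len(text) / chars_per_token)


def packing_budget(model: str, reserve: int) -> int:
    """Context window minus the tokens held back for the model's reply."""
    full = CONTEXT_WINDOWS[model]
    if not 0 <= reserve < full:
        raise ValueError("reserve must be non-negative and smaller than the context window")
    return full - reserve


max_tokens = 2000  # the reply length you plan to request from the model
budget = packing_budget("gpt-4o", reserve=max_tokens)
print(budget)                        # 126000
print(approx_tokens("hello world"))  # 3 (11 characters / 4, rounded up)
```

Because the estimate is only approximate, it is worth leaving a little slack in the reserve, or switching to an exact tokenizer when you are packing close to the limit.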
If you want the smallest possible helper that does one thing (fit prioritized pieces into a budget), this is it.

## Try it

```bash
pip install contextcram
```

- ⭐ Repo: https://github.com/Waelr1985/contextcram
- 📦 PyPI: https://pypi.org/project/contextcram/

It's MIT, fully typed (`mypy --strict`), and tested across Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies: open an issue or drop a comment.

