DEV Community Grade 10 2h ago

Your LLM prompt doesn't fit? Pack it by priority (zero dependencies)

Every RAG app and agent eventually hits the same wall: you have more stuff than fits in the model's context window (a system prompt, chat history, retrieved documents, tool output) and a fixed token budget.

The usual "fix" is to truncate the whole blob at the end, which means you chop off whatever happened to be last: sometimes a doc, sometimes half your system prompt. You drop the wrong things.

I got tired of rewriting that logic in every project, so I built contextcram, a tiny, zero-dependency library that treats this as a prioritized packing problem.

## The idea

Give each piece of context a priority and a strategy for what should happen if it doesn't fit. Set a token budget. contextcram assembles the largest in-budget context that keeps the important parts.

```bash
pip install contextcram
```

```python
from contextcram import Packer

ctx = (
    Packer(budget=8000)
    .add(system_prompt, priority="required")                  # never dropped
    .add(chat_history, priority="high", strategy="trim")      # drop oldest turns
    .add(retrieved_docs, priority="medium", strategy="drop")  # all-or-nothing
    .add(tool_output, priority="low", strategy="truncate")    # cut to fit
    .fit()
)

print(ctx.text)           # the assembled, in-budget context
print(ctx.used_tokens)    # e.g. 7840
print(ctx.dropped_names)  # what didn't make the cut
```

## Strategies

When an optional item doesn't fully fit, its strategy decides what happens:

| Strategy | Behavior |
|---|---|
| `drop` | Include it whole, or not at all |
| `truncate` | Cut from the end, keep the head (default) |
| `truncate_head` | Cut from the start, keep the tail |
| `trim` | For lists (e.g. messages): drop oldest first |

`required` items are always kept; if they alone blow the budget, you get a clear `BudgetExceeded` error instead of a silently mangled prompt.
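To make the strategies concrete, here is a minimal sketch of how priority-based packing could work. It is not contextcram's implementation: `Item`, `shrink`, `pack`, and the 4-characters-per-token estimate are all hypothetical names and assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Lower number = packed earlier. Names mirror the priorities used above.
PRIORITY_ORDER = {"required": 0, "high": 1, "medium": 2, "low": 3}
CHARS_PER_TOKEN = 4  # assumption: a rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Ceiling of len(text) / CHARS_PER_TOKEN."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


@dataclass
class Item:
    name: str
    content: str | list[str]
    priority: str = "medium"
    strategy: str = "truncate"


def as_text(content: str | list[str]) -> str:
    return content if isinstance(content, str) else "\n".join(content)


def shrink(item: Item, room: int) -> str | None:
    """Return the largest version of the item that fits in `room` tokens, else None."""
    if item.strategy == "trim":
        turns = [item.content] if isinstance(item.content, str) else list(item.content)
        while turns and estimate_tokens("\n".join(turns)) > room:
            turns.pop(0)  # drop the oldest turn first
        return "\n".join(turns) if turns else None
    text = as_text(item.content)
    if estimate_tokens(text) <= room:
        return text
    if item.strategy == "drop" or room <= 0:
        return None  # all-or-nothing, or no room left at all
    max_chars = room * CHARS_PER_TOKEN
    if item.strategy == "truncate":
        return text[:max_chars]   # keep the head
    if item.strategy == "truncate_head":
        return text[-max_chars:]  # keep the tail
    raise ValueError(f"unknown strategy: {item.strategy}")


def pack(items: list[Item], budget: int) -> tuple[str, int, list[str]]:
    kept: dict[str, str] = {}
    used = 0
    # Required items go in whole; failing loudly beats a mangled prompt.
    for item in items:
        if item.priority == "required":
            kept[item.name] = as_text(item.content)
            used += estimate_tokens(kept[item.name])
    if used > budget:
        raise ValueError("required items alone exceed the budget")
    dropped: list[str] = []
    optional = [i for i in items if i.priority != "required"]
    for item in sorted(optional, key=lambda i: PRIORITY_ORDER[i.priority]):
        text = shrink(item, budget - used)
        if text is None:
            dropped.append(item.name)
        else:
            kept[item.name] = text
            used += estimate_tokens(text)
    # Emit in the order the items were added (separator cost is ignored here).
    ordered = [kept[i.name] for i in items if i.name in kept]
    return "\n\n".join(ordered), used, dropped


items = [
    Item("system", "You are a helpful assistant.", priority="required"),
    Item("history", ["user: hi", "bot: hello", "user: what is RAG?"],
         priority="high", strategy="trim"),
    Item("docs", "x" * 400, priority="medium", strategy="drop"),
    Item("tool", "y" * 400, priority="low", strategy="truncate"),
]
text, used, dropped = pack(items, budget=40)
print(used, dropped)  # 40 ['docs']
```

The design choice worth noticing is that packing order (by priority) and output order (as added) are separate: a low-priority item can still appear early in the prompt if it survives.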
## The part I actually wanted: model-aware budgets + room to answer

Two recurring annoyances solved in one line:

```python
from contextcram import Packer

# Budget pulled from the model; hold back 2k tokens for the reply
packer = Packer(model="gpt-4o", reserve=2000)

print(packer.full_budget)  # 128000
print(packer.budget)       # 126000 <- what you actually pack into
```

`reserve=` kills the classic "the prompt fit, but there's no room left for the model to answer" bug. Set it to the same value as your `max_tokens` and the reply always has room.

## Real-world: with LangChain

```python
from contextcram import Packer
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

# retriever, memory, question, and SYSTEM_PROMPT come from your own app
llm = ChatOpenAI(model="gpt-4o")
docs = [d.page_content for d in retriever.invoke(question)]
history = [f"{m.type}: {m.content}" for m in memory.messages]

ctx = (
    Packer(model="gpt-4o", reserve=1500)
    .add(SYSTEM_PROMPT, priority="required")
    .add(history, priority="high", strategy="trim")
    .add("\n\n".join(docs), priority="medium", strategy="drop")
    .fit()
)

response = llm.invoke([SystemMessage(ctx.text), HumanMessage(question)])
```

Need exact token counts? Pass `tokenizer=tiktoken_tokenizer("gpt-4o")`, or wrap any tokenizer (Hugging Face, llama.cpp) with a one-line `CallableTokenizer`. The default is a fast characters-per-token heuristic, so there are no required dependencies.

## How is this different from Priompt / Prompt Poet?

Honest answer: the concept isn't new. Priompt (and its Python port) and Character.AI's Prompt Poet do priority-based context assembly too, and they're more powerful (component models, cache-aware truncation, templating). contextcram deliberately trades features for simplicity and zero dependencies:

- Pure stdlib: no Jinja2, no YAML, no heavy SDK.
- A 3-line API: `Packer(...).add(...).fit()`.
- Framework-agnostic: LangChain, LlamaIndex, raw SDKs, or nothing.
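For a feel of how little machinery a pure-stdlib version of the budget arithmetic needs, here is a rough sketch. Everything in it is hypothetical and not taken from contextcram's source: the `CONTEXT_WINDOWS` table, `packing_budget`, `heuristic_tokens`, and the 4-characters-per-token ratio are assumptions; only the 128000 and 126000 figures come from the example earlier in this post.

```python
import math
from typing import Callable

# Hypothetical lookup table; 128000 is the gpt-4o figure quoted in this post.
CONTEXT_WINDOWS = {"gpt-4o": 128_000}


def heuristic_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate tokens from length alone (assumed ratio, not exact)."""
    return math.ceil(len(text) / chars_per_token)


def packing_budget(model: str, reserve: int) -> int:
    """Tokens left for the prompt after holding back `reserve` for the reply."""
    full = CONTEXT_WINDOWS[model]
    if not 0 <= reserve < full:
        raise ValueError("reserve must be between 0 and the context window")
    return full - reserve


def fits(text: str, budget: int,
         count: Callable[[str], int] = heuristic_tokens) -> bool:
    """Any str -> int counter can be swapped in for the heuristic."""
    return count(text) <= budget


max_tokens = 2000  # the same value you would pass to the model call
budget = packing_budget("gpt-4o", reserve=max_tokens)
print(budget)                           # 126000
print(heuristic_tokens("hello world"))  # 3
print(fits("hello world", budget))      # True
```

Passing the counter as a plain callable is one way to keep the default dependency-free while still letting an exact tokenizer be plugged in later.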
If you want the smallest possible helper that does one thing (fit prioritized pieces into a budget), this is it.

## Try it

```bash
pip install contextcram
```

- ⭐ Repo: https://github.com/Waelr1985/contextcram
- 📦 PyPI: https://pypi.org/project/contextcram/

It's MIT-licensed, fully typed (`mypy --strict`), and tested across Python 3.10–3.13. I'd genuinely love feedback on the API and the default strategies: open an issue or drop a comment.

