How LLMs Now Monitor and Cut Their Own Token Spend
You have seen this loop before. An agent starts a "simple" task: scrape listings, refactor a repo, research a market. It fails, retries, re-reads context, apologizes, and tries all over again. Twenty minutes in, the dashboard shows six figures of tokens and zero useful deliverables. The model did not misbehave on purpose. The orchestrator simply never had a hard budget gate with ROI in mind.
Skillware v0.4.0 ships a new skill for exactly that gap: monitoring/token_limiter. It lets you monitor and limit any agent's token budget in real time, whether the agent runs on Gemini, Claude, OpenAI, DeepSeek, Ollama, or a custom Python loop. Same skill, same JSON, any runtime.
What Skillware is in a nutshell
Skillware is an open registry of installable agent capabilities. Each skill is a bundle:
- skill.py - deterministic Python (execute() returns JSON)
- instructions.md - when the model should call the tool
- manifest.yaml - schema, constitution, issuer
- Tests and docs - shipped in the wheel
You load a skill by ID, adapt it for your provider, and call execute() when the model makes a tool call. The model decides when; the skill decides how, predictably, every time. That split matters for budget control. You do not want the LLM guessing whether it is "allowed" to spend more tokens. You want a small, auditable function that answers: continue, warn, or stop.
Meet the Token Limiter
This skill is a budget gate, not a kill switch wired into OpenAI or Anthropic. After each model turn, your host loop passes in the cumulative token usage so far. The skill returns one of three actions:
| Action | Meaning |
|---|---|
| CONTINUE | Under the soft threshold - keep going |
| WARN | Approaching the limit (default 80%) - tighten scope |
| FORCE_TERMINATE | Hard ceiling hit - stop the loop |
Important nuance: the skill does not cancel API sessions or kill processes. It returns a structured decision. Your orchestrator must act on it. That is by design - Skillware skills stay portable and provider-neutral. No skill-specific API keys. No network calls. Pure Python math on numbers you supply.
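To make the three actions concrete, here is a minimal sketch of the kind of threshold check such a gate performs. It is not the skill's source: the function name evaluate_budget, the inclusive hard ceiling, and the warn_ratio parameter (set to the 80% default mentioned above) are assumptions for illustration.

```python
# Hypothetical sketch of a budget gate; not the Skillware implementation.
from enum import Enum


class Action(str, Enum):
    CONTINUE = "CONTINUE"
    WARN = "WARN"
    FORCE_TERMINATE = "FORCE_TERMINATE"


def evaluate_budget(current_tokens: int, max_tokens: int, warn_ratio: float = 0.8) -> Action:
    """Map cumulative usage to one of the three actions.

    Assumes the hard ceiling is inclusive (reaching max_tokens terminates)
    and the soft threshold sits at warn_ratio * max_tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if current_tokens >= max_tokens:
        return Action.FORCE_TERMINATE
    if current_tokens >= warn_ratio * max_tokens:
        return Action.WARN
    return Action.CONTINUE
```

The function is pure: same numbers in, same action out, with no network calls or provider state. That is what makes a gate like this easy to unit-test and audit.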
How it works in a real loop
Picture a scrape task with a 100,000 token ceiling.
- Agent runs turn 1 → host adds usage → calls token_limiter
- Turns 2 and 3 follow the same pattern
- At 85k tokens → WARN
- At 105k → FORCE_TERMINATE → host breaks the loop and surfaces the reason
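The walkthrough above can be simulated in a few lines. This is a hypothetical host loop that uses made-up per-turn usage figures in place of real API calls. The threshold check is inlined, as an assumption, so the sketch stays self-contained. The point it illustrates is that the host, not the gate, breaks the loop.

```python
# Hypothetical host loop: simulated per-turn usage, not real API calls.
def run_task(turn_usages, max_tokens=100_000, warn_ratio=0.8):
    """Accumulate usage per turn and act on the resulting decision."""
    total = 0
    log = []
    for turn, used in enumerate(turn_usages, start=1):
        total += used  # host-side cumulative count
        if total >= max_tokens:
            log.append((turn, total, "FORCE_TERMINATE"))
            break  # the host stops the loop; the gate only decides
        if total >= warn_ratio * max_tokens:
            log.append((turn, total, "WARN"))
            # a real host might trim context or narrow scope here
        else:
            log.append((turn, total, "CONTINUE"))
    return log


for entry in run_task([30_000, 30_000, 25_000, 20_000, 10_000]):
    print(entry)
```

With per-turn usage of 30k, 30k, 25k, 20k, and 10k, the cumulative total reaches 85k on turn 3 (WARN) and 105k on turn 4 (FORCE_TERMINATE), so the fifth turn never runs.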
Minimal integration:
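```python
from skillware.core.loader import SkillLoader

# Load the skill bundle by its registry ID and instantiate the skill class.
bundle = SkillLoader.load_skill("monitoring/token_limiter")
skill = bundle["module"].TokenLimiterSkill()

# The host supplies cumulative usage; here it already exceeds the ceiling.
result = skill.execute({
    "task_id": "scrape_listings_101",
    "current_token_count": 125_000,
    "max_allowed_tokens": 100_000,
    "model_id": "gpt-4o",
})

# The skill only returns a decision; the host must act on it.
if result["action"] == "FORCE_TERMINATE":
    raise RuntimeError(result["reason"])
```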
The host tracks cumulative current_token_count from whatever provider you use - usage metadata from the API, a local tokenizer, or your own accounting layer. The skill does not read billing dashboards for you.
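Because the count is cumulative, the host has to sum usage across turns itself. Here is a minimal sketch. It assumes each turn's usage arrives as a dict shaped like OpenAI Chat Completions (prompt_tokens, completion_tokens) or Anthropic Messages (input_tokens, output_tokens). UsageTracker is a hypothetical helper, not part of Skillware.

```python
# Hypothetical host-side accumulator; the key names are assumptions about
# what your provider's per-call usage object exposes.
from dataclasses import dataclass


@dataclass
class UsageTracker:
    total: int = 0

    def add_turn(self, usage: dict) -> int:
        """Add one turn's usage and return the new cumulative total."""
        if "prompt_tokens" in usage:
            # OpenAI Chat Completions style keys
            turn = usage["prompt_tokens"] + usage.get("completion_tokens", 0)
        else:
            # Anthropic Messages style keys
            turn = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
        self.total += turn
        return self.total


tracker = UsageTracker()
tracker.add_turn({"prompt_tokens": 1200, "completion_tokens": 300})
tracker.add_turn({"input_tokens": 2000, "output_tokens": 500})
print(tracker.total)  # value to pass as current_token_count
```

Every call re-sends the conversation as input, so the running total grows faster than the transcript itself. That is why the gate works on cumulative spend rather than on the last turn alone.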
Optional model_id maps to bundled list prices for indicative USD in the response. Handy for ops dashboards; not invoice-grade. Unknown models fall back to a blended rate with a warning in the payload.
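Here is a rough sketch of that lookup-with-fallback behavior. The model names and per-million-token rates below are placeholders, not real list prices. estimate_cost_usd is a hypothetical name, and the bundled pricing data may be structured differently.

```python
# Hypothetical indicative-cost lookup. Model names and rates are
# placeholders, not real list prices.
from typing import Optional

PRICING_PER_1M_TOKENS = {
    "example-model-a": 3.00,
    "example-model-b": 0.50,
}
BLENDED_PER_1M_TOKENS = 2.00  # placeholder fallback for unknown models


def estimate_cost_usd(model_id: Optional[str], total_tokens: int) -> tuple:
    """Return (indicative_usd or None, warnings) for a cumulative token count."""
    warnings = []
    if model_id is None:
        return None, warnings  # model_id is optional: no estimate without it
    rate = PRICING_PER_1M_TOKENS.get(model_id)
    if rate is None:
        rate = BLENDED_PER_1M_TOKENS
        warnings.append(f"unknown model {model_id!r}; using blended rate")
    return round(total_tokens * rate / 1_000_000, 4), warnings


print(estimate_cost_usd("example-model-a", 100_000))
print(estimate_cost_usd("mystery-model", 100_000))
```

The warning travels in the return value rather than being logged, so whatever renders the payload can show that the number is a rough fallback.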
Optional turn_id makes retries idempotent: same turn, same counts, same decision - no double-penalty if your loop replays a step.
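One way to get that idempotency is to cache decisions by (task_id, turn_id). A replayed turn then returns the stored result without re-running any side effects. In this hypothetical sketch, the side effect is a warnings_issued counter standing in for whatever a real gate might record. The class name and fields are assumptions.

```python
# Hypothetical sketch: turn-level idempotency via an in-memory cache.
class IdempotentGate:
    def __init__(self, warn_ratio: float = 0.8):
        self.warn_ratio = warn_ratio
        self._cache = {}          # (task_id, turn_id) -> decision dict
        self.evaluations = 0      # times the decision logic actually ran
        self.warnings_issued = 0  # stand-in for any per-decision side effect

    def check(self, task_id, turn_id, current_tokens, max_tokens):
        key = (task_id, turn_id)
        if turn_id is not None and key in self._cache:
            return self._cache[key]  # replayed turn: same decision, no new side effects
        self.evaluations += 1
        if current_tokens >= max_tokens:
            action = "FORCE_TERMINATE"
        elif current_tokens >= self.warn_ratio * max_tokens:
            action = "WARN"
            self.warnings_issued += 1
        else:
            action = "CONTINUE"
        decision = {"task_id": task_id, "turn_id": turn_id, "action": action}
        if turn_id is not None:
            self._cache[key] = decision
        return decision


gate = IdempotentGate()
gate.check("scrape_listings_101", "turn-3", 85_000, 100_000)
gate.check("scrape_listings_101", "turn-3", 85_000, 100_000)  # replay
print(gate.evaluations, gate.warnings_issued)
```

This sketch trusts the turn_id. A replay that arrives with different counts still gets the cached decision, which is one reasonable design choice among several.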
Architecture: Mind, Body, and a new category
The skill lives under a new monitoring/ category - room for more observability skills later.
- budget.py - pure evaluation logic (thresholds, cost estimate, ROI scaffold for v2)
- skill.py - thin BaseSkill wrapper, in-memory turn cache
- instructions.md - tells the agent to call this every turn and stop when it sees FORCE_TERMINATE
- data/model_pricing.json - indicative rates for common models
v1 enforces token limits only. ROI fields (expected_outcome, outcome_delivered, roi_value_usd) are accepted today as scaffolding for v2, so outcome-aware gates can arrive later without breaking the v1 contract.
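Here is a sketch of what a forward-compatible v1 contract can look like. The ROI fields are optional, accepted, and echoed back, while only the token counts drive the action. BudgetRequest, decide_v1, the roi_echo key, and the field types are assumptions for illustration, not the skill's actual schema.

```python
# Hypothetical v1 request handling: ROI fields are accepted and echoed,
# but only token counts drive the decision.
from dataclasses import dataclass
from typing import Optional

ROI_FIELDS = ("expected_outcome", "outcome_delivered", "roi_value_usd")


@dataclass
class BudgetRequest:
    task_id: str
    current_token_count: int
    max_allowed_tokens: int
    expected_outcome: Optional[str] = None     # assumed type
    outcome_delivered: Optional[bool] = None   # assumed type
    roi_value_usd: Optional[float] = None      # assumed type


def decide_v1(req: BudgetRequest, warn_ratio: float = 0.8) -> dict:
    if req.current_token_count >= req.max_allowed_tokens:
        action = "FORCE_TERMINATE"
    elif req.current_token_count >= warn_ratio * req.max_allowed_tokens:
        action = "WARN"
    else:
        action = "CONTINUE"
    # Echo ROI fields so callers can log them now; v1 does not act on them.
    roi = {name: getattr(req, name) for name in ROI_FIELDS if getattr(req, name) is not None}
    return {"task_id": req.task_id, "action": action, "roi_echo": roi}
```

Callers that start sending ROI fields today get identical decisions, so adding outcome-aware logic later changes behavior only for the new fields.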
Runnable examples ship in the repo: local loop simulation (token_limiter_loop.py), plus Gemini and Claude harnesses.
Install and try:
```bash
pip install skillware
```
Catalog page: docs/skills/token_limiter.md
Chain it with other skills
Budget control pairs naturally with optimization/prompt_rewriter - compress bloated context before the main call, then cap spend during the loop. Less waste in, hard ceiling out.
Running agents against contracts or wallets? Screen first with finance/wallet_screening, execute with defi/evm_tx_handler, and keep token_limiter in the outer loop so a stuck DeFi agent cannot burn budget forever. Three skills, one NLP-driven pipeline, any supported model.
Conclusion
Autonomous agents without token guardrails are expensive experiments. monitoring/token_limiter gives you a deterministic, testable answer to a simple question after every turn: are we still within budget? It ships in Skillware v0.4.0 today. Load it once, wire it into your loop, and stop paying for agents that retry themselves into oblivion.
Links
- Skillware Website
- Skillware on GitHub
- monitoring/token_limiter source
- v0.4.0 release notes
- Skill library
- Agent loops guide
Questions, issues, or skill ideas welcome in the repo. If you are building agent infra, start with a budget gate - your finance team will thank you later.