Token Budget Management
Token budgets give you fine-grained control over how tokens are allocated
across different context sources. Instead of a single max_tokens cap
where all sources compete, you can assign dedicated portions to system
prompts, memory, retrieval, tools, and other sources.
Why Token Budgets?
Without budgets, all ContextItem objects compete for the same token pool.
A large retrieval result can crowd out conversation history. A verbose
system prompt can leave no room for RAG context.
Token budgets solve this by:
- Allocating per-source caps -- guarantee each source gets its share.
- Reserving tokens -- hold back tokens for the LLM's response.
- Defining overflow strategies -- control what happens when a source exceeds its cap.
- Tracking shared pool usage -- diagnostics show how tokens flow.
Core Models
TokenBudget
The top-level budget model defines the total token capacity and how it is divided.
```python
from anchor import TokenBudget, BudgetAllocation, SourceType

budget = TokenBudget(
    total_tokens=8192,
    reserve_tokens=1200,  # Hold back for the LLM response
    allocations=[
        BudgetAllocation(source=SourceType.SYSTEM, max_tokens=800, priority=10),
        BudgetAllocation(source=SourceType.MEMORY, max_tokens=800, priority=8),
        BudgetAllocation(source=SourceType.RETRIEVAL, max_tokens=3200, priority=5),
    ],
)
```

| Field | Type | Description |
|---|---|---|
| `total_tokens` | `int` | Total token budget (must be > 0) |
| `allocations` | `list[BudgetAllocation]` | Per-source allocations |
| `reserve_tokens` | `int` | Tokens reserved for the LLM response (default: 0) |
The model validates that `sum(allocations) + reserve_tokens <= total_tokens`.
If the sum exceeds the total, a `ValueError` is raised at construction time.
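The invariant can be sketched in plain Python (a simplified model of the check, not the library's actual implementation):

```python
def validate_budget(total_tokens: int, reserve_tokens: int, allocation_caps: list[int]) -> None:
    """Mirror the TokenBudget invariant: allocations + reserve must fit the total."""
    committed = sum(allocation_caps) + reserve_tokens
    if committed > total_tokens:
        raise ValueError(
            f"allocations ({sum(allocation_caps)}) + reserve ({reserve_tokens}) "
            f"exceed total_tokens ({total_tokens})"
        )

# The budget above commits 800 + 800 + 3200 + 1200 = 6000 of 8192 tokens: valid.
validate_budget(8192, 1200, [800, 800, 3200])
```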
BudgetAllocation
Defines how many tokens a single source type may consume.
```python
from anchor import BudgetAllocation, SourceType

alloc = BudgetAllocation(
    source=SourceType.RETRIEVAL,
    max_tokens=3200,
    priority=5,
    overflow_strategy="truncate",  # or "drop"
)
```

| Field | Type | Default | Description |
|---|---|---|---|
| `source` | `SourceType` | -- | The source type this allocation applies to |
| `max_tokens` | `int` | -- | Maximum tokens for this source (must be > 0) |
| `priority` | `int` | `5` | Priority used for ordering (1--10) |
| `overflow_strategy` | `"truncate" \| "drop"` | `"truncate"` | What to do when the source exceeds its cap |
Overflow Strategies
When a source produces more items than its allocation allows, the overflow strategy determines what happens.
Truncate (default)
Items are sorted by (-priority, -score). Items are kept until the cap is
reached; the rest overflow.
```
Source "retrieval" cap: 2000 tokens

Item A (800 tokens, score=0.95) --> KEPT     (800 / 2000)
Item B (700 tokens, score=0.85) --> KEPT     (1500 / 2000)
Item C (600 tokens, score=0.70) --> OVERFLOW (would exceed 2000)
Item D (400 tokens, score=0.60) --> OVERFLOW
```

Drop
If the total tokens for the source exceed the cap, all items for that source are dropped. This is useful when partial retrieval context is worse than no retrieval context.
```
Source "retrieval" cap: 2000 tokens

Total items: 2500 tokens --> ALL DROPPED (exceeds cap)
```

> [!CAUTION]
> **Drop strategy.** Use `"drop"` only when your application requires all-or-nothing behavior for a source. In most cases, `"truncate"` is the safer choice.
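Both strategies can be modeled in a few lines of plain Python. This is a simplified sketch of the enforcement logic, assuming each item carries `token_count`, `priority`, and `score` fields; the library's internals may differ:

```python
from dataclasses import dataclass

@dataclass
class Item:
    token_count: int
    priority: int
    score: float

def enforce_cap(items: list[Item], cap: int, strategy: str = "truncate") -> list[Item]:
    total = sum(i.token_count for i in items)
    if strategy == "drop":
        # All-or-nothing: discard every item if the source exceeds its cap.
        return [] if total > cap else list(items)
    # Truncate: sort by (-priority, -score); keep items until the cap is reached.
    kept, used = [], 0
    for item in sorted(items, key=lambda i: (-i.priority, -i.score)):
        if used + item.token_count > cap:
            break  # this item and everything after it overflow
        kept.append(item)
        used += item.token_count
    return kept

items = [
    Item(800, 5, 0.95), Item(700, 5, 0.85),
    Item(600, 5, 0.70), Item(400, 5, 0.60),
]
print(len(enforce_cap(items, 2000)))          # keeps A and B, as in the diagram
print(len(enforce_cap(items, 2000, "drop")))  # 2500 > 2000, so everything drops
```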
Reserve Tokens
The `reserve_tokens` field is subtracted from the pipeline's `max_tokens` to
compute the effective context window. This guarantees space for the LLM's response.
```python
from anchor import ContextPipeline, TokenBudget

budget = TokenBudget(total_tokens=8192, reserve_tokens=1200)
pipeline = ContextPipeline(max_tokens=8192).with_budget(budget)
# Effective context window = 8192 - 1200 = 6992 tokens
```

The pipeline will raise a `PipelineExecutionError` if `reserve_tokens >= max_tokens` (leaving zero or negative space for context).
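The arithmetic is simple enough to state as a standalone check (a plain-Python sketch of the guard, not the pipeline's actual code):

```python
def effective_window(max_tokens: int, reserve_tokens: int) -> int:
    """Tokens actually available for context after reserving response space."""
    if reserve_tokens >= max_tokens:
        # The pipeline raises PipelineExecutionError in this situation.
        raise ValueError(
            f"reserve_tokens ({reserve_tokens}) leaves no room in max_tokens ({max_tokens})"
        )
    return max_tokens - reserve_tokens

print(effective_window(8192, 1200))  # 6992
```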
Shared Pool
Tokens not explicitly allocated to any source form the shared pool. Sources without an allocation compete for this pool during window assembly.
```python
budget = TokenBudget(
    total_tokens=8192,
    reserve_tokens=1200,                                                 # 1200
    allocations=[
        BudgetAllocation(source=SourceType.SYSTEM, max_tokens=800),      # 800
        BudgetAllocation(source=SourceType.RETRIEVAL, max_tokens=3200),  # 3200
    ],
)

print(budget.shared_pool)  # 8192 - 1200 - 800 - 3200 = 2992
```

Items from sources with no explicit allocation (e.g., `SourceType.MEMORY`,
`SourceType.CONVERSATION`, `SourceType.USER` in the example above) draw
from the shared pool.

The `get_allocation()` method returns the per-source cap if one exists, or
the shared pool size as a fallback:

```python
print(budget.get_allocation(SourceType.SYSTEM))     # 800
print(budget.get_allocation(SourceType.RETRIEVAL))  # 3200
print(budget.get_allocation(SourceType.MEMORY))     # 2992 (shared pool)
```

Preset Factories
Three factory functions provide sensible defaults for common application
types. Each accepts a max_tokens parameter and returns a configured
TokenBudget.
default_chat_budget
Optimized for conversational applications with moderate retrieval.
```python
from anchor import default_chat_budget

budget = default_chat_budget(max_tokens=8192)
```

| Source | Allocation | Percentage |
|---|---|---|
| System | 819 | 10% |
| Memory | 819 | 10% |
| Conversation | 1638 | 20% |
| Retrieval | 2048 | 25% |
| Reserve | 1228 | 15% |
| Shared pool | -- | 20% |
default_rag_budget
Optimized for RAG-heavy applications where retrieval dominates.
```python
from anchor import default_rag_budget

budget = default_rag_budget(max_tokens=8192)
```

| Source | Allocation | Percentage |
|---|---|---|
| System | 819 | 10% |
| Memory | 409 | 5% |
| Conversation | 819 | 10% |
| Retrieval | 3276 | 40% |
| Reserve | 1228 | 15% |
| Shared pool | -- | 20% |
default_agent_budget
Optimized for agentic applications with tool usage.
```python
from anchor import default_agent_budget

budget = default_agent_budget(max_tokens=8192)
```

| Source | Allocation | Percentage |
|---|---|---|
| System | 1228 | 15% |
| Memory | 819 | 10% |
| Conversation | 1228 | 15% |
| Retrieval | 1638 | 20% |
| Tool | 1228 | 15% |
| Reserve | 1228 | 15% |
| Shared pool | -- | 10% |
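The preset tables above are consistent with simple percentage splits, truncated to whole tokens. A hand-rolled equivalent, shown here for the agent preset (illustrative arithmetic only; the factory internals may differ):

```python
# Fractions matching the default_agent_budget table above (assumed, not the library's code).
AGENT_FRACTIONS = {
    "system": 0.15, "memory": 0.10, "conversation": 0.15,
    "retrieval": 0.20, "tool": 0.15, "reserve": 0.15,
}

def split_budget(max_tokens: int, fractions: dict[str, float]) -> dict[str, int]:
    """Allocate whole-token caps per source; whatever remains is the shared pool."""
    caps = {name: int(max_tokens * frac) for name, frac in fractions.items()}
    caps["shared_pool"] = max_tokens - sum(caps.values())
    return caps

caps = split_budget(8192, AGENT_FRACTIONS)
print(caps["system"], caps["retrieval"], caps["reserve"])  # 1228 1638 1228
```

Note that truncation makes the shared pool slightly larger than a flat 10% of the total; every token not claimed by an allocation or the reserve lands there.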
> [!TIP]
> **Custom budgets.** The presets are a starting point. For production workloads, construct a `TokenBudget` directly with allocations tuned to your application's data distribution.
Using Budgets with the Pipeline
Attach a budget to the pipeline with .with_budget():
```python
from anchor import ContextPipeline, default_rag_budget

budget = default_rag_budget(max_tokens=8192)
pipeline = (
    ContextPipeline(max_tokens=8192)
    .with_budget(budget)
    .add_system_prompt("You are a helpful assistant.")
)
result = pipeline.build("What is context engineering?")
```

You can also pass the budget directly to the constructor:

```python
pipeline = ContextPipeline(max_tokens=8192, budget=budget)
```

Budget Diagnostics
When a budget is configured, the pipeline's diagnostics include extra fields that track how tokens were spent:
```python
result = pipeline.build("What is context engineering?")
d = result.diagnostics

# Tokens used per source type
print(d.get("token_usage_by_source"))
# e.g. {"system": 45, "retrieval": 1200, "memory": 300}

# Tokens used by sources without explicit allocations
print(d.get("shared_pool_usage"))
# e.g. 300

# Items dropped because a source exceeded its cap
print(d.get("budget_overflow_by_source"))
# e.g. {"retrieval": 3} -- 3 retrieval items were dropped
```

> [!NOTE]
> **Budget overflow vs window overflow.** Budget overflow happens during per-source cap enforcement (Stage 4a). Window overflow happens when total items still exceed `max_tokens` after budget filtering (Stage 4b). Both are tracked in diagnostics.
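A small helper can turn the `token_usage_by_source` field into per-source utilization figures. This is a hypothetical convenience built only on the diagnostic dict shown above, with made-up caps for illustration:

```python
def budget_utilization(usage: dict[str, int], caps: dict[str, int]) -> dict[str, float]:
    """Fraction of each source's cap actually consumed (0.0 to 1.0)."""
    return {
        source: used / caps[source]
        for source, used in usage.items()
        if caps.get(source)  # skip sources with no (or zero) cap
    }

usage = {"system": 45, "retrieval": 1200, "memory": 300}
caps = {"system": 800, "retrieval": 3200, "memory": 400}  # hypothetical caps
for source, frac in budget_utilization(usage, caps).items():
    print(f"{source}: {frac:.0%} of cap used")
```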
Source Types
The `SourceType` enum defines the valid source categories:

| Value | Description |
|---|---|
| `SourceType.SYSTEM` | System prompts and instructions |
| `SourceType.MEMORY` | Persistent memory entries |
| `SourceType.CONVERSATION` | Conversation history turns |
| `SourceType.RETRIEVAL` | RAG / search results |
| `SourceType.TOOL` | Tool or function call outputs |
| `SourceType.USER` | Direct user-provided context |
See Also
- Architecture -- how budgets integrate into the pipeline
- Protocols -- protocol-based plugin system
- Pipeline Guide -- practical pipeline examples