Token Budget Management

Token budgets give you fine-grained control over how tokens are allocated across different context sources. Instead of a single max_tokens cap where all sources compete, you can assign dedicated portions to system prompts, memory, retrieval, tools, and other sources.

Why Token Budgets?

Without budgets, all ContextItem objects compete for the same token pool. A large retrieval result can crowd out conversation history. A verbose system prompt can leave no room for RAG context.

Token budgets solve this by:

  • Allocating per-source caps -- guarantee each source gets its share.
  • Reserving tokens -- hold back tokens for the LLM's response.
  • Defining overflow strategies -- control what happens when a source exceeds its cap.
  • Tracking shared pool usage -- diagnostics report how many tokens each source and the shared pool consumed.

Core Models

TokenBudget

The top-level budget model defines the total token capacity and how it is divided.

from anchor import TokenBudget, BudgetAllocation, SourceType

budget = TokenBudget(
    total_tokens=8192,
    reserve_tokens=1200,  # Hold back for the LLM response
    allocations=[
        BudgetAllocation(source=SourceType.SYSTEM, max_tokens=800, priority=10),
        BudgetAllocation(source=SourceType.MEMORY, max_tokens=800, priority=8),
        BudgetAllocation(source=SourceType.RETRIEVAL, max_tokens=3200, priority=5),
    ],
)

| Field | Type | Description |
| --- | --- | --- |
| total_tokens | int | Total token budget (must be > 0) |
| allocations | list[BudgetAllocation] | Per-source allocations |
| reserve_tokens | int | Tokens reserved for the LLM response (default: 0) |

The model validates that the sum of every allocation's max_tokens plus reserve_tokens does not exceed total_tokens. If it does, a ValueError is raised at construction time.
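
The same rule can be checked by hand before constructing a budget. The following is a minimal standalone sketch, not anchor's implementation: check_budget is a hypothetical helper, and the per-source caps are passed as a plain dict rather than BudgetAllocation objects.

```python
def check_budget(total_tokens: int, reserve_tokens: int, caps: dict[str, int]) -> int:
    """Return the unallocated remainder, or raise ValueError if over-committed.

    Hypothetical helper mirroring the documented rule:
    sum of per-source caps + reserve_tokens <= total_tokens.
    """
    committed = sum(caps.values()) + reserve_tokens
    if committed > total_tokens:
        raise ValueError(
            f"allocations + reserve = {committed} exceeds total_tokens = {total_tokens}"
        )
    return total_tokens - committed


# Numbers from the TokenBudget example above: 800 + 800 + 3200 + 1200 = 6000.
remaining = check_budget(8192, 1200, {"system": 800, "memory": 800, "retrieval": 3200})
print(remaining)  # 2192
```

The returned remainder is the portion left unallocated, which is what the Shared Pool section describes.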

BudgetAllocation

Defines how many tokens a single source type may consume.

from anchor import BudgetAllocation, SourceType

alloc = BudgetAllocation(
    source=SourceType.RETRIEVAL,
    max_tokens=3200,
    priority=5,
    overflow_strategy="truncate",  # or "drop"
)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| source | SourceType | -- | The source type this allocation applies to |
| max_tokens | int | -- | Maximum tokens for this source (must be > 0) |
| priority | int | 5 | Priority used for ordering (1-10) |
| overflow_strategy | "truncate" or "drop" | "truncate" | What to do when the source exceeds its cap |

Overflow Strategies

When a source's items add up to more tokens than its allocation allows, the overflow strategy determines what happens.

Truncate (default)

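Items are sorted by (-priority, -score), that is, highest priority first, then highest score. Items are kept in that order until the next one would exceed the cap; that item and everything after it overflow, even if a later, smaller item would still fit (Item D below).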


Source "retrieval" cap: 2000 tokens

Item A (800 tokens, score=0.95) --> KEPT     (800 / 2000)
Item B (700 tokens, score=0.85) --> KEPT     (1500 / 2000)
Item C (600 tokens, score=0.70) --> OVERFLOW (would exceed 2000)
Item D (400 tokens, score=0.60) --> OVERFLOW
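
The trace above can be reproduced with a short standalone sketch. This is not anchor's code: Item and truncate are hypothetical names, the field names are assumptions, and stopping at the first item that does not fit is inferred from the example (Item D would fit on its own but is still marked as overflow).

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Stand-in for a context item; field names are assumptions for this sketch.
    name: str
    tokens: int
    score: float
    priority: int = 5


def truncate(items: list[Item], cap: int) -> tuple[list[Item], list[Item]]:
    """Split items into (kept, overflow) under a token cap."""
    ordered = sorted(items, key=lambda i: (-i.priority, -i.score))
    kept: list[Item] = []
    used = 0
    for index, item in enumerate(ordered):
        if used + item.tokens > cap:
            # First item that does not fit: it and everything after it overflow.
            return kept, ordered[index:]
        kept.append(item)
        used += item.tokens
    return kept, []


items = [
    Item("A", 800, 0.95),
    Item("B", 700, 0.85),
    Item("C", 600, 0.70),
    Item("D", 400, 0.60),
]
kept, overflow = truncate(items, cap=2000)
print([i.name for i in kept])      # ['A', 'B']
print([i.name for i in overflow])  # ['C', 'D']
```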

Drop

If the total tokens for the source exceed the cap, all items for that source are dropped. This is useful when partial retrieval context is worse than no retrieval context.


Source "retrieval" cap: 2000 tokens

Total items: 2500 tokens --> ALL DROPPED (exceeds cap)

[!CAUTION] Drop strategy: Use "drop" only when your application requires all-or-nothing behavior for a source. In most cases, "truncate" is the safer choice.
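
For comparison with truncation, the all-or-nothing rule fits in a few lines. This is a sketch under the assumption that only the summed token count matters; apply_drop is a hypothetical helper, not part of anchor.

```python
def apply_drop(token_counts: list[int], cap: int) -> list[int]:
    """Return all token counts if they fit under the cap together, else none."""
    if sum(token_counts) > cap:
        return []  # all-or-nothing: the whole source is dropped
    return list(token_counts)


print(apply_drop([800, 700, 600, 400], cap=2000))  # [] (2500 > 2000)
print(apply_drop([800, 700], cap=2000))            # [800, 700]
```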

Reserve Tokens

reserve_tokens is subtracted from the pipeline's max_tokens to give the effective context window, which guarantees space for the LLM's response.

from anchor import ContextPipeline, TokenBudget

budget = TokenBudget(total_tokens=8192, reserve_tokens=1200)
pipeline = ContextPipeline(max_tokens=8192).with_budget(budget)
# Effective context window = 8192 - 1200 = 6992 tokens

The pipeline will raise a PipelineExecutionError if reserve_tokens >= max_tokens (leaving zero or negative space for context).
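
The arithmetic and the guard condition can be sketched without the library. effective_window is a hypothetical helper; it raises ValueError as a stand-in, whereas the pipeline itself is documented to raise PipelineExecutionError.

```python
def effective_window(max_tokens: int, reserve_tokens: int) -> int:
    """Tokens left for context after holding back the response reserve."""
    if reserve_tokens >= max_tokens:
        # The real pipeline is documented to raise PipelineExecutionError here;
        # this standalone sketch uses ValueError instead.
        raise ValueError("reserve_tokens leaves no room for context")
    return max_tokens - reserve_tokens


print(effective_window(8192, 1200))  # 6992
```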

Shared Pool

Tokens not explicitly allocated to any source form the shared pool. Sources without an allocation compete for this pool during window assembly.

from anchor import TokenBudget, BudgetAllocation, SourceType

budget = TokenBudget(
    total_tokens=8192,
    reserve_tokens=1200,  # 1200 held back for the response
    allocations=[
        BudgetAllocation(source=SourceType.SYSTEM, max_tokens=800),      # 800
        BudgetAllocation(source=SourceType.RETRIEVAL, max_tokens=3200),  # 3200
    ],
)
print(budget.shared_pool)  # 8192 - 1200 - 800 - 3200 = 2992

Items from sources with no explicit allocation (e.g., SourceType.MEMORY, SourceType.CONVERSATION, SourceType.USER in the example above) draw from the shared pool.

The get_allocation() method returns the per-source cap if one exists, or the shared pool size as a fallback:

print(budget.get_allocation(SourceType.SYSTEM))     # 800
print(budget.get_allocation(SourceType.RETRIEVAL))   # 3200
print(budget.get_allocation(SourceType.MEMORY))      # 2992 (shared pool)
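
The fallback behavior can be modeled in a few lines. BudgetModel below is a simplified stand-in written for illustration, with sources as plain strings; it mirrors the documented arithmetic but is not the TokenBudget implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetModel:
    # Simplified stand-in for TokenBudget; sources are plain strings here.
    total_tokens: int
    reserve_tokens: int = 0
    caps: dict[str, int] = field(default_factory=dict)

    @property
    def shared_pool(self) -> int:
        # Whatever is neither reserved nor explicitly allocated.
        return self.total_tokens - self.reserve_tokens - sum(self.caps.values())

    def get_allocation(self, source: str) -> int:
        # Explicit cap if present, otherwise fall back to the shared pool size.
        return self.caps.get(source, self.shared_pool)


model = BudgetModel(8192, 1200, {"system": 800, "retrieval": 3200})
print(model.shared_pool)                  # 2992
print(model.get_allocation("retrieval"))  # 3200
print(model.get_allocation("memory"))     # 2992
```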

Preset Factories

Three factory functions provide sensible defaults for common application types. Each accepts a max_tokens parameter and returns a configured TokenBudget.

default_chat_budget

Optimized for conversational applications with moderate retrieval.

from anchor import default_chat_budget

budget = default_chat_budget(max_tokens=8192)

| Source | Allocation (tokens) | Percentage |
| --- | --- | --- |
| System | 819 | 10% |
| Memory | 819 | 10% |
| Conversation | 1638 | 20% |
| Retrieval | 2048 | 25% |
| Reserve | 1228 | 15% |
| Shared pool | -- | 20% |

default_rag_budget

Optimized for RAG-heavy applications where retrieval dominates.

from anchor import default_rag_budget

budget = default_rag_budget(max_tokens=8192)

| Source | Allocation (tokens) | Percentage |
| --- | --- | --- |
| System | 819 | 10% |
| Memory | 409 | 5% |
| Conversation | 819 | 10% |
| Retrieval | 3276 | 40% |
| Reserve | 1228 | 15% |
| Shared pool | -- | 20% |

default_agent_budget

Optimized for agentic applications with tool usage.

from anchor import default_agent_budget

budget = default_agent_budget(max_tokens=8192)
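
| Source | Allocation (tokens) | Percentage |
| --- | --- | --- |
| System | 1228 | 15% |
| Memory | 819 | 10% |
| Conversation | 1228 | 15% |
| Retrieval | 1638 | 20% |
| Tool | 1228 | 15% |
| Reserve | 1228 | 15% |
| Shared pool | -- | 10% |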

[!TIP] Custom budgets: The presets are a starting point. For production workloads, construct a TokenBudget directly with allocations tuned to your application's data distribution.
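
The token counts in the preset tables are consistent with taking each percentage of max_tokens and rounding down. The sketch below reproduces the default_chat_budget column on that assumption; split_by_percent is a hypothetical helper, and the presets themselves may round differently for other values of max_tokens.

```python
def split_by_percent(max_tokens: int, percents: dict[str, int]) -> dict[str, int]:
    """Turn whole-number percentages into token counts, rounding down."""
    return {name: max_tokens * pct // 100 for name, pct in percents.items()}


# Percentages from the default_chat_budget table (shared pool excluded).
chat_percents = {
    "system": 10,
    "memory": 10,
    "conversation": 20,
    "retrieval": 25,
    "reserve": 15,
}
tokens = split_by_percent(8192, chat_percents)
print(tokens["retrieval"])          # 2048
print(8192 - sum(tokens.values()))  # 1640
```

Under this rounding assumption the remainder at 8192 tokens is 1640, slightly above a nominal 20%, because each rounded-down fraction falls through to the shared pool. The same helper can give starting values for a custom budget.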

Using Budgets with the Pipeline

Attach a budget to the pipeline with .with_budget():

from anchor import ContextPipeline, default_rag_budget

budget = default_rag_budget(max_tokens=8192)
pipeline = (
    ContextPipeline(max_tokens=8192)
    .with_budget(budget)
    .add_system_prompt("You are a helpful assistant.")
)
result = pipeline.build("What is context engineering?")

You can also pass the budget directly to the constructor:

pipeline = ContextPipeline(max_tokens=8192, budget=budget)

Budget Diagnostics

When a budget is configured, the pipeline's diagnostics include extra fields that track how tokens were spent:

result = pipeline.build("What is context engineering?")
d = result.diagnostics

# Tokens used per source type
print(d.get("token_usage_by_source"))
# e.g. {"system": 45, "retrieval": 1200, "memory": 300}

# Tokens used by sources without explicit allocations
print(d.get("shared_pool_usage"))
# e.g. 300

# Items dropped because a source exceeded its cap
print(d.get("budget_overflow_by_source"))
# e.g. {"retrieval": 3}  -- 3 retrieval items were dropped

[!NOTE] Budget overflow vs window overflow: Budget overflow happens during per-source cap enforcement (Stage 4a). Window overflow happens when the remaining items still exceed max_tokens after budget filtering (Stage 4b). Both are tracked in diagnostics.
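
For logging, the three budget fields can be folded into a short report. The sketch below works on a plain dict with the keys shown above; summarize_budget is a hypothetical helper, and the sample values are copied from the example comments rather than from a real pipeline run.

```python
def summarize_budget(diagnostics: dict) -> list[str]:
    """Format the budget-related diagnostics fields as report lines."""
    usage = diagnostics.get("token_usage_by_source") or {}
    overflow = diagnostics.get("budget_overflow_by_source") or {}
    lines = []
    for source in sorted(usage):
        dropped = overflow.get(source, 0)
        lines.append(f"{source}: {usage[source]} tokens, {dropped} overflowed")
    lines.append(f"shared pool: {diagnostics.get('shared_pool_usage', 0)} tokens")
    return lines


# Sample values copied from the comments in the example above.
sample = {
    "token_usage_by_source": {"system": 45, "retrieval": 1200, "memory": 300},
    "shared_pool_usage": 300,
    "budget_overflow_by_source": {"retrieval": 3},
}
for line in summarize_budget(sample):
    print(line)
```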

Source Types

The SourceType enum defines the valid source categories:

| Value | Description |
| --- | --- |
| SourceType.SYSTEM | System prompts and instructions |
| SourceType.MEMORY | Persistent memory entries |
| SourceType.CONVERSATION | Conversation history turns |
| SourceType.RETRIEVAL | RAG / search results |
| SourceType.TOOL | Tool or function call outputs |
| SourceType.USER | Direct user-provided context |
