Ingestion Guide

The ingestion module converts raw documents into ContextItem objects ready for indexing and retrieval. The pipeline follows four stages:


Parse --> Chunk --> Enrich --> Index
  1. Parse -- extract text and metadata from files (Markdown, HTML, PDF, plain text).
  2. Chunk -- split the text into retrieval-sized segments.
  3. Enrich -- attach metadata (doc IDs, chunk positions, custom fields).
  4. Index -- feed ContextItem objects to a retriever.

Quick Start

from anchor.ingestion import DocumentIngester

ingester = DocumentIngester()

# Ingest raw text
items = ingester.ingest_text("Astro-context is a modular RAG framework.")
print(items[0].content)

# Ingest a single file
items = ingester.ingest_file("docs/intro.md")

# Ingest an entire directory
items = ingester.ingest_directory("docs/", extensions=[".md", ".txt"])

DocumentIngester

DocumentIngester orchestrates the full pipeline. It auto-detects parsers from file extensions and delegates chunking to any Chunker implementation.

from anchor.ingestion import (
    DocumentIngester,
    SentenceChunker,
    MetadataEnricher,
)
from anchor.models.context import SourceType

def add_category(text, idx, total, meta):
    meta["category"] = "documentation"
    return meta

ingester = DocumentIngester(
    chunker=SentenceChunker(chunk_size=256, overlap=1),
    enricher=MetadataEnricher(enrichers=[add_category]),
    source_type=SourceType.RETRIEVAL,
    priority=5,
)

items = ingester.ingest_text(
    "First sentence. Second sentence. Third sentence.",
    doc_id="manual-id-001",
)
for item in items:
    print(item.id, item.metadata.get("category"))

Constructor Parameters

| Parameter   | Type                              | Default                     | Description                            |
|-------------|-----------------------------------|-----------------------------|----------------------------------------|
| chunker     | Chunker \| None                   | RecursiveCharacterChunker() | Chunking strategy                      |
| tokenizer   | Tokenizer \| None                 | default counter             | Token counter for size-aware splitting |
| parsers     | dict[str, DocumentParser] \| None | built-in parsers            | Extension-to-parser overrides          |
| enricher    | MetadataEnricher \| None          | None                        | Chain of metadata enrichment functions |
| source_type | SourceType                        | SourceType.RETRIEVAL        | Source type tag for produced items     |
| priority    | int                               | 5                           | Priority value for produced items      |

Methods

  • ingest_text(text, doc_id=None, doc_metadata=None) -- Chunk raw text into ContextItem objects (see the example below).
  • ingest_file(path, doc_id=None) -- Parse and chunk a single file.
  • ingest_directory(directory, glob_pattern="**/*", extensions=None) -- Recursively ingest all matching files.
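
The doc_metadata argument attaches document-level fields to the call; judging by the extract_chunk_metadata helper described under Metadata below, those fields travel with each chunk. A minimal sketch:

from anchor.ingestion import DocumentIngester

ingester = DocumentIngester()

# doc_id names the document; doc_metadata rides along with every chunk
items = ingester.ingest_text(
    "First paragraph.\n\nSecond paragraph.",
    doc_id="notes-001",
    doc_metadata={"author": "docs-team"},
)
for item in items:
    print(item.id, item.metadata)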

Chunkers

All chunkers implement the Chunker protocol and expose a chunk(text, metadata=None) -> list[str] method.
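
Any object with that method can act as a chunker. As a minimal sketch, here is a hypothetical custom chunker (not part of the library) that splits on blank lines:

class BlankLineChunker:
    """Treat each blank-line-separated paragraph as one chunk."""

    def chunk(self, text: str, metadata: dict | None = None) -> list[str]:
        return [part.strip() for part in text.split("\n\n") if part.strip()]

An instance can be passed directly as the chunker argument to DocumentIngester.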

FixedSizeChunker

Splits text into fixed-size chunks measured by token count, with configurable overlap.

from anchor.ingestion import FixedSizeChunker

chunker = FixedSizeChunker(chunk_size=128, overlap=20)
chunks = chunker.chunk("A very long document text...")

| Parameter  | Type              | Default         | Description                    |
|------------|-------------------|-----------------|--------------------------------|
| chunk_size | int               | 512             | Maximum tokens per chunk       |
| overlap    | int               | 50              | Overlapping tokens at boundary |
| tokenizer  | Tokenizer \| None | default counter | Token counter                  |

RecursiveCharacterChunker

Splits text using a hierarchy of separators, falling back to finer splits when a section exceeds the token budget. Separator hierarchy: "\n\n" --> "\n" --> ". " --> " ".

from anchor.ingestion import RecursiveCharacterChunker

chunker = RecursiveCharacterChunker(chunk_size=256, overlap=30)
chunks = chunker.chunk("Paragraph one.\n\nParagraph two.\n\nParagraph three.")

| Parameter  | Type                    | Default                    | Description                    |
|------------|-------------------------|----------------------------|--------------------------------|
| chunk_size | int                     | 512                        | Maximum tokens per chunk       |
| overlap    | int                     | 50                         | Overlapping tokens at boundary |
| separators | tuple[str, ...] \| None | ("\n\n", "\n", ". ", " ") | Custom separator hierarchy     |
| tokenizer  | Tokenizer \| None       | default counter            | Token counter                  |

SentenceChunker

Groups sentences to fill chunks up to the token budget. Overlap is measured in sentences rather than tokens.

from anchor.ingestion import SentenceChunker

chunker = SentenceChunker(chunk_size=256, overlap=1)
chunks = chunker.chunk("First sentence. Second sentence. Third sentence.")

| Parameter  | Type              | Default         | Description                       |
|------------|-------------------|-----------------|-----------------------------------|
| chunk_size | int               | 512             | Maximum tokens per chunk          |
| overlap    | int               | 1               | Overlapping sentences at boundary |
| tokenizer  | Tokenizer \| None | default counter | Token counter                     |

SemanticChunker

Splits text at semantic boundaries using embedding similarity: a new chunk begins wherever the cosine similarity between adjacent sentence embeddings drops below the threshold.

import math
from anchor.ingestion import SemanticChunker

# Deterministic, position-dependent embed function for demonstration
def embed_fn(texts: list[str]) -> list[list[float]]:
    return [
        [math.sin(i + c) for c in range(8)]
        for i, _ in enumerate(texts)
    ]

chunker = SemanticChunker(
    embed_fn=embed_fn,
    threshold=0.5,
    chunk_size=256,
    min_chunk_size=30,
)
chunks = chunker.chunk("Sentence one. Sentence two. Totally different topic here.")

| Parameter      | Type                                     | Default         | Description                          |
|----------------|------------------------------------------|-----------------|--------------------------------------|
| embed_fn       | Callable[[list[str]], list[list[float]]] | required        | Embedding function                   |
| tokenizer      | Tokenizer \| None                        | default counter | Token counter                        |
| threshold      | float                                    | 0.5             | Cosine similarity split threshold    |
| chunk_size     | int                                      | 512             | Maximum tokens per chunk             |
| min_chunk_size | int                                      | 50              | Minimum tokens; smaller chunks merge |

CodeChunker

Splits source code at function, class, and top-level definition boundaries using language-specific regex patterns. Falls back to RecursiveCharacterChunker when no boundaries are detected.

Supported languages: Python, JavaScript, TypeScript, Go, Rust.

from anchor.ingestion import CodeChunker

chunker = CodeChunker(language="python", chunk_size=256)
code = '''
def hello():
    print("hello")

def world():
    print("world")

class Greeter:
    pass
'''
chunks = chunker.chunk(code)

| Parameter  | Type              | Default         | Description                                |
|------------|-------------------|-----------------|--------------------------------------------|
| language   | str \| None       | None            | Language name; auto-detected from metadata |
| chunk_size | int               | 512             | Maximum tokens per chunk                   |
| overlap    | int               | 50              | Overlap tokens for fallback chunker        |
| tokenizer  | Tokenizer \| None | default counter | Token counter                              |

TableAwareChunker

Detects markdown and HTML tables, preserves them as atomic units, and delegates prose to an inner chunker. Oversized tables are split row-by-row with the header preserved.

from anchor.ingestion import TableAwareChunker

chunker = TableAwareChunker(chunk_size=256)
text = """
Some introductory text.

| Name  | Value |
|-------|-------|
| alpha | 1     |
| beta  | 2     |

More prose after the table.
"""
chunks = chunker.chunk(text)

| Parameter     | Type              | Default                     | Description                |
|---------------|-------------------|-----------------------------|----------------------------|
| inner_chunker | Any \| None       | RecursiveCharacterChunker() | Chunker for non-table text |
| chunk_size    | int               | 512                         | Maximum tokens per chunk   |
| tokenizer     | Tokenizer \| None | default counter             | Token counter              |

ParentChildChunker

Two-level hierarchical chunker that produces large parent chunks for context and small child chunks for retrieval. Use chunk_with_metadata() to get child chunks with parent_id and parent_text in their metadata.

from anchor.ingestion import ParentChildChunker

chunker = ParentChildChunker(
    parent_chunk_size=512,
    child_chunk_size=128,
    parent_overlap=50,
    child_overlap=10,
)

# Plain string chunks (Chunker protocol)
children = chunker.chunk("A long document to split hierarchically...")

# Chunks with metadata (includes parent_id, parent_text)
children_with_meta = chunker.chunk_with_metadata("A long document...")
for text, meta in children_with_meta:
    print(meta["parent_id"], len(text))

| Parameter         | Type              | Default         | Description                         |
|-------------------|-------------------|-----------------|-------------------------------------|
| parent_chunk_size | int               | 1024            | Token size for parent chunks        |
| child_chunk_size  | int               | 256             | Token size for child chunks         |
| parent_overlap    | int               | 100             | Token overlap between parent chunks |
| child_overlap     | int               | 25              | Token overlap between child chunks  |
| tokenizer         | Tokenizer \| None | default counter | Token counter                       |

Parsers

Parsers implement the DocumentParser protocol and return (text, metadata) tuples.

| Parser          | Extensions     | Dependencies |
|-----------------|----------------|--------------|
| PlainTextParser | .txt           | none         |
| MarkdownParser  | .md, .markdown | none         |
| HTMLParser      | .html, .htm    | none         |
| PDFParser       | .pdf           | pypdf        |

[!NOTE] PDFParser requires the optional pdf extra: pip install astro-anchor[pdf]

DocumentIngester auto-selects the parser by file extension. Override via the parsers constructor argument:

from anchor.ingestion import DocumentIngester, PlainTextParser

ingester = DocumentIngester(
    parsers={".log": PlainTextParser()},
)
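
Custom parsers work the same way: anything that yields a (text, metadata) tuple satisfies the protocol. A minimal sketch, assuming the protocol method is named parse(path) (the method name and the CSV handling here are illustrative, not confirmed API):

from pathlib import Path

class CSVAsTextParser:
    """Hypothetical parser that treats CSV files as plain text."""

    def parse(self, path):
        text = Path(path).read_text(encoding="utf-8")
        return text, {"format": "csv", "source_path": str(path)}

ingester = DocumentIngester(parsers={".csv": CSVAsTextParser()})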

Metadata

Helper Functions

  • generate_doc_id(content, source_path=None) -- deterministic 16-char hex ID from SHA-256.
  • generate_chunk_id(doc_id, chunk_index) -- returns "{doc_id}-chunk-{chunk_index}".
  • extract_chunk_metadata(chunk_text, chunk_index, total_chunks, doc_id, doc_metadata=None) -- standard metadata dict with parent_doc_id, chunk_index, total_chunks, word_count, char_count. All three helpers are shown together in the example below.
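
Assuming the helpers are importable from anchor.ingestion like the classes above, they compose like this:

from anchor.ingestion import (
    generate_doc_id,
    generate_chunk_id,
    extract_chunk_metadata,
)

doc_id = generate_doc_id("Astro-context is a modular RAG framework.")
chunk_id = generate_chunk_id(doc_id, 0)  # "<doc_id>-chunk-0"
meta = extract_chunk_metadata(
    chunk_text="Astro-context is a modular RAG framework.",
    chunk_index=0,
    total_chunks=1,
    doc_id=doc_id,
)
print(chunk_id, meta["word_count"], meta["char_count"])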

MetadataEnricher

MetadataEnricher chains multiple enrichment functions, which run in order during ingestion.

from anchor.ingestion import MetadataEnricher

def tag_language(text, idx, total, meta):
    meta["language"] = "en"
    return meta

def add_summary_flag(text, idx, total, meta):
    meta["needs_summary"] = len(text.split()) > 100
    return meta

enricher = MetadataEnricher(enrichers=[tag_language, add_summary_flag])
enricher.add(lambda text, idx, total, meta: {**meta, "version": "1.0"})
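
Continuing the example above, passing the enricher to DocumentIngester applies the whole chain to every chunk it produces:

from anchor.ingestion import DocumentIngester

ingester = DocumentIngester(enricher=enricher)
items = ingester.ingest_text("A short paragraph of documentation text.")
print(items[0].metadata.get("language"), items[0].metadata.get("version"))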

ParentExpander

ParentExpander is a post-processor that expands retrieved child chunks back to their parent text, deduplicating by parent_id.

from anchor.ingestion import ParentExpander
from anchor.pipeline import postprocessor_step

expander = ParentExpander(keep_child=True)
step = postprocessor_step("expand-parents", expander)

[!TIP] Combine ParentChildChunker + ParentExpander for a complete hierarchical retrieval workflow: index small children, retrieve them, then expand to full parent context before the LLM sees them.
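
A sketch of the indexing half of that workflow; the index object is a hypothetical stand-in for a real retriever, while the chunker and expander calls come from this guide:

from anchor.ingestion import ParentChildChunker, ParentExpander
from anchor.pipeline import postprocessor_step

class InMemoryIndex:
    """Hypothetical stand-in for a real retriever index."""
    def __init__(self):
        self.items = []
    def add(self, text, metadata=None):
        self.items.append((text, metadata))

index = InMemoryIndex()
chunker = ParentChildChunker(parent_chunk_size=512, child_chunk_size=128)

# Index the small children; each carries parent_id and parent_text metadata.
for text, meta in chunker.chunk_with_metadata("A long document to split..."):
    index.add(text, metadata=meta)

# At query time, a post-processing step expands retrieved children to parents.
expander = ParentExpander(keep_child=False)
step = postprocessor_step("expand-parents", expander)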


Full Pipeline Example

import math
from anchor.ingestion import (
    DocumentIngester,
    SemanticChunker,
    MetadataEnricher,
)

def embed_fn(texts: list[str]) -> list[list[float]]:
    return [[math.sin(i + c) for c in range(8)] for i, _ in enumerate(texts)]

def add_source(text, idx, total, meta):
    meta["source"] = "user-docs"
    return meta

ingester = DocumentIngester(
    chunker=SemanticChunker(embed_fn=embed_fn, threshold=0.5, chunk_size=256),
    enricher=MetadataEnricher(enrichers=[add_source]),
)

items = ingester.ingest_text(
    "Machine learning models learn patterns from data. "
    "They generalize to unseen examples. "
    "Transformers use self-attention mechanisms. "
    "RAG combines retrieval with generation.",
    doc_id="ml-intro",
)
for item in items:
    print(f"{item.id}: {item.content[:60]}...")
    print(f"  metadata: {item.metadata}")

[!CAUTION] SemanticChunker calls embed_fn once per chunk() invocation, passing every sentence in the document as a single batch. Make sure your embedding function can handle batch sizes equal to the sentence count.
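
If your embedding backend enforces a batch-size limit, a thin wrapper can split the sentence list before forwarding it. A generic sketch, reusing embed_fn from the example above and independent of any particular embedding client:

def batched(embed, batch_size: int = 64):
    """Wrap an embed function so each call receives at most batch_size texts."""
    def wrapper(texts: list[str]) -> list[list[float]]:
        vectors: list[list[float]] = []
        for start in range(0, len(texts), batch_size):
            vectors.extend(embed(texts[start:start + batch_size]))
        return vectors
    return wrapper

chunker = SemanticChunker(embed_fn=batched(embed_fn), threshold=0.5, chunk_size=256)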
