# Ingestion Guide
The ingestion module converts raw documents into ContextItem objects ready for
indexing and retrieval. The pipeline follows four stages:
Parse --> Chunk --> Enrich --> Index

- Parse -- extract text and metadata from files (Markdown, HTML, PDF, plain text).
- Chunk -- split the text into retrieval-sized segments.
- Enrich -- attach metadata (doc IDs, chunk positions, custom fields).
- Index -- feed `ContextItem` objects to a retriever.
## Quick Start

```python
from anchor.ingestion import DocumentIngester

ingester = DocumentIngester()

# Ingest raw text
items = ingester.ingest_text("Astro-context is a modular RAG framework.")
print(items[0].content)

# Ingest a single file
items = ingester.ingest_file("docs/intro.md")

# Ingest an entire directory
items = ingester.ingest_directory("docs/", extensions=[".md", ".txt"])
```

## DocumentIngester
DocumentIngester orchestrates the full pipeline. It auto-detects parsers from
file extensions and delegates chunking to any Chunker implementation.
```python
from anchor.ingestion import (
    DocumentIngester,
    SentenceChunker,
    MetadataEnricher,
)
from anchor.models.context import SourceType

def add_category(text, idx, total, meta):
    meta["category"] = "documentation"
    return meta

ingester = DocumentIngester(
    chunker=SentenceChunker(chunk_size=256, overlap=1),
    enricher=MetadataEnricher(enrichers=[add_category]),
    source_type=SourceType.RETRIEVAL,
    priority=5,
)

items = ingester.ingest_text(
    "First sentence. Second sentence. Third sentence.",
    doc_id="manual-id-001",
)
for item in items:
    print(item.id, item.metadata.get("category"))
```

### Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunker` | `Chunker \| None` | `RecursiveCharacterChunker()` | Chunking strategy |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter for size-aware splitting |
| `parsers` | `dict[str, DocumentParser] \| None` | built-in parsers | Extension-to-parser overrides |
| `enricher` | `MetadataEnricher \| None` | `None` | Chain of metadata enrichment functions |
| `source_type` | `SourceType` | `SourceType.RETRIEVAL` | Source type tag for produced items |
| `priority` | `int` | `5` | Priority value for produced items |
### Methods

- `ingest_text(text, doc_id=None, doc_metadata=None)` -- chunk raw text into `ContextItem` objects.
- `ingest_file(path, doc_id=None)` -- parse and chunk a single file.
- `ingest_directory(directory, glob_pattern="**/*", extensions=None)` -- recursively ingest all matching files.
## Chunkers

All chunkers implement the `Chunker` protocol and expose a
`chunk(text, metadata=None) -> list[str]` method.
### FixedSizeChunker

Splits text into fixed-size chunks measured by token count, with configurable overlap.

```python
from anchor.ingestion import FixedSizeChunker

chunker = FixedSizeChunker(chunk_size=128, overlap=20)
chunks = chunker.chunk("A very long document text...")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `512` | Maximum tokens per chunk |
| `overlap` | `int` | `50` | Overlapping tokens at boundary |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
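The sliding-window behavior can be sketched as follows, using whitespace splitting as a stand-in for the real token counter:

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Whitespace tokenization stands in for the configured tokenizer.
    tokens = text.split()
    step = chunk_size - overlap  # window advances by size minus overlap
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), step)
    ]

chunks = fixed_size_chunks("one two three four five six", chunk_size=4, overlap=2)
```

Each chunk repeats the last `overlap` tokens of its predecessor, which helps retrieval when a fact straddles a chunk boundary.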
### RecursiveCharacterChunker

Splits text using a hierarchy of separators, falling back to finer splits when a
section exceeds the token budget. Separator hierarchy:
`"\n\n"` --> `"\n"` --> `". "` --> `" "`.

```python
from anchor.ingestion import RecursiveCharacterChunker

chunker = RecursiveCharacterChunker(chunk_size=256, overlap=30)
chunks = chunker.chunk("Paragraph one.\n\nParagraph two.\n\nParagraph three.")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `512` | Maximum tokens per chunk |
| `overlap` | `int` | `50` | Overlapping tokens at boundary |
| `separators` | `tuple[str, ...] \| None` | `("\n\n", "\n", ". ", " ")` | Custom separator hierarchy |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
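The fallback logic can be sketched like this (simplified: no overlap, whitespace token counting; the library's implementation may differ in details):

```python
def recursive_split(text, separators, budget, count=lambda t: len(t.split())):
    # If the text fits the token budget (or we are out of separators), keep it whole.
    if count(text) <= budget or not separators:
        return [text]
    head, *rest = separators
    out = []
    for piece in (p for p in text.split(head) if p):
        if count(piece) <= budget:
            out.append(piece)
        else:
            # Piece still too large: fall back to the next, finer separator.
            out.extend(recursive_split(piece, rest, budget, count))
    return out

chunks = recursive_split(
    "Para one.\n\nPara two words here.", ["\n\n", "\n", ". ", " "], 3
)
```

Coarse separators (paragraphs) are tried first, so chunks keep as much natural structure as the budget allows.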
### SentenceChunker

Groups sentences to fill chunks up to the token budget. Overlap is measured in sentences rather than tokens.

```python
from anchor.ingestion import SentenceChunker

chunker = SentenceChunker(chunk_size=256, overlap=1)
chunks = chunker.chunk("First sentence. Second sentence. Third sentence.")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `512` | Maximum tokens per chunk |
| `overlap` | `int` | `1` | Overlapping sentences at boundary |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
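The grouping strategy can be sketched as below (naive regex sentence splitting and whitespace token counting are assumptions; the library's splitter may be more sophisticated):

```python
import re

def sentence_chunks(text: str, chunk_size: int = 512, overlap: int = 1) -> list[str]:
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last `overlap` sentences into the next chunk.
            current = current[-overlap:] if overlap else []
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = sentence_chunks("One two. Three four. Five six.", chunk_size=4, overlap=1)
```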
### SemanticChunker

Splits text at semantic boundaries using embedding similarity. Sentences whose adjacent cosine similarity drops below a threshold are split into separate chunks.

```python
import math

from anchor.ingestion import SemanticChunker

# Deterministic embed function for demonstration
def embed_fn(texts: list[str]) -> list[list[float]]:
    return [
        [math.sin(i + c) for c in range(8)]
        for i, _ in enumerate(texts)
    ]

chunker = SemanticChunker(
    embed_fn=embed_fn,
    threshold=0.5,
    chunk_size=256,
    min_chunk_size=30,
)
chunks = chunker.chunk("Sentence one. Sentence two. Totally different topic here.")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `embed_fn` | `Callable[[list[str]], list[list[float]]]` | required | Embedding function |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
| `threshold` | `float` | `0.5` | Cosine similarity split threshold |
| `chunk_size` | `int` | `512` | Maximum tokens per chunk |
| `min_chunk_size` | `int` | `50` | Minimum tokens; smaller chunks merge |
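The core split decision can be sketched as follows (a simplified model of the threshold test only; the `chunk_size` and `min_chunk_size` constraints are omitted):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences, embed_fn, threshold=0.5):
    # One batched embed_fn call covers every sentence in the document.
    vecs = embed_fn(sentences)
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        # A similarity drop below the threshold starts a new chunk.
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Hand-crafted embeddings: the first two sentences point the same way.
fixed = {"A.": [1.0, 0.0], "B.": [1.0, 0.0], "C.": [0.0, 1.0]}
chunks = semantic_split(["A.", "B.", "C."], lambda batch: [fixed[s] for s in batch])
```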
### CodeChunker

Splits source code at function, class, and top-level definition boundaries using
language-specific regex patterns. Falls back to `RecursiveCharacterChunker` when
no boundaries are detected.

Supported languages: Python, JavaScript, TypeScript, Go, Rust.

```python
from anchor.ingestion import CodeChunker

chunker = CodeChunker(language="python", chunk_size=256)

code = '''
def hello():
    print("hello")

def world():
    print("world")

class Greeter:
    pass
'''
chunks = chunker.chunk(code)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | `str \| None` | `None` | Language name; auto-detected from metadata |
| `chunk_size` | `int` | `512` | Maximum tokens per chunk |
| `overlap` | `int` | `50` | Overlap tokens for fallback chunker |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
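For intuition, a Python boundary split of this kind can be approximated with a single lookahead regex (the library's actual patterns are richer, e.g. handling decorators and async defs):

```python
import re

def split_python_defs(code: str) -> list[str]:
    # Split at lines that begin a top-level def or class (simplified pattern).
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", code)
    return [p for p in parts if p.strip()]

chunks = split_python_defs(
    "def hello():\n    pass\n\ndef world():\n    pass\n\nclass Greeter:\n    pass\n"
)
```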
### TableAwareChunker

Detects markdown and HTML tables, preserves them as atomic units, and delegates prose to an inner chunker. Oversized tables are split row-by-row with the header preserved.

```python
from anchor.ingestion import TableAwareChunker

chunker = TableAwareChunker(chunk_size=256)

text = """
Some introductory text.

| Name  | Value |
|-------|-------|
| alpha | 1     |
| beta  | 2     |

More prose after the table.
"""
chunks = chunker.chunk(text)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `inner_chunker` | `Any \| None` | `RecursiveCharacterChunker()` | Chunker for non-table text |
| `chunk_size` | `int` | `512` | Maximum tokens per chunk |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
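The table/prose segmentation step can be sketched like this for markdown tables (HTML detection and row-wise splitting of oversized tables are omitted):

```python
def split_tables(text: str) -> list[tuple[str, str]]:
    # Group consecutive lines into ("table", block) or ("prose", block) pairs;
    # a markdown table row is approximated as any line starting with '|'.
    blocks, current, in_table = [], [], False
    for line in text.splitlines():
        is_row = line.lstrip().startswith("|")
        if current and is_row != in_table:
            blocks.append(("table" if in_table else "prose", "\n".join(current)))
            current = []
        in_table = is_row
        current.append(line)
    if current:
        blocks.append(("table" if in_table else "prose", "\n".join(current)))
    return blocks

blocks = split_tables("Intro.\n| a | b |\n|---|---|\n| 1 | 2 |\nOutro.")
```

Table blocks then pass through intact while prose blocks go to the inner chunker.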
### ParentChildChunker

Two-level hierarchical chunker that produces large parent chunks for context and
small child chunks for retrieval. Use `chunk_with_metadata()` to get child chunks
with `parent_id` and `parent_text` in their metadata.

```python
from anchor.ingestion import ParentChildChunker

chunker = ParentChildChunker(
    parent_chunk_size=512,
    child_chunk_size=128,
    parent_overlap=50,
    child_overlap=10,
)

# Plain string chunks (Chunker protocol)
children = chunker.chunk("A long document to split hierarchically...")

# Chunks with metadata (includes parent_id, parent_text)
children_with_meta = chunker.chunk_with_metadata("A long document...")
for text, meta in children_with_meta:
    print(meta["parent_id"], len(text))
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `parent_chunk_size` | `int` | `1024` | Token size for parent chunks |
| `child_chunk_size` | `int` | `256` | Token size for child chunks |
| `parent_overlap` | `int` | `100` | Token overlap between parent chunks |
| `child_overlap` | `int` | `25` | Token overlap between child chunks |
| `tokenizer` | `Tokenizer \| None` | default counter | Token counter |
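The two-level structure can be sketched as below (no overlap, whitespace tokens, and illustrative `parent-N` ids; the library's id scheme is an assumption):

```python
def parent_child(text: str, parent_size: int = 1024, child_size: int = 256):
    # Cut parents first, then cut each parent into children that point back
    # at it, mirroring the (child_text, {"parent_id": ...}) pairing.
    tokens = text.split()
    pairs = []
    for p, i in enumerate(range(0, len(tokens), parent_size)):
        parent = tokens[i:i + parent_size]
        parent_id = f"parent-{p}"
        for j in range(0, len(parent), child_size):
            pairs.append((" ".join(parent[j:j + child_size]), parent_id))
    return pairs

pairs = parent_child("a b c d e f g h i j", parent_size=8, child_size=3)
```

Only the small children are indexed; the `parent_id` link lets a post-processor recover the wider context later.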
## Parsers

Parsers implement the `DocumentParser` protocol and return `(text, metadata)` tuples.

| Parser | Extensions | Dependencies |
|---|---|---|
| `PlainTextParser` | `.txt` | none |
| `MarkdownParser` | `.md`, `.markdown` | none |
| `HTMLParser` | `.html`, `.htm` | none |
| `PDFParser` | `.pdf` | `pypdf` |

> [!NOTE]
> `PDFParser` requires the optional extra: `pip install astro-anchor[pdf]`.

`DocumentIngester` auto-selects the parser by file extension. Override via the `parsers` constructor argument:

```python
from anchor.ingestion import DocumentIngester, PlainTextParser

ingester = DocumentIngester(
    parsers={".log": PlainTextParser()},
)
```

## Metadata
### Helper Functions

- `generate_doc_id(content, source_path=None)` -- deterministic 16-char hex ID from SHA-256.
- `generate_chunk_id(doc_id, chunk_index)` -- returns `"{doc_id}-chunk-{chunk_index}"`.
- `extract_chunk_metadata(chunk_text, chunk_index, total_chunks, doc_id, doc_metadata=None)` -- standard metadata dict with `parent_doc_id`, `chunk_index`, `total_chunks`, `word_count`, `char_count`.
### MetadataEnricher

Chain multiple enrichment functions that run in order during ingestion.

```python
from anchor.ingestion import MetadataEnricher

def tag_language(text, idx, total, meta):
    meta["language"] = "en"
    return meta

def add_summary_flag(text, idx, total, meta):
    meta["needs_summary"] = len(text.split()) > 100
    return meta

enricher = MetadataEnricher(enrichers=[tag_language, add_summary_flag])
enricher.add(lambda text, idx, total, meta: {**meta, "version": "1.0"})
```

## ParentExpander
`ParentExpander` is a post-processor that expands retrieved child chunks back to
their parent text, deduplicating by `parent_id`.

```python
from anchor.ingestion import ParentExpander
from anchor.pipeline import postprocessor_step

expander = ParentExpander(keep_child=True)
step = postprocessor_step("expand-parents", expander)
```

> [!TIP]
> Combine `ParentChildChunker` + `ParentExpander` for a complete hierarchical retrieval workflow: index small children, retrieve them, then expand to full parent context before the LLM sees them.
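The expand-and-deduplicate step can be sketched like this (a simplified model ignoring the `keep_child` option; the metadata shape matches what `chunk_with_metadata()` produces):

```python
def expand_to_parents(children):
    # children: (child_text, metadata) pairs whose metadata carries
    # parent_id and parent_text. Each parent is emitted at most once,
    # in the order its first child was retrieved.
    seen, parents = set(), []
    for _, meta in children:
        pid = meta["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(meta["parent_text"])
    return parents

parents = expand_to_parents([
    ("child a", {"parent_id": "p0", "parent_text": "Parent zero text"}),
    ("child b", {"parent_id": "p0", "parent_text": "Parent zero text"}),
    ("child c", {"parent_id": "p1", "parent_text": "Parent one text"}),
])
```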
## Full Pipeline Example

```python
import math

from anchor.ingestion import (
    DocumentIngester,
    SemanticChunker,
    MetadataEnricher,
)

def embed_fn(texts: list[str]) -> list[list[float]]:
    return [[math.sin(i + c) for c in range(8)] for i, _ in enumerate(texts)]

def add_source(text, idx, total, meta):
    meta["source"] = "user-docs"
    return meta

ingester = DocumentIngester(
    chunker=SemanticChunker(embed_fn=embed_fn, threshold=0.5, chunk_size=256),
    enricher=MetadataEnricher(enrichers=[add_source]),
)

items = ingester.ingest_text(
    "Machine learning models learn patterns from data. "
    "They generalize to unseen examples. "
    "Transformers use self-attention mechanisms. "
    "RAG combines retrieval with generation.",
    doc_id="ml-intro",
)
for item in items:
    print(f"{item.id}: {item.content[:60]}...")
    print(f"  metadata: {item.metadata}")
```

> [!CAUTION]
> `SemanticChunker` calls `embed_fn` once per `chunk()` invocation for all sentences in the document. Make sure your embedding function can handle batch sizes equal to the sentence count.