Ingestion API Reference

API reference for the anchor.ingestion module. For usage patterns and examples, see the Ingestion Guide.


DocumentIngester

Orchestrates document parsing, chunking, and metadata extraction. Converts raw files or text into ContextItem objects suitable for retriever.index(items).

```python
class DocumentIngester(
    chunker: Chunker | None = None,
    tokenizer: Tokenizer | None = None,
    parsers: dict[str, DocumentParser] | None = None,
    enricher: MetadataEnricher | None = None,
    source_type: SourceType = SourceType.RETRIEVAL,
    priority: int = 5,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunker | Chunker \| None | RecursiveCharacterChunker() | Chunking strategy |
| tokenizer | Tokenizer \| None | default counter | Token counter for size-aware splitting |
| parsers | dict[str, DocumentParser] \| None | built-in parser map | Extension-to-parser overrides |
| enricher | MetadataEnricher \| None | None | Chain of metadata enrichment functions |
| source_type | SourceType | SourceType.RETRIEVAL | Source type tag for produced items |
| priority | int | 5 | Priority value for produced items |

Methods

ingest_text(text, doc_id=None, doc_metadata=None)

Ingest raw text into context items.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | str | required | Document text to ingest |
| doc_id | str \| None | None | Document ID; generated if not provided |
| doc_metadata | dict[str, Any] \| None | None | Document-level metadata |

Returns: list[ContextItem]

ingest_file(path, doc_id=None)

Parse and chunk a single file. The parser is auto-detected from file extension.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| path | Path \| str | required | Path to the file |
| doc_id | str \| None | None | Document ID; generated if not provided |

Returns: list[ContextItem]

Raises: IngestionError if no parser is found; FileNotFoundError if the file is missing.

ingest_directory(directory, glob_pattern="**/*", extensions=None)

Recursively ingest all matching files in a directory.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| directory | Path \| str | required | Root directory to scan |
| glob_pattern | str | "**/*" | Glob pattern for file discovery |
| extensions | list[str] \| None | None | Filter by extensions; None = all registered |

Returns: list[ContextItem]

Raises: IngestionError if the directory does not exist.
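The full ingest flow (ID generation, chunking, per-chunk metadata) can be sketched end to end. This is a standalone illustration of the documented behavior, with plain dicts standing in for ContextItem and a naive paragraph split standing in for the configured chunker; it is not the library's implementation.

```python
from __future__ import annotations

import hashlib
from typing import Any


def sketch_ingest_text(
    text: str,
    doc_id: str | None = None,
    doc_metadata: dict[str, Any] | None = None,
) -> list[dict[str, Any]]:
    """Standalone sketch of DocumentIngester.ingest_text (dicts stand in for ContextItem)."""
    # Deterministic 16-char hex ID, as documented for generate_doc_id.
    if doc_id is None:
        doc_id = hashlib.sha256(text.encode()).hexdigest()[:16]
    # Naive paragraph split stands in for the configured Chunker.
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    items = []
    for i, chunk in enumerate(chunks):
        items.append({
            "id": f"{doc_id}-chunk-{i}",  # generate_chunk_id format
            "content": chunk,
            "metadata": {
                "parent_doc_id": doc_id,
                "chunk_index": i,
                "total_chunks": len(chunks),
                # Document metadata propagates with a doc_ prefix.
                **{f"doc_{k}": v for k, v in (doc_metadata or {}).items()},
            },
        })
    return items


items = sketch_ingest_text("First paragraph.\n\nSecond paragraph.")
```

The resulting dicts carry the same id and metadata shape described for the real ContextItem output.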


Chunkers

All chunkers implement the Chunker protocol:

```python
def chunk(self, text: str, metadata: dict[str, Any] | None = None) -> list[str]
```
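Any object with a matching chunk method satisfies the protocol, so custom strategies can be passed to DocumentIngester. A minimal, hypothetical example that splits on blank lines:

```python
from __future__ import annotations

from typing import Any


class ParagraphChunker:
    """Hypothetical chunker: satisfies the Chunker protocol by splitting on blank lines."""

    def chunk(self, text: str, metadata: dict[str, Any] | None = None) -> list[str]:
        return [p.strip() for p in text.split("\n\n") if p.strip()]


chunks = ParagraphChunker().chunk("Intro paragraph.\n\nBody paragraph.")
# chunks == ["Intro paragraph.", "Body paragraph."]
```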

FixedSizeChunker

Split text into fixed-size chunks by token count with overlap.

```python
class FixedSizeChunker(
    chunk_size: int = 512,
    overlap: int = 50,
    tokenizer: Tokenizer | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | int | 512 | Maximum tokens per chunk |
| overlap | int | 50 | Overlapping tokens at boundary |
| tokenizer | Tokenizer \| None | default counter | Token counter |

RecursiveCharacterChunker

Split text using a hierarchy of separators, falling back to finer splits.

```python
class RecursiveCharacterChunker(
    chunk_size: int = 512,
    overlap: int = 50,
    separators: tuple[str, ...] | None = None,
    tokenizer: Tokenizer | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | int | 512 | Maximum tokens per chunk |
| overlap | int | 50 | Overlapping tokens |
| separators | tuple[str, ...] \| None | ("\n\n", "\n", ". ", " ") | Separator hierarchy |
| tokenizer | Tokenizer \| None | default counter | Token counter |

SentenceChunker

Split text at sentence boundaries, grouping sentences to fill chunks. Overlap is measured in sentences.

```python
class SentenceChunker(
    chunk_size: int = 512,
    overlap: int = 1,
    tokenizer: Tokenizer | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | int | 512 | Maximum tokens per chunk |
| overlap | int | 1 | Overlapping sentences |
| tokenizer | Tokenizer \| None | default counter | Token counter |

SemanticChunker

Split text at semantic boundaries using embedding similarity.

```python
class SemanticChunker(
    embed_fn: Callable[[list[str]], list[list[float]]],
    tokenizer: Tokenizer | None = None,
    threshold: float = 0.5,
    chunk_size: int = 512,
    min_chunk_size: int = 50,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| embed_fn | Callable[[list[str]], list[list[float]]] | required | Batch embedding function |
| tokenizer | Tokenizer \| None | default counter | Token counter |
| threshold | float | 0.5 | Cosine similarity split threshold |
| chunk_size | int | 512 | Maximum tokens per chunk |
| min_chunk_size | int | 50 | Minimum tokens per chunk; smaller chunks are merged |
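The split rule can be illustrated with a toy embedding function. This sketch shows only the documented idea (start a new chunk where adjacent-sentence cosine similarity falls below the threshold); the sentence splitting, batching, and minimum-size merging in the real class are more involved.

```python
from __future__ import annotations

import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_split(
    sentences: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    threshold: float = 0.5,
) -> list[str]:
    """Sketch: break between sentences whose embeddings diverge."""
    vecs = embed_fn(sentences)
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks


# Toy 2-D embeddings: the dog sentences cluster together, the tax one does not.
toy = {"Dogs bark.": [1.0, 0.0], "Puppies play.": [0.9, 0.1], "Taxes are due.": [0.0, 1.0]}
result = semantic_split(list(toy), lambda batch: [toy[s] for s in batch], threshold=0.5)
```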

CodeChunker

Split source code at function, class, and definition boundaries.

```python
class CodeChunker(
    language: str | None = None,
    chunk_size: int = 512,
    overlap: int = 50,
    tokenizer: Tokenizer | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| language | str \| None | None | Language name; auto-detected from metadata |
| chunk_size | int | 512 | Maximum tokens per chunk |
| overlap | int | 50 | Overlap tokens for fallback chunker |
| tokenizer | Tokenizer \| None | default counter | Token counter |

Supported languages: python, javascript, typescript, go, rust.
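A rough sketch of definition-boundary splitting for Python conveys the idea; it uses a regex on top-level def/class lines rather than the library's actual language-aware parsing.

```python
import re


def split_python_defs(source: str) -> list[str]:
    """Sketch: start a new chunk at each top-level def/class boundary."""
    chunks, current = [], []
    for line in source.splitlines():
        if re.match(r"(def|class)\s", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
parts = split_python_defs(code)
# parts[0] holds def a(), parts[1] holds def b()
```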

TableAwareChunker

Preserve tables as atomic chunks while delegating prose to an inner chunker.

```python
class TableAwareChunker(
    inner_chunker: Any | None = None,
    chunk_size: int = 512,
    tokenizer: Tokenizer | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inner_chunker | Any \| None | RecursiveCharacterChunker() | Chunker for non-table text |
| chunk_size | int | 512 | Maximum tokens per chunk |
| tokenizer | Tokenizer \| None | default counter | Token counter |

ParentChildChunker

Two-level hierarchical chunker producing large parent and small child chunks.

```python
class ParentChildChunker(
    parent_chunk_size: int = 1024,
    child_chunk_size: int = 256,
    parent_overlap: int = 100,
    child_overlap: int = 25,
    tokenizer: Tokenizer | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| parent_chunk_size | int | 1024 | Token size for parent chunks |
| child_chunk_size | int | 256 | Token size for child chunks |
| parent_overlap | int | 100 | Token overlap between parents |
| child_overlap | int | 25 | Token overlap between children |
| tokenizer | Tokenizer \| None | default counter | Token counter |

Methods

- chunk(text, metadata=None) -- Returns child chunk texts only (list[str]).
- chunk_with_metadata(text, metadata=None) -- Returns list[tuple[str, dict[str, Any]]] with parent_id, parent_text, parent_index, child_index, and is_child_chunk in each metadata dict.
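The shape of the chunk_with_metadata output can be sketched with paragraphs as parents and sentences as children. The parent-ID scheme below is hypothetical; only the metadata keys follow the documentation.

```python
from __future__ import annotations

from typing import Any


def sketch_parent_child(text: str, doc_id: str = "doc") -> list[tuple[str, dict[str, Any]]]:
    """Sketch of the documented chunk_with_metadata output shape."""
    results = []
    parents = [p for p in text.split("\n\n") if p.strip()]
    for p_idx, parent in enumerate(parents):
        parent_id = f"{doc_id}-parent-{p_idx}"  # hypothetical ID scheme
        children = [s.strip() + "." for s in parent.split(".") if s.strip()]
        for c_idx, child in enumerate(children):
            results.append((child, {
                "parent_id": parent_id,
                "parent_text": parent,
                "parent_index": p_idx,
                "child_index": c_idx,
                "is_child_chunk": True,
            }))
    return results


pairs = sketch_parent_child("One. Two.\n\nThree.")
```

Each child carries its full parent text, which is what ParentExpander later swaps back in.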

Parsers

All parsers implement the DocumentParser protocol:

```python
def parse(self, source: Path | bytes) -> tuple[str, dict[str, Any]]
```
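A parser is any object with a matching parse method, so unsupported formats can be handled by supplying a custom parser in the parsers map. A minimal, hypothetical CSV example:

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


class CSVParser:
    """Hypothetical parser satisfying the DocumentParser protocol:
    accepts a Path or raw bytes and returns (text, metadata)."""

    def parse(self, source: Path | bytes) -> tuple[str, dict[str, Any]]:
        if isinstance(source, Path):
            raw, name = source.read_bytes(), source.name
        else:
            raw, name = source, ""
        text = raw.decode("utf-8")
        return text, {"filename": name, "row_count": text.count("\n") + 1}


text, meta = CSVParser().parse(b"a,b\n1,2")
```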

PlainTextParser

Parse plain text files. Supported extensions: .txt.

Metadata produced: filename, extension, line_count.

MarkdownParser

Parse Markdown files, extracting headings and detecting frontmatter. Supported extensions: .md, .markdown.

Metadata produced: filename, extension, title (first H1), headings list, has_frontmatter.

HTMLParser

Parse HTML files, stripping tags and extracting text. Uses Python's stdlib html.parser with zero external dependencies. Supported extensions: .html, .htm.

Metadata produced: filename, extension, title.

PDFParser

Parse PDF files using pypdf. Supported extensions: .pdf.

> [!NOTE]
> Requires the pdf optional extra: pip install astro-anchor[pdf]. Raises IngestionError if pypdf is not installed.

Metadata produced: filename, extension, page_count, title, author.


Metadata Functions

generate_doc_id(content, source_path=None)

Generate a deterministic 16-character hex document ID from SHA-256.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| content | str | required | Full document text |
| source_path | str \| None | None | File path used as uniqueness salt |

Returns: str -- 16-character hex string.

generate_chunk_id(doc_id, chunk_index)

Generate a chunk ID in the format "{doc_id}-chunk-{chunk_index}".

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| doc_id | str | required | Parent document ID |
| chunk_index | int | required | Zero-based chunk index |

Returns: str
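Both ID helpers can be sketched from their documented behavior. The path-salting scheme below is an assumption; only the SHA-256 hash, 16-character truncation, and chunk-ID format are documented.

```python
from __future__ import annotations

import hashlib


def sketch_doc_id(content: str, source_path: str | None = None) -> str:
    """Sketch: SHA-256 over the content (salted with the path when given),
    truncated to 16 hex characters. The exact salting is an assumption."""
    payload = content if source_path is None else f"{source_path}:{content}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def sketch_chunk_id(doc_id: str, chunk_index: int) -> str:
    # Documented format: "{doc_id}-chunk-{chunk_index}"
    return f"{doc_id}-chunk-{chunk_index}"


did = sketch_doc_id("Hello world")
cid = sketch_chunk_id(did, 3)
```

Because the ID is a pure hash of the input, re-ingesting identical content yields the same ID, which keeps indexing idempotent.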

extract_chunk_metadata(chunk_text, chunk_index, total_chunks, doc_id, doc_metadata=None)

Build standard metadata for a single chunk.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_text | str | required | Chunk text content |
| chunk_index | int | required | Zero-based chunk position |
| total_chunks | int | required | Total chunks in document |
| doc_id | str | required | Parent document ID |
| doc_metadata | dict[str, Any] \| None | None | Document-level metadata |

Returns: dict[str, Any] with keys: parent_doc_id, chunk_index, total_chunks, word_count, char_count, plus propagated doc metadata (prefixed with doc_).
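A sketch of the documented output shape, including the doc_ prefixing of propagated document metadata:

```python
from __future__ import annotations

from typing import Any


def sketch_chunk_metadata(
    chunk_text: str,
    chunk_index: int,
    total_chunks: int,
    doc_id: str,
    doc_metadata: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Sketch of the documented keys; not the library's implementation."""
    meta = {
        "parent_doc_id": doc_id,
        "chunk_index": chunk_index,
        "total_chunks": total_chunks,
        "word_count": len(chunk_text.split()),
        "char_count": len(chunk_text),
    }
    # Document-level metadata is propagated under a doc_ prefix.
    for key, value in (doc_metadata or {}).items():
        meta[f"doc_{key}"] = value
    return meta


meta = sketch_chunk_metadata("two words", 0, 3, "abc123", {"title": "Guide"})
```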


MetadataEnricher

Chain of user-provided metadata enrichment functions.

```python
class MetadataEnricher(
    enrichers: list[Callable[[str, int, int, dict[str, Any]], dict[str, Any]]] | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enrichers | list[Callable] \| None | None | Initial list of enricher functions |

Each enricher callable has the signature: (text: str, chunk_index: int, total_chunks: int, metadata: dict) -> dict

Methods

add(fn)

Register an additional enricher function.

enrich(text, chunk_index, total_chunks, metadata)

Run all enrichers in order, threading metadata through. Returns the enriched metadata dict.
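The chaining semantics can be sketched with two toy enricher functions; each receives the metadata produced so far and returns an updated dict.

```python
def add_language(text, chunk_index, total_chunks, metadata):
    # Hypothetical enricher: tag every chunk with a language.
    return {**metadata, "language": "en"}


def add_position(text, chunk_index, total_chunks, metadata):
    # Hypothetical enricher: mark the first chunk of the document.
    return {**metadata, "is_first": chunk_index == 0}


def run_enrichers(enrichers, text, chunk_index, total_chunks, metadata):
    """Sketch of the documented enrich behavior: thread metadata through each function."""
    for fn in enrichers:
        metadata = fn(text, chunk_index, total_chunks, metadata)
    return metadata


out = run_enrichers([add_language, add_position], "Hello", 0, 4, {"source": "doc"})
```

Because each enricher sees its predecessors' output, ordering in the list matters.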


ParentExpander

Post-processor that expands child chunks back to parent text, deduplicating by parent_id.

```python
class ParentExpander(
    keep_child: bool = False,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| keep_child | bool | False | Keep original child content in original_child_content metadata |

Methods

process(items, query=None)

Expand child chunks to parent text. Items with is_child_chunk in metadata have their content replaced with parent_text. Multiple children from the same parent are deduplicated (first occurrence wins).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| items | list[ContextItem] | required | Items to post-process |
| query | QueryBundle \| None | None | Original query (unused) |

Returns: list[ContextItem]
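The expansion and deduplication behavior can be sketched with dicts standing in for ContextItem:

```python
from __future__ import annotations

from typing import Any


def sketch_parent_expand(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Sketch of the documented process: replace child content with parent_text,
    keeping only the first child seen per parent_id."""
    seen: set[str] = set()
    out = []
    for item in items:
        meta = item["metadata"]
        if meta.get("is_child_chunk"):
            pid = meta["parent_id"]
            if pid in seen:
                continue  # later children of the same parent are dropped
            seen.add(pid)
            item = {**item, "content": meta["parent_text"]}
        out.append(item)
    return out


children = [
    {"content": "child A1", "metadata": {"is_child_chunk": True, "parent_id": "p1", "parent_text": "full parent one"}},
    {"content": "child A2", "metadata": {"is_child_chunk": True, "parent_id": "p1", "parent_text": "full parent one"}},
]
expanded = sketch_parent_expand(children)
```

Two children of parent p1 collapse into one item carrying the full parent text, matching the first-occurrence-wins rule.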
