Evaluation API Reference

All classes are importable from `anchor.evaluation`. For usage examples, see the Evaluation Guide.


RetrievalMetrics

Immutable Pydantic model holding metrics for a single retrieval evaluation. All values are bounded in [0.0, 1.0].

```python
from anchor.evaluation import RetrievalMetrics
```

| Field | Type | Description |
| --- | --- | --- |
| `precision_at_k` | `float` | Fraction of retrieved items that are relevant |
| `recall_at_k` | `float` | Fraction of relevant items that were retrieved |
| `f1_at_k` | `float` | Harmonic mean of precision and recall |
| `mrr` | `float` | Reciprocal of the rank of the first relevant item |
| `ndcg` | `float` | Normalized Discounted Cumulative Gain |
| `hit_rate` | `float` | 1.0 if at least one relevant item was retrieved, 0.0 otherwise |

RAGMetrics

Immutable Pydantic model for RAGAS-style RAG evaluation. All values bounded in [0.0, 1.0].

```python
from anchor.evaluation import RAGMetrics
```

| Field | Type | Description |
| --- | --- | --- |
| `faithfulness` | `float` | How faithful the answer is to the provided contexts |
| `answer_relevancy` | `float` | How relevant the answer is to the query |
| `context_precision` | `float` | Precision of retrieved contexts for the query |
| `context_recall` | `float` | Recall of retrieved contexts against ground truth |

EvaluationResult

Combines retrieval and RAG metrics into a single evaluation result.

```python
from anchor.evaluation import EvaluationResult
```

| Field | Type | Description |
| --- | --- | --- |
| `retrieval_metrics` | `RetrievalMetrics \| None` | Retrieval metrics, if computed |
| `rag_metrics` | `RAGMetrics \| None` | RAG metrics, if computed |
| `metadata` | `dict[str, Any]` | Arbitrary metadata for the run |
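
A minimal sketch of assembling a result by hand; the numeric values are made up for illustration, and in normal use PipelineEvaluator.evaluate() constructs this object for you.

```python
from anchor.evaluation import EvaluationResult, RAGMetrics, RetrievalMetrics

# Illustrative values only; either metrics block may be None if it was not computed.
result = EvaluationResult(
    retrieval_metrics=RetrievalMetrics(
        precision_at_k=0.6, recall_at_k=0.75, f1_at_k=0.667,
        mrr=1.0, ndcg=0.82, hit_rate=1.0,
    ),
    rag_metrics=RAGMetrics(
        faithfulness=0.9, answer_relevancy=0.85,
        context_precision=0.7, context_recall=0.8,
    ),
    metadata={"run_id": "example-run"},
)
print(result.retrieval_metrics.ndcg, result.rag_metrics.faithfulness)
```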

RetrievalMetricsCalculator

Computes standard IR metrics from ranked results and a known set of relevant document IDs. No LLM dependencies.

```python
from anchor.evaluation import RetrievalMetricsCalculator
```

Constructor:

```python
RetrievalMetricsCalculator(k: int = 10)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `k` | `int` | `10` | Default cutoff for top-k evaluation |

> [!CAUTION]
> Raises ValueError if k < 1.

evaluate(retrieved, relevant, k=None) -> RetrievalMetrics

```python
def evaluate(
    self,
    retrieved: list[ContextItem],
    relevant: list[str],
    k: int | None = None,
) -> RetrievalMetrics
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `retrieved` | `list[ContextItem]` | -- | Items in ranked order |
| `relevant` | `list[str]` | -- | Ground-truth relevant document IDs |
| `k` | `int \| None` | `None` | Cutoff override; falls back to instance default |

```python
from anchor.evaluation import RetrievalMetricsCalculator
from anchor.models.context import ContextItem, SourceType

calc = RetrievalMetricsCalculator(k=5)
items = [ContextItem(id="a", content="x", source=SourceType.RETRIEVAL)]
metrics = calc.evaluate(items, relevant=["a", "b"], k=3)
print(metrics.precision_at_k)  # 1.0
```

LLMRAGEvaluator

RAGAS-style RAG evaluator driven by user-supplied callback functions. Each callback returns a float in [0.0, 1.0].

```python
from anchor.evaluation import LLMRAGEvaluator
```

Constructor:

```python
LLMRAGEvaluator(
    *,
    faithfulness_fn: Callable[[str, list[str]], float] | None = None,
    relevancy_fn: Callable[[str, str], float] | None = None,
    precision_fn: Callable[[str, list[str]], float] | None = None,
    recall_fn: Callable[[str, list[str], str], float] | None = None,
)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `faithfulness_fn` | `(answer, contexts) -> float` | Grounding check |
| `relevancy_fn` | `(query, answer) -> float` | Relevance check |
| `precision_fn` | `(query, contexts) -> float` | Context precision |
| `recall_fn` | `(query, contexts, ground_truth) -> float` | Context recall |

evaluate(query, answer, contexts, ground_truth=None) -> RAGMetrics

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | `str` | -- | The original user query |
| `answer` | `str` | -- | The generated answer |
| `contexts` | `list[str]` | -- | Context strings fed to the generator |
| `ground_truth` | `str \| None` | `None` | Reference answer for recall |

Dimensions without registered callbacks return 0.0.
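
Below is a minimal sketch of wiring in callbacks. The two scoring functions are placeholder heuristics invented for illustration; a real setup would typically wrap an LLM judge in each callback.

```python
from anchor.evaluation import LLMRAGEvaluator

# Placeholder scorers for illustration; each must return a float in [0.0, 1.0].
def toy_faithfulness(answer: str, contexts: list[str]) -> float:
    return 1.0 if any(answer in ctx for ctx in contexts) else 0.5

def toy_relevancy(query: str, answer: str) -> float:
    return 0.9

evaluator = LLMRAGEvaluator(
    faithfulness_fn=toy_faithfulness,
    relevancy_fn=toy_relevancy,
    # precision_fn and recall_fn are not registered, so those dimensions return 0.0
)

metrics = evaluator.evaluate(
    query="What is vector search?",
    answer="Vector search finds nearest neighbours in embedding space.",
    contexts=["Vector search finds nearest neighbours in embedding space."],
)
print(metrics.faithfulness, metrics.context_precision)  # 1.0 0.0
```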


PipelineEvaluator

Orchestrates retrieval and RAG evaluation into a single result.

```python
from anchor.evaluation import PipelineEvaluator
```

Constructor:

```python
PipelineEvaluator(
    *,
    retrieval_calculator: RetrievalMetricsCalculator | None = None,
    rag_evaluator: LLMRAGEvaluator | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `retrieval_calculator` | `RetrievalMetricsCalculator \| None` | `None` | Defaults to a new instance |
| `rag_evaluator` | `LLMRAGEvaluator \| None` | `None` | Optional LLM-based evaluator |

evaluate_retrieval(retrieved, relevant, k=10) -> RetrievalMetrics -- evaluates retrieval only.

evaluate_rag(query, answer, contexts, ground_truth=None) -> RAGMetrics -- evaluates RAG only.

> [!CAUTION]
> Raises ValueError if no rag_evaluator was configured.

evaluate(...) -> EvaluationResult -- runs both evaluations.

```python
def evaluate(
    self,
    query: str,
    answer: str,
    retrieved: list[ContextItem],
    relevant: list[str],
    contexts: list[str],
    ground_truth: str | None = None,
    k: int = 10,
    metadata: dict[str, Any] | None = None,
) -> EvaluationResult
```
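
A sketch of a combined run, reusing the callback-based `evaluator` from the LLMRAGEvaluator example above; the documents and IDs are made up for illustration.

```python
from anchor.evaluation import PipelineEvaluator, RetrievalMetricsCalculator
from anchor.models.context import ContextItem, SourceType

pipeline = PipelineEvaluator(
    retrieval_calculator=RetrievalMetricsCalculator(k=5),
    rag_evaluator=evaluator,  # the LLMRAGEvaluator configured above
)

retrieved = [
    ContextItem(id="doc-1", content="Vector search finds nearest neighbours.", source=SourceType.RETRIEVAL),
    ContextItem(id="doc-2", content="Unrelated text.", source=SourceType.RETRIEVAL),
]

result = pipeline.evaluate(
    query="What is vector search?",
    answer="Vector search finds nearest neighbours.",
    retrieved=retrieved,
    relevant=["doc-1"],
    contexts=[item.content for item in retrieved],
    k=5,
    metadata={"run": "demo"},
)
print(result.retrieval_metrics.mrr, result.rag_metrics.faithfulness)
```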

EvaluationSample (A/B testing)

A single evaluation sample. Defined in anchor.evaluation.ab_testing.

```python
from anchor.evaluation import EvaluationSample
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | `str` | -- | The query string |
| `relevant_ids` | `list[str]` | `[]` | Relevant document IDs |
| `metadata` | `dict[str, Any]` | `{}` | Arbitrary metadata |

EvaluationDataset (A/B testing)

A collection of EvaluationSample instances.

```python
from anchor.evaluation import EvaluationDataset
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `samples` | `list[EvaluationSample]` | `[]` | The evaluation samples |
| `name` | `str` | `""` | Optional dataset name |
| `metadata` | `dict[str, Any]` | `{}` | Arbitrary metadata |
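
A sketch of building a small dataset; the queries and document IDs are invented for illustration.

```python
from anchor.evaluation import EvaluationDataset, EvaluationSample

dataset = EvaluationDataset(
    name="demo-dataset",
    samples=[
        EvaluationSample(query="What is vector search?", relevant_ids=["doc-1"]),
        EvaluationSample(query="How are embeddings stored?", relevant_ids=["doc-2", "doc-3"]),
    ],
)
print(len(dataset.samples))  # 2
```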

AggregatedMetrics (A/B testing)

Aggregated retrieval metrics across multiple evaluation samples.

```python
from anchor.evaluation import AggregatedMetrics
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `mean_precision` | `float` | `0.0` | Mean precision@k |
| `mean_recall` | `float` | `0.0` | Mean recall@k |
| `mean_f1` | `float` | `0.0` | Mean F1@k |
| `mean_mrr` | `float` | `0.0` | Mean MRR |
| `mean_ndcg` | `float` | `0.0` | Mean NDCG |
| `num_samples` | `int` | `0` | Number of samples evaluated |

ABTestResult

Result of an A/B test comparing two retrievers.

```python
from anchor.evaluation import ABTestResult
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `metrics_a` | `AggregatedMetrics` | -- | Metrics for retriever A |
| `metrics_b` | `AggregatedMetrics` | -- | Metrics for retriever B |
| `winner` | `str` | -- | `"a"`, `"b"`, or `"tie"` |
| `p_value` | `float` | -- | Paired t-test p-value |
| `is_significant` | `bool` | -- | Whether the result is statistically significant |
| `significance_level` | `float` | `0.05` | Threshold for significance |
| `per_metric_comparison` | `dict[str, dict[str, Any]]` | `{}` | Per-metric deltas |
| `metadata` | `dict[str, Any]` | `{}` | Arbitrary metadata |

ABTestRunner

Runs an A/B test comparing two retrievers on a shared dataset.

```python
from anchor.evaluation import ABTestRunner
```

Constructor:

```python
ABTestRunner(evaluator: PipelineEvaluator, dataset: EvaluationDataset)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `evaluator` | `PipelineEvaluator` | Evaluator for computing retrieval metrics |
| `dataset` | `EvaluationDataset` | Shared evaluation dataset |

run(retriever_a, retriever_b, k=10, significance_level=0.05) -> ABTestResult

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `retriever_a` | `Retriever` | -- | First retriever |
| `retriever_b` | `Retriever` | -- | Second retriever |
| `k` | `int` | `10` | Top-k cutoff |
| `significance_level` | `float` | `0.05` | p-value threshold |
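
A sketch of a comparison run, reusing `dataset` from the EvaluationDataset example above; `retriever_a` and `retriever_b` are placeholders for two pre-built Retriever instances (for example, the same retriever with different settings).

```python
from anchor.evaluation import ABTestRunner, PipelineEvaluator

runner = ABTestRunner(evaluator=PipelineEvaluator(), dataset=dataset)

# retriever_a and retriever_b are assumed to already exist.
result = runner.run(retriever_a, retriever_b, k=10, significance_level=0.05)

print(result.winner)  # "a", "b", or "tie"
print(result.p_value, result.is_significant)
print(result.metrics_a.mean_ndcg, result.metrics_b.mean_ndcg)
```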

BatchEvaluator

Runs evaluation over an entire dataset and aggregates results. Importable from anchor.evaluation.batch.

```python
from anchor.evaluation.batch import BatchEvaluator
```

Constructor:

```python
BatchEvaluator(*, evaluator: PipelineEvaluator, retriever: Retriever, top_k: int = 10)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `evaluator` | `PipelineEvaluator` | -- | Per-sample evaluator |
| `retriever` | `Retriever` | -- | Retriever for fetching items |
| `top_k` | `int` | `10` | Items to retrieve per query |

evaluate(dataset, k=10) -> AggregatedMetrics

Returns `AggregatedMetrics` (batch module variant) with `count`, `mean_precision`, `mean_recall`, `mean_f1`, `mean_mrr`, `mean_ndcg`, `mean_hit_rate`, `p95_precision`, `p95_recall`, `min_precision`, `min_recall`, and `per_sample_results`.
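
A sketch of a batch run over the dataset built earlier; `my_retriever` is a placeholder for an existing Retriever instance.

```python
from anchor.evaluation import PipelineEvaluator
from anchor.evaluation.batch import BatchEvaluator

batch = BatchEvaluator(
    evaluator=PipelineEvaluator(),
    retriever=my_retriever,  # assumed to be an existing Retriever instance
    top_k=10,
)

aggregated = batch.evaluate(dataset, k=10)
print(aggregated.mean_precision, aggregated.p95_recall)
```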


HumanJudgment

A single human relevance judgment for a query-document pair.

```python
from anchor.evaluation import HumanJudgment
```

| Field | Type | Constraints | Description |
| --- | --- | --- | --- |
| `query` | `str` | -- | The evaluated query |
| `item_id` | `str` | -- | Document ID being judged |
| `relevance` | `int` | `0 <= x <= 3` | Relevance score |
| `annotator` | `str` | -- | Annotator identifier |
| `metadata` | `dict[str, Any]` | -- | Arbitrary metadata |

HumanEvaluationCollector

Collects human relevance judgments and computes inter-annotator agreement.

```python
from anchor.evaluation import HumanEvaluationCollector
```

Constructor: HumanEvaluationCollector() -- no parameters.

Property: judgments -> list[HumanJudgment] -- copy of all collected judgments.

add_judgment(judgment: HumanJudgment) -> None -- add a single judgment.

add_judgments(judgments: list[HumanJudgment]) -> None -- add multiple judgments.

compute_agreement() -> float -- Cohen's kappa over (query, item_id) pairs judged by at least two annotators. Returns 0.0 if no overlapping judgments exist.

to_dataset(threshold: int = 2) -> EvaluationDataset -- converts judgments into an EvaluationDataset. Items with mean relevance at or above threshold are considered relevant.

compute_metrics() -> dict[str, float] -- returns mean_relevance, agreement, num_judgments, num_annotators, num_queries.
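
A sketch of collecting judgments from two annotators and converting them into a dataset; the queries, IDs, and scores are invented, and `metadata` is assumed to default to an empty dict.

```python
from anchor.evaluation import HumanEvaluationCollector, HumanJudgment

collector = HumanEvaluationCollector()
collector.add_judgments([
    HumanJudgment(query="What is vector search?", item_id="doc-1", relevance=3, annotator="alice"),
    HumanJudgment(query="What is vector search?", item_id="doc-1", relevance=2, annotator="bob"),
    HumanJudgment(query="What is vector search?", item_id="doc-2", relevance=0, annotator="alice"),
])

print(collector.compute_agreement())         # Cohen's kappa over doubly judged pairs
dataset = collector.to_dataset(threshold=2)  # doc-1 counts as relevant (mean relevance 2.5 >= 2)
print(collector.compute_metrics()["num_annotators"])  # 2
```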
