Evaluation Module

Evaluation utilities for LLM applications.

This module provides comprehensive evaluation tools for assessing LLM outputs:

Ground Truth Metrics:

calculate_bleu() - BLEU score for n-gram overlap
calculate_rouge() - ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
calculate_meteor() - METEOR score with precision, recall, and word order
calculate_exact_match() - Exact match evaluation
calculate_f1_score() - Token-level F1 score
calculate_semantic_similarity() - Semantic similarity between texts

Quality Assessment:

assess_coherence() - Coherence and logical flow
assess_fluency() - Fluency and naturalness
assess_faithfulness() - Faithfulness to source material
assess_answer_relevance() - Answer relevance to question
detect_hallucination() - Detect unfounded claims

LLM-as-Judge:

llm_as_judge() - Use LLM to judge output quality
pairwise_comparison() - Compare two outputs using LLM

A/B Testing:

ab_test() - Statistical A/B testing of outputs
compare_outputs() - Multi-output comparison with rankings

Benchmarking:

run_benchmark() - Run benchmark on test cases
benchmark_prompts() - Benchmark multiple prompts

Statistical Analysis:

calculate_statistics() - Statistical measures (mean, median, stdev, etc.)
confidence_interval() - Confidence intervals for scores

Data Classes:

EvaluationResult - Result with score and details
ComparisonResult - Result of comparing outputs
BenchmarkResult - Benchmark run results
TestCase - Test case definition

Enums:

EvaluationMetric - Standard evaluation metrics
JudgmentCriterion - Criteria for LLM-as-judge

Examples

>>> # Common usage - core classes and metrics
>>> from kerb.evaluation import EvaluationResult, calculate_bleu, calculate_rouge
>>>
>>> # Specialized imports - metrics module
>>> from kerb.evaluation.metrics import (
...     calculate_meteor,
...     calculate_semantic_similarity
... )
>>>
>>> # Quality assessment
>>> from kerb.evaluation.quality import (
...     assess_coherence,
...     detect_hallucination
... )
>>>
>>> # LLM-as-judge
>>> from kerb.evaluation.judges import llm_as_judge, pairwise_comparison
>>>
>>> # Benchmarking
>>> from kerb.evaluation.benchmarks import run_benchmark

class kerb.evaluation.EvaluationResult(metric, score, details=<factory>, passed=None)

Bases: object

Result of an evaluation with score and details.

metric: str

score: float

details: Dict[str, Any]

passed: bool | None = None

__init__(metric, score, details=<factory>, passed=None)

class kerb.evaluation.ComparisonResult(output_a_id, output_b_id, winner, scores, confidence=0.0, reasoning='')

Bases: object

Result of comparing two outputs.

output_a_id: str

output_b_id: str

winner: str | None

scores: Dict[str, float]

confidence: float = 0.0

reasoning: str = ''

__init__(output_a_id, output_b_id, winner, scores, confidence=0.0, reasoning='')

class kerb.evaluation.BenchmarkResult(name, total_tests, passed_tests, failed_tests, average_score, scores, execution_time=0.0, details=<factory>)

Bases: object

Result of a benchmark run.

name: str

total_tests: int

passed_tests: int

failed_tests: int

average_score: float

scores: List[float]

execution_time: float = 0.0

details: Dict[str, Any]

property pass_rate: float: Calculate pass rate percentage.

__init__(name, total_tests, passed_tests, failed_tests, average_score, scores, execution_time=0.0, details=<factory>)
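The `<factory>` default and the `pass_rate` property can be pictured with a small dataclass. This is an illustrative sketch only, not the library's source: the class name `BenchmarkResultSketch` is hypothetical, and returning 0.0 when there are no tests is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class BenchmarkResultSketch:
    """Hypothetical stand-in that mirrors the documented fields."""

    name: str
    total_tests: int
    passed_tests: int
    failed_tests: int
    average_score: float
    scores: List[float]
    execution_time: float = 0.0
    # default_factory gives each instance its own dict
    # (shown as "<factory>" in the rendered signature).
    details: Dict[str, Any] = field(default_factory=dict)

    @property
    def pass_rate(self) -> float:
        """Pass rate as a percentage; assumed to be 0.0 when no tests ran."""
        if self.total_tests == 0:
            return 0.0
        return 100.0 * self.passed_tests / self.total_tests


result = BenchmarkResultSketch(
    name="demo",
    total_tests=4,
    passed_tests=3,
    failed_tests=1,
    average_score=0.8,
    scores=[0.9, 0.9, 0.9, 0.5],
)
print(result.pass_rate)  # 75.0
```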

class kerb.evaluation.TestCase(id, input, expected_output=None, metadata=<factory>, reference_outputs=<factory>)

Bases: object

A single test case for evaluation.

id: str

input: str

expected_output: str | None = None

metadata: Dict[str, Any]

reference_outputs: List[str]

__init__(id, input, expected_output=None, metadata=<factory>, reference_outputs=<factory>)

class kerb.evaluation.EvaluationMetric(*values)

Bases: Enum

Standard evaluation metrics.

BLEU = 'bleu'

ROUGE_1 = 'rouge-1'

ROUGE_2 = 'rouge-2'

ROUGE_L = 'rouge-l'

METEOR = 'meteor'

BERTSCORE = 'bertscore'

EXACT_MATCH = 'exact_match'

F1 = 'f1'

SEMANTIC_SIMILARITY = 'semantic_similarity'

class kerb.evaluation.JudgmentCriterion(*values)

Bases: Enum

Criteria for LLM-as-judge evaluation.

RELEVANCE = 'relevance'

ACCURACY = 'accuracy'

COMPLETENESS = 'completeness'

COHERENCE = 'coherence'

FLUENCY = 'fluency'

HELPFULNESS = 'helpfulness'

HARMLESSNESS = 'harmlessness'

FAITHFULNESS = 'faithfulness'

CONSISTENCY = 'consistency'

kerb.evaluation.calculate_bleu(candidate, reference, n=4, weights=None)

Calculate BLEU score between candidate and reference text(s).

BLEU (Bilingual Evaluation Understudy) measures n-gram precision of the candidate against the reference, with a brevity penalty for candidates shorter than the reference.

Parameters:

candidate (str) – The generated text to evaluate
reference (Union[str, List[str]]) – Reference text(s) (ground truth)
n (int) – Maximum n-gram length (default: 4 for BLEU-4)
weights (Optional[List[float]]) – Weights for each n-gram (default: equal weights)

Returns:

BLEU score between 0 and 1

Return type:

float

Example

>>> calculate_bleu("the cat sat", "the cat sat on mat")
0.7598
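To make the two ingredients of BLEU concrete (clipped n-gram precision and the brevity penalty), here is a minimal, unsmoothed, single-reference sketch. It is not the library's implementation: `bleu_sketch` and `ngrams` are hypothetical names, weights are assumed uniform, and because no smoothing is applied its values need not match those returned by `calculate_bleu()`.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU with uniform weights."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            # Without smoothing, one zero precision zeroes the whole score.
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 when the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu_sketch("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(bleu_sketch("the cat sat", "the cat sat on mat", max_n=2))
```

In the second call both precisions are 1, so the score equals the brevity penalty exp(1 - 5/3) alone.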

kerb.evaluation.calculate_rouge(candidate, reference, rouge_type='rouge-l', beta=1.2)

Calculate ROUGE scores between candidate and reference text(s).

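ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference is recovered by the candidate: unigram and bigram overlap for ROUGE-1 and ROUGE-2, and the longest common subsequence for ROUGE-L.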

Parameters:

candidate (str) – The generated text to evaluate
reference (Union[str, List[str]]) – Reference text(s) (ground truth)
rouge_type (str) – Type of ROUGE - “rouge-1”, “rouge-2”, “rouge-l”
beta (float) – Beta parameter for the F-measure (default: 1.2; values above 1 weight recall more heavily than precision)

Returns:

Dictionary with ‘precision’, ‘recall’, ‘fmeasure’ scores

Return type:

Dict[str, float]

Example

>>> calculate_rouge("the cat sat", "the cat sat on mat", "rouge-1")
{'precision': 1.0, 'recall': 0.6, 'fmeasure': 0.75}
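The three returned values can be worked through by hand for ROUGE-1. The sketch below uses the standard F-beta formula, F = (1 + beta^2) * P * R / (R + beta^2 * P); `rouge1_sketch` is a hypothetical name, and whether the library applies the formula in exactly this form is an assumption. With beta = 1 it reproduces the figures in the example above, while a beta above 1 pulls the F-measure toward recall (about 0.718 for beta = 1.2 on the same inputs).

```python
from collections import Counter
from typing import Dict


def rouge1_sketch(candidate: str, reference: str, beta: float = 1.0) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F-beta (illustrative only)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        b2 = beta ** 2
        fmeasure = (1 + b2) * precision * recall / (recall + b2 * precision)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


balanced = rouge1_sketch("the cat sat", "the cat sat on mat")
recall_weighted = rouge1_sketch("the cat sat", "the cat sat on mat", beta=1.2)
print({k: round(v, 3) for k, v in balanced.items()})
print(round(recall_weighted["fmeasure"], 3))
```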

kerb.evaluation.calculate_meteor(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5)

Calculate METEOR score (simplified version without stemming/synonyms).

METEOR combines unigram precision and recall in a weighted harmonic mean and applies a fragmentation penalty that accounts for word order.

Parameters:

candidate (str) – The generated text to evaluate
reference (Union[str, List[str]]) – Reference text(s) (ground truth)
alpha (float) – Weight for recall vs precision (default: 0.9)
beta (float) – Shape parameter for f-mean (default: 3.0)
gamma (float) – Penalty weight for fragmentation (default: 0.5)

Returns:

METEOR score between 0 and 1

Return type:

float

Example

>>> calculate_meteor("the cat sat", "the cat sat on mat")
0.833

kerb.evaluation.calculate_exact_match(candidate, reference)

Calculate exact match score (1.0 if exact match, 0.0 otherwise).

Parameters:

candidate (str) – The generated text to evaluate
reference (Union[str, List[str]]) – Reference text(s) (ground truth)

Returns:

1.0 if exact match, 0.0 otherwise

Return type:

float

Example

>>> calculate_exact_match("Paris", "Paris")
1.0

kerb.evaluation.calculate_f1_score(candidate, reference)

Calculate token-level F1 score.

Parameters:

candidate (str) – The generated text to evaluate
reference (Union[str, List[str]]) – Reference text(s) (ground truth)

Returns:

F1 score between 0 and 1

Return type:

float

Example

>>> calculate_f1_score("the cat sat", "the cat sat on mat")
0.857

kerb.evaluation.calculate_semantic_similarity(text1, text2, method='embedding')

Calculate semantic similarity between two texts.

Parameters:

text1 (str) – First text
text2 (str) – Second text
method (Union[SimilarityMethod, str]) – Similarity method (SimilarityMethod enum or string: “embedding”, “cosine”, “jaccard”, “bleu”, “rouge”, “bertscore”)

Returns:

Similarity score between 0 and 1

Return type:

float

Examples

>>> from kerb.core.enums import SimilarityMethod
>>> text1, text2 = "the cat sat", "a cat was sitting"
>>> score = calculate_semantic_similarity(text1, text2, method=SimilarityMethod.EMBEDDING)
>>> calculate_semantic_similarity("cat", "kitten", method=SimilarityMethod.JACCARD)
0.0
>>> calculate_semantic_similarity("the cat sat", "the cat sits", method=SimilarityMethod.JACCARD)
0.5
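The two Jaccard results above follow directly from the set definition, |A ∩ B| / |A ∪ B| over word sets. A minimal sketch, assuming whitespace tokenization and lowercasing (the library's tokenization is not documented here, and `jaccard_sketch` is a hypothetical name):

```python
def jaccard_sketch(text1: str, text2: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)


print(jaccard_sketch("cat", "kitten"))                # 0.0 (no shared tokens)
print(jaccard_sketch("the cat sat", "the cat sits"))  # 0.5 (2 shared of 4 distinct)
```

Because the measure is purely lexical, "cat" and "kitten" score 0.0 despite being closely related; an embedding-based method would be the choice for capturing that kind of similarity.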

kerb.evaluation.assess_coherence(text)

Assess the coherence and logical flow of text.

Parameters:

text (str) – Text to assess

Returns:

Coherence score and details

Return type:

EvaluationResult

Example

>>> result = assess_coherence("First point. Second point follows. Conclusion makes sense.")
>>> result.score > 0.7
True

kerb.evaluation.assess_fluency(text)

Assess the fluency and naturalness of text.

Parameters:

text (str) – Text to assess

Returns:

Fluency score and details

Return type:

EvaluationResult

Example

>>> result = assess_fluency("This is a well-written sentence.")
>>> result.score > 0.8
True

kerb.evaluation.assess_faithfulness(output, source, method='entailment')

Assess whether output is faithful to the source material.

Parameters:

output (str) – Generated text
source (str) – Source material
method (Union[FaithfulnessMethod, str]) – Assessment method (FaithfulnessMethod enum or string: “entailment”, “nli”, “fact_check”, “llm”)

Returns:

Faithfulness score (1 = fully faithful, 0 = not faithful)

Return type:

EvaluationResult

Examples

>>> from kerb.core.enums import FaithfulnessMethod
>>> result = assess_faithfulness(output, source, method=FaithfulnessMethod.ENTAILMENT)
>>> result.score > 0.7
True

kerb.evaluation.assess_answer_relevance(answer, question, threshold=0.3)

Assess whether an answer is relevant to the question.

Parameters:

answer (str) – The answer text
question (str) – The question text
threshold (float) – Minimum overlap threshold

Returns:

Relevance score

Return type:

EvaluationResult

Example

>>> result = assess_answer_relevance(
...     "Python is a programming language",
...     "What is Python?"
... )
>>> result.score > 0.5
True

kerb.evaluation.detect_hallucination(output, context, threshold=0.3)

Detect potential hallucinations (unfounded claims not supported by context).

Parameters:

output (str) – Generated text to check
context (str) – Source context that should support the output
threshold (float) – Threshold for hallucination detection (lower = stricter)

Returns:

Hallucination score (0 = no hallucination, 1 = likely hallucination)

Return type:

EvaluationResult

Example

>>> result = detect_hallucination(
...     "Paris is the capital of Germany",
...     "Paris is the capital of France"
... )
>>> result.score > 0.5
True

kerb.evaluation.llm_as_judge(output, criterion, context=None, reference=None, scale=5, llm_function=None)

Use an LLM to judge the quality of an output.

Parameters:

output (str) – The text to evaluate
criterion (Union[str, JudgmentCriterion]) – Judgment criterion (relevance, accuracy, coherence, etc.)
context (Optional[str]) – Optional context (e.g., the prompt or question)
reference (Optional[str]) – Optional reference answer
scale (int) – Upper end of the rating scale (default: 5, i.e. ratings from 1 to 5)
llm_function (Optional[Callable]) – Function to call LLM (should accept prompt and return string)

Returns:

Result with score and reasoning

Return type:

EvaluationResult

Example

>>> result = llm_as_judge(
...     "Python is a programming language",
...     JudgmentCriterion.RELEVANCE,
...     context="What is Python?"
... )
>>> result.score
4.5
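The `llm_function` argument is described only as a callable that accepts a prompt and returns a string. The sketch below shows one way such a callable, and the score parsing around it, might look. Everything in it is hypothetical: `fake_llm`, `build_judge_prompt`, `parse_rating`, and `judge` are illustrative names, and the prompt wording and the "Score: N" reply format are assumptions rather than the library's behavior.

```python
import re
from typing import Callable, Optional, Tuple


def build_judge_prompt(output: str, criterion: str, context: Optional[str], scale: int) -> str:
    """Assemble a rating prompt (assumed wording)."""
    parts = [f"Rate the response for {criterion} on a scale of 1 to {scale}."]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Response: {output}")
    parts.append("Reply as 'Score: <number>' followed by a short justification.")
    return "\n".join(parts)


def parse_rating(reply: str, scale: int) -> Optional[float]:
    """Extract the numeric score and clamp it to [1, scale]; None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None
    return min(max(float(match.group(1)), 1.0), float(scale))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns a canned reply."""
    return "Score: 4\nThe response answers the question directly."


def judge(
    output: str,
    criterion: str,
    context: Optional[str] = None,
    scale: int = 5,
    llm_function: Callable[[str], str] = fake_llm,
) -> Tuple[Optional[float], str]:
    """Send the prompt to the callable and return (score, raw reply)."""
    reply = llm_function(build_judge_prompt(output, criterion, context, scale))
    return parse_rating(reply, scale), reply


score, reasoning = judge(
    "Python is a programming language", "relevance", context="What is Python?"
)
print(score)  # 4.0
```

Swapping `fake_llm` for a function that calls a real model keeps the rest of the flow unchanged, which is the point of accepting a plain callable.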

kerb.evaluation.pairwise_comparison(output_a, output_b, criterion, context=None, llm_function=None)

Compare two outputs using LLM-as-judge.

Parameters:

output_a (str) – First output to compare
output_b (str) – Second output to compare
criterion (str) – Comparison criterion
context (Optional[str]) – Optional context (e.g., the prompt)
llm_function (Optional[Callable]) – Function to call LLM

Returns:

Winner and reasoning

Return type:

ComparisonResult

Example

>>> result = pairwise_comparison(
...     "Python is great",
...     "Python is a high-level programming language",
...     "completeness"
... )
>>> result.winner
'b'

kerb.evaluation.ab_test(outputs_a, outputs_b, evaluation_fn, labels=('A', 'B'))

Perform A/B testing on two sets of outputs.

Parameters:

outputs_a (List[str]) – Outputs from variant A
outputs_b (List[str]) – Outputs from variant B
evaluation_fn (Callable[[str], float]) – Function to score each output (returns float)
labels (Tuple[str, str]) – Labels for variants (default: (“A”, “B”))

Returns:

A/B test results with statistics

Return type:

Dict[str, Any]

Example

>>> results = ab_test(
...     ["A clear and complete answer", "A great answer"],
...     ["OK answer", "Bad"],
...     lambda x: float(len(x.split()))
... )
>>> results['winner']
'A'
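The statistical side of an A/B comparison can be sketched with the standard library. The version below scores each output, compares the means, and reports Welch's t-statistic as a rough measure of how separated the two variants are. It is an illustration under stated assumptions, not the library's method: `ab_test_sketch` is a hypothetical name, and apart from `winner` the result keys are invented for the example.

```python
import math
import statistics
from typing import Any, Callable, Dict, List, Tuple


def ab_test_sketch(
    outputs_a: List[str],
    outputs_b: List[str],
    evaluation_fn: Callable[[str], float],
    labels: Tuple[str, str] = ("A", "B"),
) -> Dict[str, Any]:
    """Compare mean scores of two variants and compute Welch's t-statistic."""
    scores_a = [float(evaluation_fn(o)) for o in outputs_a]
    scores_b = [float(evaluation_fn(o)) for o in outputs_b]
    if len(scores_a) < 2 or len(scores_b) < 2:
        raise ValueError("each variant needs at least two outputs")
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    # Standard error of the difference in means, with unequal variances allowed.
    se = math.sqrt(
        statistics.variance(scores_a) / len(scores_a)
        + statistics.variance(scores_b) / len(scores_b)
    )
    t_stat = (mean_a - mean_b) / se if se > 0 else None
    if mean_a == mean_b:
        winner = None  # a tie in means yields no winner in this sketch
    else:
        winner = labels[0] if mean_a > mean_b else labels[1]
    return {"mean_a": mean_a, "mean_b": mean_b, "t_statistic": t_stat, "winner": winner}


res = ab_test_sketch(
    ["a b c d", "a b c"],        # word counts 4 and 3
    ["a b", "a"],                # word counts 2 and 1
    lambda s: len(s.split()),
)
print(res["winner"], round(res["t_statistic"], 3))  # A 2.828
```

With only two outputs per variant the t-statistic is weak evidence at best; larger samples are needed before a difference in means can be read as a real effect.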

kerb.evaluation.compare_outputs(outputs, metrics=None)

Compare multiple outputs using various metrics.

Parameters:

outputs (List[Tuple[str, str]]) – List of (id, output) tuples
metrics (Optional[List[str]]) – List of metrics to compute (default: all)

Returns:

Comparison results with rankings

Return type:

Dict[str, Any]

Example

>>> results = compare_outputs([
...     ("v1", "Short answer"),
...     ("v2", "This is a longer and more detailed answer")
... ])
>>> results['rankings']['length'][0]
'v2'

kerb.evaluation.run_benchmark(test_cases, generation_fn, evaluation_fn, threshold=0.7, name='benchmark')

Run a benchmark on a set of test cases.

Parameters:

test_cases (List[TestCase]) – List of test cases
generation_fn (Callable[[str], str]) – Function to generate output from input
evaluation_fn (Callable[[str, str], float]) – Function(output, expected_output) -> score between 0 and 1
threshold (float) – Pass threshold (default: 0.7)
name (str) – Benchmark name

Returns:

Benchmark results

Return type:

BenchmarkResult

Example

>>> cases = [TestCase(id="1", input="What is AI?", expected_output="Artificial Intelligence")]
>>> result = run_benchmark(cases, lambda x: "AI means " + x, lambda o, e: 0.8)
>>> result.pass_rate
100.0
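The benchmark flow described above (generate, score, compare against a threshold, aggregate) can be sketched as a plain loop. This is an illustration, not the library's code: `Case` is a hypothetical stand-in for `TestCase`, the result is a plain dict rather than a `BenchmarkResult`, and both the `(output, expected_output)` argument order and the `>=` threshold comparison are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Case:
    """Hypothetical stand-in for TestCase with only the fields used here."""

    id: str
    input: str
    expected_output: Optional[str] = None


def run_benchmark_sketch(
    test_cases: List[Case],
    generation_fn: Callable[[str], str],
    evaluation_fn: Callable[[str, str], float],
    threshold: float = 0.7,
) -> Dict[str, Any]:
    """Generate an output per case, score it, and aggregate the results."""
    start = time.perf_counter()
    scores = []
    for case in test_cases:
        output = generation_fn(case.input)
        scores.append(evaluation_fn(output, case.expected_output or ""))
    # Assumption: a case passes when its score reaches the threshold.
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "total_tests": len(scores),
        "passed_tests": passed,
        "failed_tests": len(scores) - passed,
        "average_score": sum(scores) / len(scores) if scores else 0.0,
        "scores": scores,
        "execution_time": time.perf_counter() - start,
    }


answers = {"2+2": "4", "3+3": "6"}  # canned "model" outputs for the demo
cases = [Case("1", "2+2", "4"), Case("2", "3+3", "7")]
summary = run_benchmark_sketch(
    cases,
    generation_fn=lambda question: answers[question],
    evaluation_fn=lambda output, expected: 1.0 if output == expected else 0.0,
)
print(summary["passed_tests"], summary["failed_tests"], summary["average_score"])  # 1 1 0.5
```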

kerb.evaluation.benchmark_prompts(prompts, test_inputs, generation_fn, evaluation_fn)

Benchmark multiple prompts against test inputs.

Parameters:

prompts (List[Tuple[str, str]]) – List of (prompt_id, prompt_template) tuples
test_inputs (List[str]) – List of test inputs
generation_fn (Callable[[str, str], str]) – Function(prompt, input) -> output
evaluation_fn (Callable[[str], float]) – Function(output) -> score

Returns:

Benchmark results for each prompt

Return type:

Dict[str, BenchmarkResult]

Example

>>> results = benchmark_prompts(
...     [("v1", "Answer: {input}"), ("v2", "Detailed answer: {input}")],
...     ["What is AI?", "What is ML?"],
...     lambda p, i: p.format(input=i),
...     lambda o: len(o.split()) / 10
... )
>>> len(results)
2

kerb.evaluation.calculate_statistics(scores)

Calculate statistical measures for a list of scores.

Parameters:

scores (List[float]) – List of numeric scores

Returns:

Statistical measures (mean, median, stdev, min, max, percentiles)

Return type:

Dict[str, float]

Example

>>> stats = calculate_statistics([0.5, 0.7, 0.8, 0.9, 1.0])
>>> stats['mean']
0.78
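Most of the listed measures map directly onto Python's `statistics` module. The sketch below is a rough equivalent under assumptions: `statistics_sketch` is a hypothetical name, the key names beyond `mean` are guesses, the sample (n - 1) standard deviation is assumed, and percentiles are omitted.

```python
import statistics
from typing import Dict, List


def statistics_sketch(scores: List[float]) -> Dict[str, float]:
    """Basic summary statistics for a list of scores (illustrative only)."""
    if not scores:
        return {}
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


stats = statistics_sketch([0.5, 0.7, 0.8, 0.9, 1.0])
print(round(stats["mean"], 2), stats["median"])  # 0.78 0.8
```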

kerb.evaluation.confidence_interval(scores, confidence=0.95)

Calculate confidence interval for scores.

Parameters:

scores (List[float]) – List of scores
confidence (float) – Confidence level (default: 0.95)

Returns:

(lower_bound, upper_bound)

Return type:

Tuple[float, float]

Example

>>> lower, upper = confidence_interval([0.7, 0.8, 0.9])
>>> lower < 0.8 < upper
True
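One common way to obtain such an interval is mean ± z × (standard error), using the normal approximation. The sketch below takes that route with the standard library; `confidence_interval_sketch` is a hypothetical name, and the documentation does not state which distribution the library uses. For small samples a Student's t interval would be wider.

```python
import math
import statistics
from typing import List, Tuple


def confidence_interval_sketch(
    scores: List[float], confidence: float = 0.95
) -> Tuple[float, float]:
    """Normal-approximation confidence interval for the mean of the scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


lower, upper = confidence_interval_sketch([0.7, 0.8, 0.9])
print(round(lower, 3), round(upper, 3))  # 0.687 0.913
```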
