Evaluation Module

Evaluation utilities for LLM applications.

This module provides comprehensive evaluation tools for assessing LLM outputs:

Ground Truth Metrics:

  • calculate_bleu() - BLEU score for n-gram overlap
  • calculate_rouge() - ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
  • calculate_meteor() - METEOR score with precision, recall, and word order
  • calculate_exact_match() - Exact match evaluation
  • calculate_f1_score() - Token-level F1 score
  • calculate_semantic_similarity() - Semantic similarity between texts

Quality Assessment:

  • assess_coherence() - Coherence and logical flow
  • assess_fluency() - Fluency and naturalness
  • assess_faithfulness() - Faithfulness to source material
  • assess_answer_relevance() - Answer relevance to question
  • detect_hallucination() - Detect unfounded claims

LLM-as-Judge:

  • llm_as_judge() - Use LLM to judge output quality
  • pairwise_comparison() - Compare two outputs using LLM

A/B Testing:

  • ab_test() - Statistical A/B testing of outputs
  • compare_outputs() - Multi-output comparison with rankings

Benchmarking:

  • run_benchmark() - Run benchmark on test cases
  • benchmark_prompts() - Benchmark multiple prompts

Statistical Analysis:

  • calculate_statistics() - Statistical measures (mean, median, stdev, etc.)
  • confidence_interval() - Confidence intervals for scores

Data Classes:

  • EvaluationResult - Result with score and details
  • ComparisonResult - Result of comparing outputs
  • BenchmarkResult - Benchmark run results
  • TestCase - Test case definition

Enums:

  • EvaluationMetric - Standard evaluation metrics
  • JudgmentCriterion - Criteria for LLM-as-judge

Examples

>>> # Common usage - core classes and metrics
>>> from kerb.evaluation import EvaluationResult, calculate_bleu, calculate_rouge
>>>
>>> # Specialized imports - metrics module
>>> from kerb.evaluation.metrics import (
...     calculate_meteor,
...     calculate_semantic_similarity
... )
>>>
>>> # Quality assessment
>>> from kerb.evaluation.quality import (
...     assess_coherence,
...     detect_hallucination
... )
>>>
>>> # LLM-as-judge
>>> from kerb.evaluation.judges import llm_as_judge, pairwise_comparison
>>>
>>> # Benchmarking
>>> from kerb.evaluation.benchmarks import run_benchmark
class kerb.evaluation.EvaluationResult(metric, score, details=<factory>, passed=None)[source]

Bases: object

Result of an evaluation with score and details.

metric: str
score: float
details: Dict[str, Any]
passed: bool | None = None
__init__(metric, score, details=<factory>, passed=None)
class kerb.evaluation.ComparisonResult(output_a_id, output_b_id, winner, scores, confidence=0.0, reasoning='')[source]

Bases: object

Result of comparing two outputs.

output_a_id: str
output_b_id: str
winner: str | None
scores: Dict[str, float]
confidence: float = 0.0
reasoning: str = ''
__init__(output_a_id, output_b_id, winner, scores, confidence=0.0, reasoning='')
class kerb.evaluation.BenchmarkResult(name, total_tests, passed_tests, failed_tests, average_score, scores, execution_time=0.0, details=<factory>)[source]

Bases: object

Result of a benchmark run.

name: str
total_tests: int
passed_tests: int
failed_tests: int
average_score: float
scores: List[float]
execution_time: float = 0.0
details: Dict[str, Any]
property pass_rate: float

Percentage of test cases that passed (0 to 100).

__init__(name, total_tests, passed_tests, failed_tests, average_score, scores, execution_time=0.0, details=<factory>)
class kerb.evaluation.TestCase(id, input, expected_output=None, metadata=<factory>, reference_outputs=<factory>)[source]

Bases: object

A single test case for evaluation.

id: str
input: str
expected_output: str | None = None
metadata: Dict[str, Any]
reference_outputs: List[str]
__init__(id, input, expected_output=None, metadata=<factory>, reference_outputs=<factory>)
class kerb.evaluation.EvaluationMetric(*values)[source]

Bases: Enum

Standard evaluation metrics.

BLEU = 'bleu'
ROUGE_1 = 'rouge-1'
ROUGE_2 = 'rouge-2'
ROUGE_L = 'rouge-l'
METEOR = 'meteor'
BERTSCORE = 'bertscore'
EXACT_MATCH = 'exact_match'
F1 = 'f1'
SEMANTIC_SIMILARITY = 'semantic_similarity'
class kerb.evaluation.JudgmentCriterion(*values)[source]

Bases: Enum

Criteria for LLM-as-judge evaluation.

RELEVANCE = 'relevance'
ACCURACY = 'accuracy'
COMPLETENESS = 'completeness'
COHERENCE = 'coherence'
FLUENCY = 'fluency'
HELPFULNESS = 'helpfulness'
HARMLESSNESS = 'harmlessness'
FAITHFULNESS = 'faithfulness'
CONSISTENCY = 'consistency'
kerb.evaluation.calculate_bleu(candidate, reference, n=4, weights=None)[source]

Calculate BLEU score between candidate and reference text(s).

BLEU (Bilingual Evaluation Understudy) measures n-gram precision of the candidate against the reference(s) and applies a brevity penalty to candidates shorter than the reference.

Parameters:
  • candidate (str) – The generated text to evaluate

  • reference (Union[str, List[str]]) – Reference text(s) (ground truth)

  • n (int) – Maximum n-gram length (default: 4 for BLEU-4)

  • weights (Optional[List[float]]) – Weights for each n-gram (default: equal weights)

Returns:

BLEU score between 0 and 1

Return type:

float

Example

>>> calculate_bleu("the cat sat", "the cat sat on mat")
0.7598
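
For intuition, the two ingredients named above (clipped n-gram precision and the brevity penalty) can be sketched in a few lines. This is an illustrative stand-alone version, not the implementation behind calculate_bleu(): the name simple_bleu, whitespace tokenization, uniform weights, and the absence of smoothing are all assumptions, so its values will generally differ from the library's.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Hypothetical BLEU sketch: geometric mean of clipped n-gram
    precisions (uniform weights, no smoothing) times a brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / total) / max_n

    # Brevity penalty: only candidates shorter than the reference are penalized.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum)


score = simple_bleu("the cat sat", "the cat sat on mat")
```

Here every candidate unigram and bigram appears in the reference, so the score is determined entirely by the brevity penalty exp(1 - 5/3).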
kerb.evaluation.calculate_rouge(candidate, reference, rouge_type='rouge-l', beta=1.2)[source]

Calculate ROUGE scores between candidate and reference text(s).

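ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference is recovered by the candidate: n-gram overlap for ROUGE-1 and ROUGE-2, and longest common subsequence for ROUGE-L.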

Parameters:
  • candidate (str) – The generated text to evaluate

  • reference (Union[str, List[str]]) – Reference text(s) (ground truth)

  • rouge_type (str) – Type of ROUGE - “rouge-1”, “rouge-2”, “rouge-l”

  • beta (float) – Beta parameter for the F-measure (default: 1.2; values above 1 weight recall more heavily than precision)

Returns:

Dictionary with ‘precision’, ‘recall’, ‘fmeasure’ scores

Return type:

Dict[str, float]

Example

>>> calculate_rouge("the cat sat", "the cat sat on mat", "rouge-1")
{'precision': 1.0, 'recall': 0.6, 'fmeasure': 0.75}
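
The relationship between precision, recall, and beta can be sketched with a stand-alone ROUGE-1 computation. The function below is hypothetical (rouge_1_sketch is not part of kerb) and assumes lowercased whitespace tokens and the textbook F-beta formula; whether the library applies beta in exactly this way is not confirmed by this page.

```python
from collections import Counter


def rouge_1_sketch(candidate, reference, beta=1.0):
    """Hypothetical ROUGE-1: unigram overlap with an F-beta measure."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    beta_sq = beta ** 2
    fmeasure = (1 + beta_sq) * precision * recall / (recall + beta_sq * precision)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


balanced = rouge_1_sketch("the cat sat", "the cat sat on mat", beta=1.0)
recall_weighted = rouge_1_sketch("the cat sat", "the cat sat on mat", beta=1.2)
```

With precision 1.0 and recall 0.6, this formula gives an F-measure of 0.75 for beta=1.0 and roughly 0.718 for beta=1.2.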
kerb.evaluation.calculate_meteor(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5)[source]

Calculate METEOR score (simplified version without stemming/synonyms).

METEOR combines precision and recall in a weighted harmonic mean and applies a fragmentation penalty that accounts for word order.

Parameters:
  • candidate (str) – The generated text to evaluate

  • reference (Union[str, List[str]]) – Reference text(s) (ground truth)

  • alpha (float) – Weight for recall vs precision (default: 0.9)

  • beta (float) – Shape parameter for f-mean (default: 3.0)

  • gamma (float) – Penalty weight for fragmentation (default: 0.5)

Returns:

METEOR score between 0 and 1

Return type:

float

Example

>>> calculate_meteor("the cat sat", "the cat sat on mat")
0.833
kerb.evaluation.calculate_exact_match(candidate, reference)[source]

Calculate exact match score (1.0 if exact match, 0.0 otherwise).

Parameters:
  • candidate (str) – The generated text to evaluate

  • reference (Union[str, List[str]]) – Reference text(s) (ground truth)

Returns:

1.0 if exact match, 0.0 otherwise

Return type:

float

Example

>>> calculate_exact_match("Paris", "Paris")
1.0
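
Because reference accepts either a single string or a list of strings, one plausible reading is that a candidate counts as a match when it equals any of the references. The sketch below illustrates that reading only; exact_match_any is a hypothetical name, and the whitespace stripping, case sensitivity, and any-match semantics are assumptions rather than documented behavior of calculate_exact_match().

```python
from typing import List, Union


def exact_match_any(candidate: str, reference: Union[str, List[str]]) -> float:
    """Hypothetical exact match: 1.0 if the candidate equals any reference."""
    references = [reference] if isinstance(reference, str) else list(reference)
    # Assumption: surrounding whitespace is ignored, case is not.
    matched = any(candidate.strip() == ref.strip() for ref in references)
    return 1.0 if matched else 0.0


single = exact_match_any("Paris", "Paris")
multiple = exact_match_any("Paris", ["paris", "Paris, France", "Paris"])
miss = exact_match_any("Paris", ["London", "Berlin"])
```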
kerb.evaluation.calculate_f1_score(candidate, reference)[source]

Calculate token-level F1 score.

Parameters:
  • candidate (str) – The generated text to evaluate

  • reference (Union[str, List[str]]) – Reference text(s) (ground truth)

Returns:

F1 score between 0 and 1

Return type:

float

Example

>>> calculate_f1_score("the cat sat", "the cat sat on mat")
0.857
kerb.evaluation.calculate_semantic_similarity(text1, text2, method='embedding')[source]

Calculate semantic similarity between two texts.

Parameters:
  • text1 (str) – First text

  • text2 (str) – Second text

  • method (Union[SimilarityMethod, str]) – Similarity method (SimilarityMethod enum or string: “embedding”, “cosine”, “jaccard”, “bleu”, “rouge”, “bertscore”)

Returns:

Similarity score between 0 and 1

Return type:

float

Examples

>>> from kerb.core.enums import SimilarityMethod
>>> score = calculate_semantic_similarity(text1, text2, method=SimilarityMethod.EMBEDDING)
>>> calculate_semantic_similarity("cat", "kitten", method=SimilarityMethod.JACCARD)
0.0
>>> calculate_semantic_similarity("the cat sat", "the cat sits", method=SimilarityMethod.JACCARD)
0.5
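
The two Jaccard values above follow from the standard definition, the size of the token intersection divided by the size of the token union. A minimal stand-alone sketch (jaccard_sketch is a hypothetical name, and lowercased whitespace tokenization is an assumption) reproduces them:

```python
def jaccard_sketch(text1: str, text2: str) -> float:
    """Hypothetical Jaccard similarity over lowercased whitespace tokens."""
    tokens1 = set(text1.lower().split())
    tokens2 = set(text2.lower().split())
    if not tokens1 and not tokens2:
        return 1.0  # assumption: two empty texts count as identical
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)


no_overlap = jaccard_sketch("cat", "kitten")
partial = jaccard_sketch("the cat sat", "the cat sits")
```

"cat" and "kitten" share no tokens, so the score is 0.0 even though the words are related in meaning. The second pair shares 2 of 4 distinct tokens, which gives 0.5.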
kerb.evaluation.assess_coherence(text)[source]

Assess the coherence and logical flow of text.

Parameters:

text (str) – Text to assess

Returns:

Coherence score and details

Return type:

EvaluationResult

Example

>>> result = assess_coherence("First point. Second point follows. Conclusion makes sense.")
>>> result.score > 0.7
True
kerb.evaluation.assess_fluency(text)[source]

Assess the fluency and naturalness of text.

Parameters:

text (str) – Text to assess

Returns:

Fluency score and details

Return type:

EvaluationResult

Example

>>> result = assess_fluency("This is a well-written sentence.")
>>> result.score > 0.8
True
kerb.evaluation.assess_faithfulness(output, source, method='entailment')[source]

Assess whether output is faithful to the source material.

Parameters:
  • output (str) – Generated text

  • source (str) – Source material

  • method (Union[FaithfulnessMethod, str]) – Assessment method (FaithfulnessMethod enum or string: “entailment”, “nli”, “fact_check”, “llm”)

Returns:

Faithfulness score (1 = fully faithful, 0 = not faithful)

Return type:

EvaluationResult

Examples

>>> from kerb.core.enums import FaithfulnessMethod
>>> result = assess_faithfulness(output, source, method=FaithfulnessMethod.ENTAILMENT)
>>> result.score > 0.7
True
kerb.evaluation.assess_answer_relevance(answer, question, threshold=0.3)[source]

Assess whether an answer is relevant to the question.

Parameters:
  • answer (str) – The answer text

  • question (str) – The question text

  • threshold (float) – Minimum overlap threshold

Returns:

Relevance score

Return type:

EvaluationResult

Example

>>> result = assess_answer_relevance(
...     "Python is a programming language",
...     "What is Python?"
... )
>>> result.score > 0.5
True
kerb.evaluation.detect_hallucination(output, context, threshold=0.3)[source]

Detect potential hallucinations (unfounded claims not supported by context).

Parameters:
  • output (str) – Generated text to check

  • context (str) – Source context that should support the output

  • threshold (float) – Threshold for hallucination detection (lower = stricter)

Returns:

Hallucination score (0 = no hallucination, 1 = likely hallucination)

Return type:

EvaluationResult

Example

>>> result = detect_hallucination(
...     "Paris is the capital of Germany",
...     "Paris is the capital of France"
... )
>>> result.score > 0.5
True
kerb.evaluation.llm_as_judge(output, criterion, context=None, reference=None, scale=5, llm_function=None)[source]

Use an LLM to judge the quality of an output.

Parameters:
  • output (str) – The text to evaluate

  • criterion (Union[str, JudgmentCriterion]) – Judgment criterion (relevance, accuracy, coherence, etc.)

  • context (Optional[str]) – Optional context (e.g., the prompt or question)

  • reference (Optional[str]) – Optional reference answer

  • scale (int) – Upper bound of the rating scale (default: 5, i.e. ratings from 1 to 5)

  • llm_function (Optional[Callable]) – Function to call LLM (should accept prompt and return string)

Returns:

Result with score and reasoning

Return type:

EvaluationResult

Example

>>> result = llm_as_judge(
...     "Python is a programming language",
...     JudgmentCriterion.RELEVANCE,
...     context="What is Python?"
... )
>>> result.score
4.5
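
The llm_function parameter is described only as a callable that accepts a prompt and returns a string. The sketch below shows the general shape of such a callable together with a prompt builder and score parser. Everything in it is hypothetical: the prompt wording, the "Score:" reply format, and the names build_judge_prompt, parse_score, and fake_llm are illustrative and do not describe how llm_as_judge() builds prompts or parses replies.

```python
import re
from typing import Optional


def build_judge_prompt(output: str, criterion: str,
                       context: Optional[str] = None, scale: int = 5) -> str:
    """Assemble a hypothetical judge prompt for a 1..scale rating."""
    parts = [f"Rate the following output for {criterion} on a scale of 1 to {scale}."]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Output: {output}")
    parts.append("Reply in the form 'Score: <number>' followed by your reasoning.")
    return "\n".join(parts)


def parse_score(response: str, scale: int = 5) -> Optional[float]:
    """Extract 'Score: <number>' and clamp it to [1, scale]; None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", response)
    if match is None:
        return None
    return min(max(float(match.group(1)), 1.0), float(scale))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: accepts a prompt, returns a string."""
    return "Score: 4\nThe answer addresses the question directly."


prompt = build_judge_prompt("Python is a programming language",
                            "relevance", context="What is Python?")
score = parse_score(fake_llm(prompt))
```

A deterministic stub like fake_llm also makes judge-based evaluations testable without network access.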
kerb.evaluation.pairwise_comparison(output_a, output_b, criterion, context=None, llm_function=None)[source]

Compare two outputs using LLM-as-judge.

Parameters:
  • output_a (str) – First output to compare

  • output_b (str) – Second output to compare

  • criterion (str) – Comparison criterion

  • context (Optional[str]) – Optional context (e.g., the prompt)

  • llm_function (Optional[Callable]) – Function to call LLM

Returns:

Winner and reasoning

Return type:

ComparisonResult

Example

>>> result = pairwise_comparison(
...     "Python is great",
...     "Python is a high-level programming language",
...     "completeness"
... )
>>> result.winner
'b'
kerb.evaluation.ab_test(outputs_a, outputs_b, evaluation_fn, labels=('A', 'B'))[source]

Perform A/B testing on two sets of outputs.

Parameters:
  • outputs_a (List[str]) – Outputs from variant A

  • outputs_b (List[str]) – Outputs from variant B

  • evaluation_fn (Callable[[str], float]) – Function to score each output (returns float)

  • labels (Tuple[str, str]) – Labels for variants (default: (“A”, “B”))

Returns:

A/B test results with statistics

Return type:

Dict[str, Any]

Example

>>> results = ab_test(
...     ["A clear, complete answer", "Another thorough, detailed answer"],
...     ["OK answer", "Bad answer"],
...     lambda x: float(len(x.split()))
... )
>>> results['winner']
'A'
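
The statistics behind an A/B comparison can be sketched with the standard library alone. The version below compares per-variant means and reports Welch's t statistic. It is an assumption that this resembles what ab_test() computes: the name ab_test_sketch, the returned keys, and the choice of Welch's statistic are illustrative only.

```python
import math
import statistics
from typing import Any, Dict, List, Tuple


def ab_test_sketch(scores_a: List[float], scores_b: List[float],
                   labels: Tuple[str, str] = ("A", "B")) -> Dict[str, Any]:
    """Hypothetical A/B summary: means, Welch's t statistic, and winner."""
    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    # Sample variances need at least two scores per variant.
    var_a = statistics.variance(scores_a)
    var_b = statistics.variance(scores_b)
    std_error = math.sqrt(var_a / len(scores_a) + var_b / len(scores_b))
    t_stat = (mean_a - mean_b) / std_error if std_error > 0 else 0.0
    if mean_a > mean_b:
        winner = labels[0]
    elif mean_b > mean_a:
        winner = labels[1]
    else:
        winner = None  # exact tie: no winner declared
    return {"mean_a": mean_a, "mean_b": mean_b, "t_stat": t_stat, "winner": winner}


result = ab_test_sketch([0.80, 0.90, 0.85], [0.60, 0.70, 0.65])
tie = ab_test_sketch([0.5, 0.7], [0.7, 0.5])
```

A larger absolute t statistic indicates a difference that is less likely to be noise. With only a handful of samples per variant, the winner should be treated as tentative.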
kerb.evaluation.compare_outputs(outputs, metrics=None)[source]

Compare multiple outputs using various metrics.

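Parameters:
  • outputs – Outputs to compare, given as (output_id, text) tuples as in the example below

  • metrics – Metrics to compute (default: None)
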
Returns:

Comparison results with rankings

Return type:

Dict[str, Any]

Example

>>> results = compare_outputs([
...     ("v1", "Short answer"),
...     ("v2", "This is a longer and more detailed answer")
... ])
>>> results['rankings']['length'][0]
'v2'
kerb.evaluation.run_benchmark(test_cases, generation_fn, evaluation_fn, threshold=0.7, name='benchmark')[source]

Run a benchmark on a set of test cases.

Parameters:
  • test_cases (List[TestCase]) – List of test cases

  • generation_fn (Callable[[str], str]) – Function to generate output from input

  • evaluation_fn (Callable[[str, str], float]) – Function to evaluate output (returns score 0-1)

  • threshold (float) – Pass threshold (default: 0.7)

  • name (str) – Benchmark name

Returns:

Benchmark results

Return type:

BenchmarkResult

Example

>>> cases = [TestCase(id="1", input="What is AI?", expected_output="Artificial Intelligence")]
>>> result = run_benchmark(cases, lambda x: "AI means " + x, lambda o, e: 0.8)
>>> result.pass_rate
100.0
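
The generate, score, and threshold loop described above can be sketched without the library. In this stand-alone version, Case, token_overlap, and run_cases are hypothetical stand-ins, and treating a score equal to the threshold as a pass is an assumption. The scorer has the same (output, expected) shape as the evaluation_fn in the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Case:
    """Hypothetical stand-in for a test case."""
    id: str
    input: str
    expected_output: Optional[str] = None


def token_overlap(output: str, expected: str) -> float:
    """Fraction of expected tokens that appear in the output (0 to 1)."""
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.lower().split())
    return len(out_tokens & exp_tokens) / len(exp_tokens) if exp_tokens else 0.0


def run_cases(cases: List[Case],
              generation_fn: Callable[[str], str],
              evaluation_fn: Callable[[str, str], float],
              threshold: float = 0.7) -> Dict[str, object]:
    """Generate an output per case, score it, and compute the pass rate."""
    scores = [evaluation_fn(generation_fn(case.input), case.expected_output or "")
              for case in cases]
    passed = sum(score >= threshold for score in scores)  # assumption: >= passes
    pass_rate = 100.0 * passed / len(scores) if scores else 0.0
    return {"scores": scores, "passed": passed, "pass_rate": pass_rate}


cases = [Case(id="1", input="What is AI?", expected_output="Artificial Intelligence")]
result = run_cases(cases, lambda q: "AI stands for artificial intelligence", token_overlap)
```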
kerb.evaluation.benchmark_prompts(prompts, test_inputs, generation_fn, evaluation_fn)[source]

Benchmark multiple prompts against test inputs.

Parameters:
  • prompts (List[Tuple[str, str]]) – List of (prompt_id, prompt_template) tuples

  • test_inputs (List[str]) – List of test inputs

  • generation_fn (Callable[[str, str], str]) – Function(prompt, input) -> output

  • evaluation_fn (Callable[[str], float]) – Function(output) -> score

Returns:

Benchmark results for each prompt

Return type:

Dict[str, BenchmarkResult]

Example

>>> results = benchmark_prompts(
...     [("v1", "Answer: {input}"), ("v2", "Detailed answer: {input}")],
...     ["What is AI?", "What is ML?"],
...     lambda p, i: p.format(input=i),
...     lambda o: len(o.split()) / 10
... )
>>> len(results)
2
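
To see what each prompt's score looks like for the inputs in the example, the loop can be sketched as a plain average per prompt. The helper average_prompt_scores is hypothetical and returns bare floats rather than BenchmarkResult objects. It only illustrates how generation_fn(prompt, input) and evaluation_fn(output) fit together.

```python
from typing import Callable, Dict, List, Tuple


def average_prompt_scores(prompts: List[Tuple[str, str]],
                          test_inputs: List[str],
                          generation_fn: Callable[[str, str], str],
                          evaluation_fn: Callable[[str], float]) -> Dict[str, float]:
    """Hypothetical sketch: mean score of each prompt over all test inputs."""
    results = {}
    for prompt_id, template in prompts:
        scores = [evaluation_fn(generation_fn(template, text)) for text in test_inputs]
        results[prompt_id] = sum(scores) / len(scores) if scores else 0.0
    return results


averages = average_prompt_scores(
    [("v1", "Answer: {input}"), ("v2", "Detailed answer: {input}")],
    ["What is AI?", "What is ML?"],
    lambda p, i: p.format(input=i),   # fill the template instead of calling a model
    lambda o: len(o.split()) / 10,    # toy score: word count / 10
)
```

With this toy scorer the v2 template scores higher simply because it produces one more word per output, which is a reminder that the evaluation function determines what the benchmark actually measures.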
kerb.evaluation.calculate_statistics(scores)[source]

Calculate statistical measures for a list of scores.

Parameters:

scores (List[float]) – List of numeric scores

Returns:

Statistical measures (mean, median, stdev, min, max, percentiles)

Return type:

Dict[str, float]

Example

>>> stats = calculate_statistics([0.5, 0.7, 0.8, 0.9, 1.0])
>>> stats['mean']
0.78
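
Most of the measures listed above are available in Python's standard statistics module. The sketch below computes a subset of them for the same scores. The function name describe_scores and the returned keys are assumptions and may not match the keys calculate_statistics() returns.

```python
import statistics
from typing import Dict, List


def describe_scores(scores: List[float]) -> Dict[str, float]:
    """Hypothetical summary of a score list using the standard library."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation is undefined for a single score.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


stats = describe_scores([0.5, 0.7, 0.8, 0.9, 1.0])
```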
kerb.evaluation.confidence_interval(scores, confidence=0.95)[source]

Calculate confidence interval for scores.

Parameters:
  • scores (List[float]) – List of scores

  • confidence (float) – Confidence level (default: 0.95)

Returns:

(lower_bound, upper_bound)

Return type:

Tuple[float, float]

Example

>>> lower, upper = confidence_interval([0.7, 0.8, 0.9])
>>> lower < 0.8 < upper
True
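
One common way to obtain such an interval is the normal approximation: the mean plus or minus z times the standard error. The sketch below uses that approach with the standard library. It is an assumption that confidence_interval() works this way, since it may instead use a t-distribution or bootstrapping, and normal_confidence_interval is a hypothetical name.

```python
import math
import statistics
from statistics import NormalDist
from typing import List, Tuple


def normal_confidence_interval(scores: List[float],
                               confidence: float = 0.95) -> Tuple[float, float]:
    """Hypothetical normal-approximation confidence interval for the mean."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return (mean, mean)  # no spread can be estimated from one score
    std_error = statistics.stdev(scores) / math.sqrt(len(scores))
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (mean - z * std_error, mean + z * std_error)


lower, upper = normal_confidence_interval([0.7, 0.8, 0.9])
```

For very small samples like this one, the normal approximation understates the uncertainty, and a t-based interval would be wider.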
