Evaluation Module
Evaluation utilities for LLM applications.
This module provides comprehensive evaluation tools for assessing LLM outputs:
- Ground Truth Metrics:
  calculate_bleu() - BLEU score for n-gram overlap
  calculate_rouge() - ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
  calculate_meteor() - METEOR score with precision, recall, and word order
  calculate_exact_match() - Exact match evaluation
  calculate_f1_score() - Token-level F1 score
  calculate_semantic_similarity() - Semantic similarity between texts
- Quality Assessment:
  assess_coherence() - Coherence and logical flow
  assess_fluency() - Fluency and naturalness
  assess_faithfulness() - Faithfulness to source material
  assess_answer_relevance() - Answer relevance to question
  detect_hallucination() - Detect unfounded claims
- LLM-as-Judge:
  llm_as_judge() - Use LLM to judge output quality
  pairwise_comparison() - Compare two outputs using LLM
- A/B Testing:
  ab_test() - Statistical A/B testing of outputs
  compare_outputs() - Multi-output comparison with rankings
- Benchmarking:
  run_benchmark() - Run benchmark on test cases
  benchmark_prompts() - Benchmark multiple prompts
- Statistical Analysis:
  calculate_statistics() - Statistical measures (mean, median, stdev, etc.)
  confidence_interval() - Confidence intervals for scores
- Data Classes:
  EvaluationResult - Result with score and details
  ComparisonResult - Result of comparing outputs
  BenchmarkResult - Benchmark run results
  TestCase - Test case definition
- Enums:
  EvaluationMetric - Standard evaluation metrics
  JudgmentCriterion - Criteria for LLM-as-judge
Examples
>>> # Common usage - core classes and metrics
>>> from kerb.evaluation import EvaluationResult, calculate_bleu, calculate_rouge
>>>
>>> # Specialized imports - metrics module
>>> from kerb.evaluation.metrics import (
...     calculate_meteor,
...     calculate_semantic_similarity
... )
>>>
>>> # Quality assessment
>>> from kerb.evaluation.quality import (
...     assess_coherence,
...     detect_hallucination
... )
>>>
>>> # LLM-as-judge
>>> from kerb.evaluation.judges import llm_as_judge, pairwise_comparison
>>>
>>> # Benchmarking
>>> from kerb.evaluation.benchmarks import run_benchmark
- class kerb.evaluation.EvaluationResult(metric, score, details=<factory>, passed=None)[source]
Bases: object
Result of an evaluation with score and details.
- __init__(metric, score, details=<factory>, passed=None)
- class kerb.evaluation.ComparisonResult(output_a_id, output_b_id, winner, scores, confidence=0.0, reasoning='')[source]
Bases: object
Result of comparing two outputs.
- __init__(output_a_id, output_b_id, winner, scores, confidence=0.0, reasoning='')
- class kerb.evaluation.BenchmarkResult(name, total_tests, passed_tests, failed_tests, average_score, scores, execution_time=0.0, details=<factory>)[source]
Bases: object
Result of a benchmark run.
- __init__(name, total_tests, passed_tests, failed_tests, average_score, scores, execution_time=0.0, details=<factory>)
- class kerb.evaluation.TestCase(id, input, expected_output=None, metadata=<factory>, reference_outputs=<factory>)[source]
Bases: object
A single test case for evaluation.
- __init__(id, input, expected_output=None, metadata=<factory>, reference_outputs=<factory>)
- class kerb.evaluation.EvaluationMetric(*values)[source]
Bases: Enum
Standard evaluation metrics.
- BLEU = 'bleu'
- ROUGE_1 = 'rouge-1'
- ROUGE_2 = 'rouge-2'
- ROUGE_L = 'rouge-l'
- METEOR = 'meteor'
- BERTSCORE = 'bertscore'
- EXACT_MATCH = 'exact_match'
- F1 = 'f1'
- SEMANTIC_SIMILARITY = 'semantic_similarity'
- class kerb.evaluation.JudgmentCriterion(*values)[source]
Bases: Enum
Criteria for LLM-as-judge evaluation.
- RELEVANCE = 'relevance'
- ACCURACY = 'accuracy'
- COMPLETENESS = 'completeness'
- COHERENCE = 'coherence'
- FLUENCY = 'fluency'
- HELPFULNESS = 'helpfulness'
- HARMLESSNESS = 'harmlessness'
- FAITHFULNESS = 'faithfulness'
- CONSISTENCY = 'consistency'
- kerb.evaluation.calculate_bleu(candidate, reference, n=4, weights=None)[source]
Calculate BLEU score between candidate and reference text(s).
BLEU (Bilingual Evaluation Understudy) measures n-gram overlap with brevity penalty.
- Parameters:
- Returns:
BLEU score between 0 and 1
- Return type:
Example
>>> calculate_bleu("the cat sat", "the cat sat on mat")
0.7598
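To make the description of n-gram overlap and brevity penalty concrete, here is a minimal standalone sketch of sentence-level BLEU. It is not the kerb implementation: the name bleu_sketch, the whitespace tokenization, the uniform weights, the single reference, and the absence of smoothing are all assumptions, so its values need not match what calculate_bleu() returns.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(candidate, reference, max_n=4):
    """Hypothetical helper: BLEU with uniform weights and one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            # Assumption: no smoothing, so one empty order zeroes the score.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because this sketch applies no smoothing, a three-token candidate scores 0.0 at max_n=4 (it contains no 4-grams); lowering max_n to 2 leaves only the brevity penalty for the pair used in the example above.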
- kerb.evaluation.calculate_rouge(candidate, reference, rouge_type='rouge-l', beta=1.2)[source]
Calculate ROUGE scores between candidate and reference text(s).
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures recall of n-grams.
- Parameters:
- Returns:
Dictionary with ‘precision’, ‘recall’, ‘fmeasure’ scores
- Return type:
Example
>>> calculate_rouge("the cat sat", "the cat sat on mat", "rouge-1")
{'precision': 1.0, 'recall': 0.6, 'fmeasure': 0.75}
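The precision, recall, and F-measure in the example can be reproduced by hand. The following is a minimal sketch of ROUGE-N over whitespace tokens, assuming a single reference and a balanced F-measure; rouge_n_sketch is a hypothetical name, and the way the library applies its beta parameter is not shown here.

```python
from collections import Counter


def rouge_n_sketch(candidate, reference, n=1):
    """Hypothetical helper: ROUGE-N precision, recall, and balanced F-measure."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))

    # Multiset intersection counts each shared n-gram at most as often
    # as it appears on both sides.
    overlap = sum((cand_counts & ref_counts).values())
    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        fmeasure = 0.0
    else:
        fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}
```

For "the cat sat" against "the cat sat on mat" with n=1, all three candidate tokens appear in the five-token reference, giving precision 3/3, recall 3/5, and a balanced F-measure of about 0.75.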
- kerb.evaluation.calculate_meteor(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5)[source]
Calculate METEOR score (simplified version without stemming/synonyms).
METEOR considers precision, recall, and word order with harmonic mean.
- Parameters:
candidate (str) – The generated text to evaluate
reference (Union[str, List[str]]) – Reference text(s) (ground truth)
alpha (float) – Weight for recall vs precision (default: 0.9)
beta (float) – Shape parameter for f-mean (default: 3.0)
gamma (float) – Penalty weight for fragmentation (default: 0.5)
- Returns:
METEOR score between 0 and 1
- Return type:
Example
>>> calculate_meteor("the cat sat", "the cat sat on mat")
0.833
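As an illustration of how precision, recall, and word order can combine, here is a sketch that follows the conventional exact-match METEOR formulation: a recall-weighted harmonic mean multiplied by a fragmentation penalty. It is an assumption that this resembles the library's simplified version. In particular, this sketch uses beta as the exponent of the penalty, which may differ from how calculate_meteor() uses it, so the two can return different values. The name meteor_sketch and the greedy alignment are hypothetical.

```python
def meteor_sketch(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """Hypothetical helper: exact-match METEOR without stemming or synonyms."""
    cand = candidate.lower().split()
    ref = reference.lower().split()

    # Greedy alignment: each candidate token takes the first unused
    # reference position holding the same token.
    used = set()
    alignment = []  # (candidate index, reference index) pairs
    for i, tok in enumerate(cand):
        for j, ref_tok in enumerate(ref):
            if j not in used and tok == ref_tok:
                used.add(j)
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(cand)
    recall = matches / len(ref)
    # Weighted harmonic mean; alpha close to 1 favors recall.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both texts.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1

    # More chunks per match means more reordering, hence a larger penalty.
    penalty = gamma * (chunks / matches) ** beta
    return fmean * (1 - penalty)
```

With this formulation, even identical texts score slightly below 1, because a single chunk still incurs a small penalty of gamma * (1 / matches) ** beta.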
- kerb.evaluation.calculate_exact_match(candidate, reference)[source]
Calculate exact match score (1.0 if exact match, 0.0 otherwise).
- Parameters:
- Returns:
1.0 if exact match, 0.0 otherwise
- Return type:
Example
>>> calculate_exact_match("Paris", "Paris")
1.0
- kerb.evaluation.calculate_f1_score(candidate, reference)[source]
Calculate token-level F1 score.
- Parameters:
- Returns:
F1 score between 0 and 1
- Return type:
Example
>>> calculate_f1_score("the cat sat", "the cat sat on mat")
0.857
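For reference, token-level F1 is commonly computed as the harmonic mean of token precision and recall over a bag of tokens. The sketch below assumes lowercased whitespace tokens and a single reference; token_f1_sketch is a hypothetical name, and the library's tokenization or normalization may differ, so its score for a given pair need not coincide with this one.

```python
from collections import Counter


def token_f1_sketch(candidate, reference):
    """Hypothetical helper: bag-of-tokens F1 between two strings."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Shared tokens, counted with multiplicity.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, "the cat sat" against "the cat sat on mat" has precision 1.0 and recall 0.6, for an F1 of about 0.75.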
- kerb.evaluation.calculate_semantic_similarity(text1, text2, method='embedding')[source]
Calculate semantic similarity between two texts.
- Parameters:
- Returns:
Similarity score between 0 and 1
- Return type:
Examples
>>> from kerb.core.enums import SimilarityMethod
>>> score = calculate_semantic_similarity(text1, text2, method=SimilarityMethod.EMBEDDING)
>>> calculate_semantic_similarity("cat", "kitten", method=SimilarityMethod.JACCARD)
0.0
>>> calculate_semantic_similarity("the cat sat", "the cat sits", method=SimilarityMethod.JACCARD)
0.5
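The two Jaccard values in the examples follow from set arithmetic on the word sets. A minimal sketch, assuming lowercased whitespace tokens (jaccard_sketch is a hypothetical name, not part of the library):

```python
def jaccard_sketch(text1, text2):
    """Hypothetical helper: Jaccard similarity of the two word sets."""
    a = set(text1.lower().split())
    b = set(text2.lower().split())
    union = a | b
    if not union:
        # Assumption: two empty texts are treated as having no similarity.
        return 0.0
    return len(a & b) / len(union)
```

"cat" and "kitten" share no tokens, so the score is 0.0 even though the words are related in meaning. "the cat sat" and "the cat sits" share two of four distinct tokens, giving 0.5.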
- kerb.evaluation.assess_coherence(text)[source]
Assess the coherence and logical flow of text.
- Parameters:
text (str) – Text to assess
- Returns:
Coherence score and details
- Return type:
Example
>>> result = assess_coherence("First point. Second point follows. Conclusion makes sense.")
>>> result.score > 0.7
True
- kerb.evaluation.assess_fluency(text)[source]
Assess the fluency and naturalness of text.
- Parameters:
text (str) – Text to assess
- Returns:
Fluency score and details
- Return type:
Example
>>> result = assess_fluency("This is a well-written sentence.")
>>> result.score > 0.8
True
- kerb.evaluation.assess_faithfulness(output, source, method='entailment')[source]
Assess whether output is faithful to the source material.
- Parameters:
- Returns:
Faithfulness score (1 = fully faithful, 0 = not faithful)
- Return type:
Examples
>>> from kerb.core.enums import FaithfulnessMethod
>>> result = assess_faithfulness(output, source, method=FaithfulnessMethod.ENTAILMENT)
>>> result.score > 0.7
True
- kerb.evaluation.assess_answer_relevance(answer, question, threshold=0.3)[source]
Assess whether an answer is relevant to the question.
- Parameters:
- Returns:
Relevance score
- Return type:
Example
>>> result = assess_answer_relevance(
...     "Python is a programming language",
...     "What is Python?"
... )
>>> result.score > 0.5
True
- kerb.evaluation.detect_hallucination(output, context, threshold=0.3)[source]
Detect potential hallucinations (unfounded claims not supported by context).
- Parameters:
- Returns:
Hallucination score (0 = no hallucination, 1 = likely hallucination)
- Return type:
Example
>>> result = detect_hallucination(
...     "Paris is the capital of Germany",
...     "Paris is the capital of France"
... )
>>> result.score > 0.5
True
- kerb.evaluation.llm_as_judge(output, criterion, context=None, reference=None, scale=5, llm_function=None)[source]
Use an LLM to judge the quality of an output.
- Parameters:
output (str) – The text to evaluate
criterion (Union[str, JudgmentCriterion]) – Judgment criterion (relevance, accuracy, coherence, etc.)
context (Optional[str]) – Optional context (e.g., the prompt or question)
scale (int) – Rating scale (default: 1-5)
llm_function (Optional[Callable]) – Function to call LLM (should accept prompt and return string)
- Returns:
Result with score and reasoning
- Return type:
Example
>>> result = llm_as_judge(
...     "Python is a programming language",
...     JudgmentCriterion.RELEVANCE,
...     context="What is Python?"
... )
>>> result.score
4.5
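The llm_function parameter is documented as a callable that accepts a prompt and returns a string. The sketch below shows one plausible shape for that round trip: build a judging prompt, call a stand-in model, and parse a numeric rating from the reply. The prompt wording, the "Score:" reply format, and the names build_judge_prompt, parse_score, and fake_llm are all hypothetical; the library's actual prompt and parsing are not documented here.

```python
import re


def build_judge_prompt(output, criterion, context=None, scale=5):
    """Hypothetical helper: assemble a rating prompt for a judge model."""
    lines = [f"Rate the following text for {criterion} on a scale of 1 to {scale}."]
    if context:
        lines.append(f"Context: {context}")
    lines.append(f"Text: {output}")
    lines.append("Reply as 'Score: <number>' followed by a short reason.")
    return "\n".join(lines)


def parse_score(reply, scale=5):
    """Extract the rating from a reply, clamped to [1, scale]; None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None
    return max(1.0, min(float(scale), float(match.group(1))))


def fake_llm(prompt):
    """Stand-in for a real model call: accepts a prompt, returns a string."""
    return "Score: 4\nThe answer addresses the question directly."


prompt = build_judge_prompt(
    "Python is a programming language",
    "relevance",
    context="What is Python?",
)
score = parse_score(fake_llm(prompt))
```

A callable with the same signature as fake_llm, wrapping a real model client, is the kind of object the llm_function argument is described as expecting.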
- kerb.evaluation.pairwise_comparison(output_a, output_b, criterion, context=None, llm_function=None)[source]
Compare two outputs using LLM-as-judge.
- Parameters:
- Returns:
Winner and reasoning
- Return type:
Example
>>> result = pairwise_comparison(
...     "Python is great",
...     "Python is a high-level programming language",
...     "completeness"
... )
>>> result.winner
'b'
- kerb.evaluation.ab_test(outputs_a, outputs_b, evaluation_fn, labels=('A', 'B'))[source]
Perform A/B testing on two sets of outputs.
- Parameters:
- Returns:
A/B test results with statistics
- Return type:
Example
>>> results = ab_test(
...     ["Good answer", "Great answer"],
...     ["OK answer", "Bad answer"],
...     lambda x: len(x.split())
... )
>>> results['winner']
'A'
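One way to picture an A/B comparison of this kind is to score every output with the evaluation function, compare the group means, and compute a Welch t statistic as a rough measure of separation. This is a sketch under those assumptions, not the library's method: ab_sketch, the returned keys, and the tie handling are hypothetical.

```python
import math
import statistics


def ab_sketch(outputs_a, outputs_b, evaluation_fn, labels=("A", "B")):
    """Hypothetical helper: compare two groups of outputs by mean score."""
    scores_a = [evaluation_fn(o) for o in outputs_a]
    scores_b = [evaluation_fn(o) for o in outputs_b]
    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)

    # Sample variances; a single observation has no variance estimate.
    var_a = statistics.variance(scores_a) if len(scores_a) > 1 else 0.0
    var_b = statistics.variance(scores_b) if len(scores_b) > 1 else 0.0

    # Welch's t statistic does not assume equal variances.
    se = math.sqrt(var_a / len(scores_a) + var_b / len(scores_b))
    t_stat = (mean_a - mean_b) / se if se > 0 else None

    if mean_a > mean_b:
        winner = labels[0]
    elif mean_b > mean_a:
        winner = labels[1]
    else:
        winner = None  # Assumption: equal means are reported as a tie.
    return {"winner": winner, "mean_a": mean_a, "mean_b": mean_b, "t_stat": t_stat}
```

Note that a word-count evaluator gives every two-word output the same score, so this sketch would report a tie for inputs like those in the example above; how ab_test() resolves ties is not stated in this reference.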
- kerb.evaluation.compare_outputs(outputs, metrics=None)[source]
Compare multiple outputs using various metrics.
- Parameters:
- Returns:
Comparison results with rankings
- Return type:
Example
>>> results = compare_outputs([
...     ("v1", "Short answer"),
...     ("v2", "This is a longer and more detailed answer")
... ])
>>> results['rankings']['length'][0]
'v2'
- kerb.evaluation.run_benchmark(test_cases, generation_fn, evaluation_fn, threshold=0.7, name='benchmark')[source]
Run a benchmark on a set of test cases.
- Parameters:
- Returns:
Benchmark results
- Return type:
Example
>>> cases = [TestCase(id="1", input="What is AI?", expected_output="Artificial Intelligence")]
>>> result = run_benchmark(cases, lambda x: "AI means " + x, lambda o, e: 0.8)
>>> result.pass_rate
100.0
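The benchmark flow implied by the signature (generate an output per test case, score it against the expected output, compare the score with the threshold) can be sketched as follows. Case is a hypothetical stand-in for TestCase, run_benchmark_sketch is a hypothetical name, and treating a score equal to the threshold as a pass is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    """Hypothetical stand-in for a test case with an optional expected output."""
    id: str
    input: str
    expected_output: Optional[str] = None


def run_benchmark_sketch(cases, generation_fn, evaluation_fn, threshold=0.7):
    """Hypothetical helper: score every case and summarize the results."""
    start = time.perf_counter()
    scores = []
    for case in cases:
        output = generation_fn(case.input)
        scores.append(evaluation_fn(output, case.expected_output))

    # Assumption: a score at or above the threshold counts as a pass.
    passed = sum(1 for s in scores if s >= threshold)
    total = len(scores)
    return {
        "total_tests": total,
        "passed_tests": passed,
        "failed_tests": total - passed,
        "average_score": sum(scores) / total if total else 0.0,
        "pass_rate": 100.0 * passed / total if total else 0.0,
        "scores": scores,
        "execution_time": time.perf_counter() - start,
    }


cases = [Case(id="1", input="What is AI?", expected_output="Artificial Intelligence")]
summary = run_benchmark_sketch(cases, lambda x: "AI means " + x, lambda o, e: 0.8)
```

With a constant score of 0.8 and the default threshold of 0.7, the single case passes and the pass rate is 100.0, in line with the example above.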
- kerb.evaluation.benchmark_prompts(prompts, test_inputs, generation_fn, evaluation_fn)[source]
Benchmark multiple prompts against test inputs.
- Parameters:
- Returns:
Benchmark results for each prompt
- Return type:
Example
>>> results = benchmark_prompts(
...     [("v1", "Answer: {input}"), ("v2", "Detailed answer: {input}")],
...     ["What is AI?", "What is ML?"],
...     lambda p, i: p.format(input=i),
...     lambda o: len(o.split()) / 10
... )
>>> len(results)
2
- kerb.evaluation.calculate_statistics(scores)[source]
Calculate statistical measures for a list of scores.
- Parameters:
- Returns:
Statistical measures (mean, median, stdev, min, max, percentiles)
- Return type:
Example
>>> stats = calculate_statistics([0.5, 0.7, 0.8, 0.9, 1.0])
>>> stats['mean']
0.78
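The basic measures listed above can be computed with the standard library's statistics module. This sketch covers only mean, median, standard deviation, minimum, and maximum; stats_sketch and its dictionary keys are hypothetical, and percentile handling is omitted because the library's method is not specified here.

```python
import statistics


def stats_sketch(scores):
    """Hypothetical helper: basic summary statistics for a list of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


summary = stats_sketch([0.5, 0.7, 0.8, 0.9, 1.0])
```

For the five scores in the example, the values sum to 3.9, so the mean is about 0.78 and the median is 0.8.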
- kerb.evaluation.confidence_interval(scores, confidence=0.95)[source]
Calculate confidence interval for scores.
- Parameters:
- Returns:
(lower_bound, upper_bound)
- Return type:
Example
>>> lower, upper = confidence_interval([0.7, 0.8, 0.9])
>>> lower < 0.8 < upper
True
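A common way to obtain such bounds is the normal approximation: the mean plus or minus a z-value times the standard error. The sketch below assumes that approach; whether the library uses a normal or a t-distribution is not stated in this reference, and ci_sketch is a hypothetical name.

```python
import math
import statistics
from statistics import NormalDist


def ci_sketch(scores, confidence=0.95):
    """Hypothetical helper: normal-approximation confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


lower, upper = ci_sketch([0.7, 0.8, 0.9])
```

For very small samples such as this one, a t-distribution would give a wider interval than the normal approximation used here.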
Metrics and benchmarking tools for LLM outputs.