Testing Module
Testing utilities for LLM applications.
This module provides comprehensive testing tools for LLM development:
- Mock LLM Responses:
  - MockLLM - Mock LLM provider with configurable responses
  - MockStreamingLLM - Mock streaming LLM responses
  - create_mock_llm() - Helper to create mock LLM instances
- Response Fixtures:
  - PromptFixture - Fixture for prompt-response pairs
  - ResponseFixture - Fixture for deterministic responses
  - FixtureManager - Manage and organize test fixtures
  - load_fixtures() - Load fixtures from file
  - save_fixtures() - Save fixtures to file
- Deterministic Testing:
  - DeterministicResponseGenerator - Generate consistent test responses
  - SeededResponseGenerator - Seeded random responses
  - PatternResponseGenerator - Pattern-based responses
- Response Recording:
  - ResponseRecorder - Record actual LLM responses
  - RecordingSession - Context manager for recording sessions
  - replay_responses() - Replay recorded responses
- Dataset Management:
  - TestDataset - Dataset for evaluation testing
  - create_dataset() - Create test datasets
  - load_dataset() - Load datasets from various formats
  - split_dataset() - Split into train/val/test
  - augment_dataset() - Augment datasets with variations
- Prompt Testing:
  - PromptTestCase - Single prompt test case
  - PromptTestSuite - Collection of prompt tests
  - run_prompt_regression() - Run regression tests on prompts
  - compare_prompt_versions() - Compare prompt versions
- Assertion Helpers:
  - assert_response_contains() - Check if response contains text
  - assert_response_matches() - Check regex match
  - assert_response_json() - Validate JSON response
  - assert_response_length() - Check response length
  - assert_response_quality() - Quality assertions
  - assert_no_hallucination() - Check for hallucinations
  - assert_safety_compliance() - Check safety guidelines
- Output Validators:
  - validate_json_schema() - Validate JSON against schema
  - validate_code_syntax() - Validate code syntax
  - validate_format() - Validate output format
  - validate_consistency() - Check consistency across generations
- Diff and Comparison:
  - diff_responses() - Diff two responses
  - compare_responses() - Compare multiple responses
  - highlight_differences() - Highlight key differences
- Snapshot Testing:
  - SnapshotManager - Manage response snapshots
  - create_snapshot() - Create snapshot from response
  - compare_snapshot() - Compare against snapshot
  - update_snapshot() - Update existing snapshot
- Test Doubles:
  - StubEmbedding - Stub embedding model
  - StubRetriever - Stub retrieval system
  - StubVectorStore - Stub vector store
  - create_test_double() - Factory for test doubles
- Performance Testing:
  - measure_latency() - Measure response latency
  - measure_throughput() - Measure throughput
  - benchmark_prompts() - Benchmark prompt performance
  - PerformanceReport - Performance test report
- Cost Tracking:
  - CostTracker - Track testing costs
  - estimate_test_cost() - Estimate cost before running
  - get_cost_report() - Get cost breakdown
- Utilities:
  - seed_randomness() - Set random seed for reproducibility
  - capture_warnings() - Capture warning messages
  - isolate_test() - Isolation context manager
  - cleanup_resources() - Clean up test resources
- Data Classes:
  - MockResponse - Mock LLM response
  - TestCase - Test case definition
  - TestResult - Test execution result
  - FixtureData - Fixture data container
- Enums:
  - MockBehavior - Mock behavior modes
  - FixtureFormat - Fixture file formats
- class kerb.testing.MockBehavior(*values)
Bases: Enum
Behavior modes for mock LLM.
- FIXED = 'fixed'
- SEQUENTIAL = 'sequential'
- RANDOM = 'random'
- PATTERN = 'pattern'
- CALLABLE = 'callable'
- class kerb.testing.FixtureFormat(*values)
Bases: Enum
Supported fixture file formats.
- JSON = 'json'
- JSONL = 'jsonl'
- CSV = 'csv'
- YAML = 'yaml'
- class kerb.testing.MockResponse(content, model='mock-model', finish_reason='stop', prompt_tokens=0, completion_tokens=0, latency=0.1, metadata=<factory>)
Bases: object
Mock LLM response.
- __init__(content, model='mock-model', finish_reason='stop', prompt_tokens=0, completion_tokens=0, latency=0.1, metadata=<factory>)
- class kerb.testing.TestCase(id, prompt, expected_output=None, expected_patterns=None, metadata=<factory>, validation_fn=None)
Bases: object
Test case definition.
- __init__(id, prompt, expected_output=None, expected_patterns=None, metadata=<factory>, validation_fn=None)
- class kerb.testing.TestResult(test_id, passed, actual_output, expected_output=None, error=None, latency=0.0, timestamp=<factory>, metadata=<factory>)
Bases: object
Test execution result.
- __init__(test_id, passed, actual_output, expected_output=None, error=None, latency=0.0, timestamp=<factory>, metadata=<factory>)
- class kerb.testing.FixtureData(prompt, response, metadata=<factory>, tags=<factory>, created_at=<factory>)
Bases: object
Container for fixture data.
- __init__(prompt, response, metadata=<factory>, tags=<factory>, created_at=<factory>)
- class kerb.testing.PromptFixture(id, prompt, expected_response, variables=<factory>, metadata=<factory>)
Bases: object
Fixture for prompt-response pairs.
- __init__(id, prompt, expected_response, variables=<factory>, metadata=<factory>)
- class kerb.testing.ResponseFixture(pattern, response, response_type='exact', metadata=<factory>)
Bases: object
Fixture for deterministic responses.
- __init__(pattern, response, response_type='exact', metadata=<factory>)
- class kerb.testing.PromptTestCase(name, prompt_template, test_inputs, expected_outputs=None, validators=<factory>, metadata=<factory>)
Bases: object
Prompt test case for regression testing.
- __init__(name, prompt_template, test_inputs, expected_outputs=None, validators=<factory>, metadata=<factory>)
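The fields above suggest a template that is rendered once per entry in test_inputs and then checked by each validator. The following is a minimal sketch of that loop, not kerb's implementation: it assumes str.format-style placeholders and validators that take the output string and return a bool, and the names run_prompt_case and shout are hypothetical.

```python
from typing import Callable, Dict, List


def run_prompt_case(
    prompt_template: str,
    test_inputs: List[Dict[str, str]],
    llm: Callable[[str], str],
    validators: List[Callable[[str], bool]],
) -> List[bool]:
    """Return one pass/fail flag per input (hypothetical helper)."""
    outcomes = []
    for variables in test_inputs:
        # Assumption: the template uses str.format placeholders such as {word}.
        prompt = prompt_template.format(**variables)
        output = llm(prompt)
        # A case passes only if every validator accepts the output.
        outcomes.append(all(check(output) for check in validators))
    return outcomes


def shout(prompt: str) -> str:
    """Toy stand-in for a model call: upper-cases the prompt."""
    return prompt.upper()


outcomes = run_prompt_case(
    "Translate {word}",
    [{"word": "cat"}, {"word": ""}],
    shout,
    [lambda out: len(out.split()) == 2],
)
```

Here the second input renders to a prompt with a single word, so the two-word validator rejects it.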
- class kerb.testing.SnapshotData(name, content, hash, created_at=<factory>, metadata=<factory>)
Bases: object
Snapshot data for snapshot testing.
- __init__(name, content, hash, created_at=<factory>, metadata=<factory>)
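Storing a hash next to the content allows a cheap equality check between a stored snapshot and a fresh response. The sketch below illustrates the idea with SHA-256 from the standard library. The hash algorithm kerb uses is not stated in this reference, so SHA-256 and the helper names content_hash and matches_snapshot are assumptions.

```python
import hashlib


def content_hash(content: str) -> str:
    """Hex digest of the UTF-8 encoded content (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def matches_snapshot(stored_hash: str, new_content: str) -> bool:
    """True when the new content hashes to the stored value."""
    return content_hash(new_content) == stored_hash


stored = content_hash("Paris is the capital of France.")
unchanged = matches_snapshot(stored, "Paris is the capital of France.")
changed = matches_snapshot(stored, "Paris is the capital of france.")
```

An exact hash comparison flags any change, including case or whitespace, so it suits deterministic or mocked responses better than live model output.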
- class kerb.testing.PerformanceMetrics(total_requests, total_latency, avg_latency, min_latency, max_latency, p50_latency, p95_latency, p99_latency, throughput, tokens_per_second, metadata=<factory>)
Bases: object
Performance metrics for testing.
- __init__(total_requests, total_latency, avg_latency, min_latency, max_latency, p50_latency, p95_latency, p99_latency, throughput, tokens_per_second, metadata=<factory>)
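As a rough illustration of how fields like these could be derived from raw latency samples, the sketch below uses nearest-rank percentiles. How kerb actually computes them is not documented here: the percentile method, the assumption that requests run sequentially (so throughput is requests divided by total latency), and the function names are all hypothetical.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty, ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latencies(latencies: List[float], total_tokens: int) -> Dict[str, float]:
    """Aggregate per-request latencies (seconds) into summary metrics."""
    if not latencies:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies)
    total = sum(ordered)
    return {
        "total_requests": len(ordered),
        "total_latency": total,
        "avg_latency": total / len(ordered),
        "min_latency": ordered[0],
        "max_latency": ordered[-1],
        "p50_latency": percentile(ordered, 50),
        "p95_latency": percentile(ordered, 95),
        "p99_latency": percentile(ordered, 99),
        # Assumption: sequential requests, so wall time equals total latency.
        "throughput": len(ordered) / total if total > 0 else 0.0,
        "tokens_per_second": total_tokens / total if total > 0 else 0.0,
    }


metrics = summarize_latencies([4.0, 1.0, 3.0, 2.0], total_tokens=100)
```

With only a handful of samples, p95 and p99 collapse to the maximum, which is expected for nearest-rank percentiles.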
- class kerb.testing.CostReport(total_cost, total_tokens, total_requests, cost_by_model, tokens_by_model, timestamp=<factory>)
Bases: object
Cost tracking report.
- __init__(total_cost, total_tokens, total_requests, cost_by_model, tokens_by_model, timestamp=<factory>)
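The per-model breakdown in these fields can be pictured as a simple aggregation over usage records. The sketch below is illustrative only: the record shape (model name, token count), the per-1,000-token pricing, the model names, and the prices are all made up for the example and do not come from kerb.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_cost_report(
    usage: List[Tuple[str, int]],
    price_per_1k_tokens: Dict[str, float],
) -> Dict[str, object]:
    """Aggregate (model, tokens) records into totals and per-model breakdowns."""
    cost_by_model: Dict[str, float] = defaultdict(float)
    tokens_by_model: Dict[str, int] = defaultdict(int)
    for model, tokens in usage:
        tokens_by_model[model] += tokens
        cost_by_model[model] += tokens / 1000 * price_per_1k_tokens[model]
    return {
        "total_cost": sum(cost_by_model.values()),
        "total_tokens": sum(tokens_by_model.values()),
        "total_requests": len(usage),
        "cost_by_model": dict(cost_by_model),
        "tokens_by_model": dict(tokens_by_model),
    }


# Hypothetical models and prices, chosen only to keep the arithmetic simple.
report = build_cost_report(
    [("model-a", 1000), ("model-b", 500), ("model-a", 3000)],
    {"model-a": 0.5, "model-b": 2.0},
)
```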
- class kerb.testing.MockLLM(responses=None, behavior=MockBehavior.FIXED, default_response='Mock response', latency=0.1, token_calculator=None)
Bases: object
Mock LLM provider with configurable responses.
This class provides a drop-in replacement for real LLM providers, useful for testing without making actual API calls.
- __init__(responses=None, behavior=MockBehavior.FIXED, default_response='Mock response', latency=0.1, token_calculator=None)
Initialize mock LLM.
- Parameters:
  - responses (Union[str, List[str], Dict[str, str], None]) – Response(s) to return
  - behavior (MockBehavior) – Behavior mode for returning responses
  - default_response (str) – Default response when no match found
  - latency (float) – Simulated latency per response
  - token_calculator (Optional[Callable[[str], int]]) – Function to calculate token counts
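To make the interaction between responses, behavior, and default_response concrete, here is a self-contained stand-in that mimics three of the documented modes. It is a sketch under assumptions, not kerb's MockLLM: the generate method name, the cycling in sequential mode, and the regex matching of dictionary keys in pattern mode are guesses about plausible behavior.

```python
import re
from enum import Enum
from typing import Dict, List, Union


class Behavior(Enum):
    """Stand-in for three of the documented MockBehavior values."""
    FIXED = "fixed"
    SEQUENTIAL = "sequential"
    PATTERN = "pattern"


class SketchMockLLM:
    """Hypothetical mock; illustrates the idea, not the library's code."""

    def __init__(
        self,
        responses: Union[str, List[str], Dict[str, str], None] = None,
        behavior: Behavior = Behavior.FIXED,
        default_response: str = "Mock response",
    ) -> None:
        self.responses = responses
        self.behavior = behavior
        self.default_response = default_response
        self._index = 0  # position in the list for sequential mode

    def generate(self, prompt: str) -> str:  # method name is an assumption
        if self.behavior is Behavior.FIXED:
            if isinstance(self.responses, str):
                return self.responses
            if isinstance(self.responses, list) and self.responses:
                return self.responses[0]
        elif self.behavior is Behavior.SEQUENTIAL:
            if isinstance(self.responses, list) and self.responses:
                # Assumption: wrap around once the list is exhausted.
                reply = self.responses[self._index % len(self.responses)]
                self._index += 1
                return reply
        elif self.behavior is Behavior.PATTERN:
            if isinstance(self.responses, dict):
                # Assumption: keys are regular expressions searched in the prompt.
                for pattern, reply in self.responses.items():
                    if re.search(pattern, prompt):
                        return reply
        return self.default_response


sequential = SketchMockLLM(["first", "second"], Behavior.SEQUENTIAL)
replies = [sequential.generate("any prompt") for _ in range(3)]

by_pattern = SketchMockLLM({r"weather": "Sunny"}, Behavior.PATTERN)
matched = by_pattern.generate("What is the weather today?")
fallback = by_pattern.generate("Tell me a joke")
```

Falling back to default_response whenever nothing matches keeps tests from failing on unexpected prompts, at the cost of hiding them; a stricter mock could raise instead.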
- class kerb.testing.MockStreamingLLM(response, chunk_size=10, delay_per_chunk=0.01)
Bases: object
Mock streaming LLM for testing streaming responses.
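The chunk_size and delay_per_chunk parameters suggest the response is emitted in fixed-size pieces with a pause between them. A minimal generator sketch of that idea follows. Whether kerb measures chunk_size in characters or tokens is not stated here, so character-based slicing is an assumption, and stream_chunks is a hypothetical name.

```python
import time
from typing import Iterator


def stream_chunks(
    response: str,
    chunk_size: int = 10,
    delay_per_chunk: float = 0.0,
) -> Iterator[str]:
    """Yield the response in character slices of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(response), chunk_size):
        if delay_per_chunk > 0:
            time.sleep(delay_per_chunk)  # simulate network or generation delay
        yield response[start:start + chunk_size]


chunks = list(stream_chunks("The quick brown fox jumps", chunk_size=10))
```

Joining the chunks reproduces the original string, which is a useful invariant to assert when testing code that consumes a stream.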
- kerb.testing.create_mock_llm(responses, behavior=MockBehavior.FIXED, **kwargs)
Helper to create a mock LLM instance.
- class kerb.testing.FixtureManager(fixtures_dir=None)
Bases: object
Manage and organize test fixtures.
- kerb.testing.load_fixtures(filepath)
Load fixtures from a file.
- Parameters:
  - filepath (Path) – Path to fixture file
- Returns:
Dictionary of fixture name to FixtureData
- kerb.testing.save_fixtures(fixtures, filepath, format=FixtureFormat.JSON)
Save fixtures to a file.
- Parameters:
  - fixtures (Dict[str, FixtureData]) – Fixtures to save
  - filepath (Path) – Output filepath
  - format (FixtureFormat) – Output format
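A save and load pair like this is typically used as a round trip: write a name-to-fixture mapping once, then reload it in later test runs. The sketch below shows that round trip for the JSON case using only the standard library. The on-disk layout kerb uses is not documented here, so the JSON structure, the reduced SketchFixture dataclass, and the helper names are assumptions.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class SketchFixture:
    """Hypothetical stand-in with a subset of FixtureData's fields."""
    prompt: str
    response: str
    metadata: dict = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)


def save_json(fixtures: Dict[str, SketchFixture], filepath: Path) -> None:
    """Write fixtures as a JSON object keyed by fixture name."""
    payload = {name: asdict(fixture) for name, fixture in fixtures.items()}
    filepath.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_json(filepath: Path) -> Dict[str, SketchFixture]:
    """Read the JSON object back into fixture instances."""
    payload = json.loads(filepath.read_text(encoding="utf-8"))
    return {name: SketchFixture(**data) for name, data in payload.items()}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixtures.json"
    original = {"greeting": SketchFixture("Say hi", "Hi!", tags=["smoke"])}
    save_json(original, path)
    restored = load_json(path)
```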
- class kerb.testing.TestDataset(name, examples=None)
Bases: object
Dataset for evaluation testing.
- kerb.testing.load_dataset(filepath)
Load dataset from file.
- Parameters:
  - filepath (Path) – Path to dataset file
- Returns:
TestDataset instance
- kerb.testing.assert_response_contains(response, expected, case_sensitive=False)
Assert that response contains expected text.
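Because case_sensitive defaults to False, the check presumably ignores case unless told otherwise. A minimal stand-in showing that behavior is sketched below. It is not the library's code: the use of casefold and the error message are assumptions, and sketch_assert_contains is a hypothetical name.

```python
def sketch_assert_contains(
    response: str,
    expected: str,
    case_sensitive: bool = False,
) -> None:
    """Raise AssertionError unless expected occurs in response."""
    haystack, needle = response, expected
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for caseless matching.
        haystack, needle = response.casefold(), expected.casefold()
    if needle not in haystack:
        raise AssertionError(f"Expected {expected!r} in response {response!r}")


# Passes: the default comparison ignores case.
sketch_assert_contains("Hello World", "hello")

# Fails when case sensitivity is requested.
try:
    sketch_assert_contains("Hello World", "hello", case_sensitive=True)
    strict_check_failed = False
except AssertionError:
    strict_check_failed = True
```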
- kerb.testing.assert_response_json(response, expected_schema=None)
Assert that response is valid JSON.
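The core of such an assertion is parsing the response and turning a parse failure into an assertion failure. The sketch below does that with json.loads. The format of expected_schema is not described in this reference, so the example substitutes a much simpler, hypothetical check (a mapping from required top-level keys to Python types) rather than guessing at the real schema handling.

```python
import json
from typing import Any, Dict, Optional


def sketch_assert_json(
    response: str,
    expected_types: Optional[Dict[str, type]] = None,
) -> Any:
    """Parse response as JSON, optionally checking top-level keys and types."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"Response is not valid JSON: {exc}") from exc
    if expected_types is not None:
        # Assumption: a key-to-type mapping stands in for a real schema.
        if not isinstance(data, dict):
            raise AssertionError("Expected a JSON object at the top level")
        for key, expected_type in expected_types.items():
            if key not in data:
                raise AssertionError(f"Missing key: {key!r}")
            if not isinstance(data[key], expected_type):
                raise AssertionError(
                    f"Key {key!r} should be {expected_type.__name__}"
                )
    return data


parsed = sketch_assert_json('{"name": "Ada", "age": 36}', {"name": str, "age": int})
```

Returning the parsed value lets a test continue with further checks on the data without parsing twice.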
- kerb.testing.assert_response_quality(response, min_words=None, no_repetition=False, no_empty_lines=False)
Assert response quality metrics.
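The three options can be read as independent heuristics applied to the response text. The sketch below gives one possible interpretation of each. The definitions are assumptions, not kerb's: words are whitespace-separated tokens, repetition means the same sentence appearing more than once, and an empty line is any line containing only whitespace.

```python
import re
from typing import Optional


def sketch_assert_quality(
    response: str,
    min_words: Optional[int] = None,
    no_repetition: bool = False,
    no_empty_lines: bool = False,
) -> None:
    """Raise AssertionError when a requested heuristic fails."""
    if min_words is not None and len(response.split()) < min_words:
        raise AssertionError(f"Response has fewer than {min_words} words")
    if no_repetition:
        # Assumption: split on sentence punctuation and compare caselessly.
        sentences = [
            part.strip().casefold()
            for part in re.split(r"[.!?]+", response)
            if part.strip()
        ]
        if len(sentences) != len(set(sentences)):
            raise AssertionError("Response repeats a sentence")
    if no_empty_lines and any(not line.strip() for line in response.splitlines()):
        raise AssertionError("Response contains an empty line")


# Passes all three heuristics.
sketch_assert_quality(
    "The sky is blue. Grass is green.",
    min_words=5,
    no_repetition=True,
    no_empty_lines=True,
)
```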