Testing Module

Testing utilities for LLM applications.

This module provides comprehensive testing tools for LLM development:

Mock LLM Responses:

  • MockLLM - Mock LLM provider with configurable responses
  • MockStreamingLLM - Mock streaming LLM responses
  • create_mock_llm() - Helper to create mock LLM instances

Response Fixtures:

  • PromptFixture - Fixture for prompt-response pairs
  • ResponseFixture - Fixture for deterministic responses
  • FixtureManager - Manage and organize test fixtures
  • load_fixtures() - Load fixtures from file
  • save_fixtures() - Save fixtures to file

Deterministic Testing:

  • DeterministicResponseGenerator - Generate consistent test responses
  • SeededResponseGenerator - Seeded random responses
  • PatternResponseGenerator - Pattern-based responses

Response Recording:

  • ResponseRecorder - Record actual LLM responses
  • RecordingSession - Context manager for recording sessions
  • replay_responses() - Replay recorded responses

Dataset Management:

  • TestDataset - Dataset for evaluation testing
  • create_dataset() - Create test datasets
  • load_dataset() - Load datasets from various formats
  • split_dataset() - Split into train/val/test
  • augment_dataset() - Augment datasets with variations

Prompt Testing:

  • PromptTestCase - Single prompt test case
  • PromptTestSuite - Collection of prompt tests
  • run_prompt_regression() - Run regression tests on prompts
  • compare_prompt_versions() - Compare prompt versions

Assertion Helpers:

  • assert_response_contains() - Check if response contains text
  • assert_response_matches() - Check regex match
  • assert_response_json() - Validate JSON response
  • assert_response_length() - Check response length
  • assert_response_quality() - Quality assertions
  • assert_no_hallucination() - Check for hallucinations
  • assert_safety_compliance() - Check safety guidelines

Output Validators:

  • validate_json_schema() - Validate JSON against schema
  • validate_code_syntax() - Validate code syntax
  • validate_format() - Validate output format
  • validate_consistency() - Check consistency across generations

Diff and Comparison:

  • diff_responses() - Diff two responses
  • compare_responses() - Compare multiple responses
  • highlight_differences() - Highlight key differences

Snapshot Testing:

  • SnapshotManager - Manage response snapshots
  • create_snapshot() - Create snapshot from response
  • compare_snapshot() - Compare against snapshot
  • update_snapshot() - Update existing snapshot

Test Doubles:

  • StubEmbedding - Stub embedding model
  • StubRetriever - Stub retrieval system
  • StubVectorStore - Stub vector store
  • create_test_double() - Factory for test doubles

Performance Testing:

  • measure_latency() - Measure response latency
  • measure_throughput() - Measure throughput
  • benchmark_prompts() - Benchmark prompt performance
  • PerformanceReport - Performance test report

Cost Tracking:

  • CostTracker - Track testing costs
  • estimate_test_cost() - Estimate cost before running
  • get_cost_report() - Get cost breakdown

Utilities:

  • seed_randomness() - Set random seed for reproducibility
  • capture_warnings() - Capture warning messages
  • isolate_test() - Isolation context manager
  • cleanup_resources() - Clean up test resources

Data Classes:

  • MockResponse - Mock LLM response
  • TestCase - Test case definition
  • TestResult - Test execution result
  • FixtureData - Fixture data container

Enums:

  • MockBehavior - Mock behavior modes
  • FixtureFormat - Fixture file formats

class kerb.testing.MockBehavior(*values)

Bases: Enum

Behavior modes for mock LLM.

FIXED = 'fixed'
SEQUENTIAL = 'sequential'
RANDOM = 'random'
PATTERN = 'pattern'
CALLABLE = 'callable'

class kerb.testing.FixtureFormat(*values)

Bases: Enum

Supported fixture file formats.

JSON = 'json'
JSONL = 'jsonl'
CSV = 'csv'
YAML = 'yaml'

class kerb.testing.MockResponse(content, model='mock-model', finish_reason='stop', prompt_tokens=0, completion_tokens=0, latency=0.1, metadata=<factory>)

Bases: object

Mock LLM response.

content: str
model: str = 'mock-model'
finish_reason: str = 'stop'
prompt_tokens: int = 0
completion_tokens: int = 0
latency: float = 0.1
metadata: Dict[str, Any]

to_generation_response()

Convert to GenerationResponse format.

__init__(content, model='mock-model', finish_reason='stop', prompt_tokens=0, completion_tokens=0, latency=0.1, metadata=<factory>)
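
The `metadata=<factory>` default in the signature above is how autodoc renders a dataclass field built with a default factory. The sketch below is a stand-in that mirrors the documented fields, assuming the factory is `dict`; `MockResponseSketch` is a hypothetical name and this is not kerb's source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class MockResponseSketch:
    """Stand-in mirroring the documented MockResponse fields (not kerb's code)."""

    content: str
    model: str = "mock-model"
    finish_reason: str = "stop"
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency: float = 0.1
    # Assumption: "<factory>" corresponds to default_factory=dict.
    metadata: Dict[str, Any] = field(default_factory=dict)


first = MockResponseSketch(content="Hello")
second = MockResponseSketch(content="World", metadata={"case": "greeting"})

# Each instance gets its own metadata dict, so this mutation does not leak.
first.metadata["seen"] = True
```

Because the default is produced per instance, `second.metadata` and any later instance are unaffected by the change to `first.metadata`.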

class kerb.testing.TestCase(id, prompt, expected_output=None, expected_patterns=None, metadata=<factory>, validation_fn=None)

Bases: object

Test case definition.

id: str
prompt: str
expected_output: str | None = None
expected_patterns: List[str] | None = None
metadata: Dict[str, Any]
validation_fn: Callable | None = None
__init__(id, prompt, expected_output=None, expected_patterns=None, metadata=<factory>, validation_fn=None)

class kerb.testing.TestResult(test_id, passed, actual_output, expected_output=None, error=None, latency=0.0, timestamp=<factory>, metadata=<factory>)

Bases: object

Test execution result.

test_id: str
passed: bool
actual_output: str
expected_output: str | None = None
error: str | None = None
latency: float = 0.0
timestamp: str
metadata: Dict[str, Any]
__init__(test_id, passed, actual_output, expected_output=None, error=None, latency=0.0, timestamp=<factory>, metadata=<factory>)

class kerb.testing.FixtureData(prompt, response, metadata=<factory>, tags=<factory>, created_at=<factory>)

Bases: object

Container for fixture data.

prompt: str
response: str
metadata: Dict[str, Any]
tags: List[str]
created_at: str
__init__(prompt, response, metadata=<factory>, tags=<factory>, created_at=<factory>)

class kerb.testing.PromptFixture(id, prompt, expected_response, variables=<factory>, metadata=<factory>)

Bases: object

Fixture for prompt-response pairs.

id: str
prompt: str
expected_response: str
variables: Dict[str, Any]
metadata: Dict[str, Any]
__init__(id, prompt, expected_response, variables=<factory>, metadata=<factory>)

class kerb.testing.ResponseFixture(pattern, response, response_type='exact', metadata=<factory>)

Bases: object

Fixture for deterministic responses.

pattern: str
response: str
response_type: str = 'exact'
metadata: Dict[str, Any]
__init__(pattern, response, response_type='exact', metadata=<factory>)

class kerb.testing.PromptTestCase(name, prompt_template, test_inputs, expected_outputs=None, validators=<factory>, metadata=<factory>)

Bases: object

Prompt test case for regression testing.

name: str
prompt_template: str
test_inputs: List[Dict[str, Any]]
expected_outputs: List[str] | None = None
validators: List[Callable]
metadata: Dict[str, Any]
__init__(name, prompt_template, test_inputs, expected_outputs=None, validators=<factory>, metadata=<factory>)

class kerb.testing.SnapshotData(name, content, hash, created_at=<factory>, metadata=<factory>)

Bases: object

Snapshot data for snapshot testing.

name: str
content: str
hash: str
created_at: str
metadata: Dict[str, Any]
__init__(name, content, hash, created_at=<factory>, metadata=<factory>)
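
SnapshotData stores a `hash` next to the `content`, which suggests a snapshot comparison can be reduced to a hash comparison. The sketch below shows that idea only. It assumes SHA-256 over the UTF-8 bytes, which this reference does not confirm, and `content_hash` and `matches_snapshot` are hypothetical helper names.

```python
import hashlib


def content_hash(content: str) -> str:
    """Hex digest of the content.

    Assumption: SHA-256 over UTF-8 bytes; the library may hash differently.
    """
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def matches_snapshot(stored_hash: str, new_content: str) -> bool:
    """Return True when the new content hashes to the stored value."""
    return stored_hash == content_hash(new_content)


stored = content_hash("The answer is 42.")
unchanged = matches_snapshot(stored, "The answer is 42.")
changed = matches_snapshot(stored, "The answer is 43.")
```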

class kerb.testing.PerformanceMetrics(total_requests, total_latency, avg_latency, min_latency, max_latency, p50_latency, p95_latency, p99_latency, throughput, tokens_per_second, metadata=<factory>)

Bases: object

Performance metrics for testing.

total_requests: int
total_latency: float
avg_latency: float
min_latency: float
max_latency: float
p50_latency: float
p95_latency: float
p99_latency: float
throughput: float
tokens_per_second: float
metadata: Dict[str, Any]
__init__(total_requests, total_latency, avg_latency, min_latency, max_latency, p50_latency, p95_latency, p99_latency, throughput, tokens_per_second, metadata=<factory>)
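
To make the PerformanceMetrics fields concrete, the sketch below derives values with the same names from a list of latency samples. It is not how kerb computes them. It assumes nearest-rank percentiles and sequential requests (so elapsed time is the sum of latencies), and `percentile` and `summarize_latencies` are hypothetical helpers.

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_latencies(latencies: List[float], total_tokens: int) -> Dict[str, float]:
    """Compute values named after the PerformanceMetrics fields."""
    if not latencies:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies)
    total = sum(ordered)
    return {
        "total_requests": len(ordered),
        "total_latency": total,
        "avg_latency": total / len(ordered),
        "min_latency": ordered[0],
        "max_latency": ordered[-1],
        "p50_latency": percentile(ordered, 50),
        "p95_latency": percentile(ordered, 95),
        "p99_latency": percentile(ordered, 99),
        # Assumption: requests ran one after another, so elapsed time
        # equals the latency sum. Guard against a zero total.
        "throughput": len(ordered) / total if total > 0 else 0.0,
        "tokens_per_second": total_tokens / total if total > 0 else 0.0,
    }


metrics = summarize_latencies([0.25, 0.5, 0.75, 1.5], total_tokens=120)
```

With only four samples, the p95 and p99 values both fall on the slowest request, which is typical of small benchmarks and a reason to collect more samples before reading tail latencies.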

class kerb.testing.CostReport(total_cost, total_tokens, total_requests, cost_by_model, tokens_by_model, timestamp=<factory>)

Bases: object

Cost tracking report.

total_cost: float
total_tokens: int
total_requests: int
cost_by_model: Dict[str, float]
tokens_by_model: Dict[str, int]
timestamp: str
__init__(total_cost, total_tokens, total_requests, cost_by_model, tokens_by_model, timestamp=<factory>)
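
The per-model dictionaries in CostReport can be read as simple aggregations over individual requests. The sketch below builds values with the same field names from `(model, tokens, cost)` records. The record layout and the `build_cost_summary` helper are assumptions made for illustration, not part of kerb.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_cost_summary(records: List[Tuple[str, int, float]]) -> Dict[str, object]:
    """Aggregate (model, tokens, cost) records into CostReport-like totals."""
    cost_by_model: Dict[str, float] = defaultdict(float)
    tokens_by_model: Dict[str, int] = defaultdict(int)
    for model, tokens, cost in records:
        cost_by_model[model] += cost
        tokens_by_model[model] += tokens
    return {
        "total_cost": sum(cost_by_model.values()),
        "total_tokens": sum(tokens_by_model.values()),
        "total_requests": len(records),
        "cost_by_model": dict(cost_by_model),
        "tokens_by_model": dict(tokens_by_model),
    }


# Model names and costs are made-up sample data.
summary = build_cost_summary([
    ("model-a", 100, 0.25),
    ("model-b", 50, 0.5),
    ("model-a", 300, 0.75),
])
```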

class kerb.testing.MockLLM(responses=None, behavior=MockBehavior.FIXED, default_response='Mock response', latency=0.1, token_calculator=None)

Bases: object

Mock LLM provider with configurable responses.

This class provides a drop-in replacement for real LLM providers, useful for testing without making actual API calls.

__init__(responses=None, behavior=MockBehavior.FIXED, default_response='Mock response', latency=0.1, token_calculator=None)

Initialize mock LLM.

Parameters:
  • responses (Union[str, List[str], Dict[str, str], None]) – Response(s) to return

  • behavior (MockBehavior) – Behavior mode for returning responses

  • default_response (str) – Default response when no match found

  • latency (float) – Simulated latency per response

  • token_calculator (Optional[Callable[[str], int]]) – Function to calculate token counts

generate(prompt, **kwargs)

Generate a mock response.

Parameters:
  • prompt (Union[str, List[Dict[str, str]]]) – Input prompt (string or message list)

  • **kwargs – Additional generation parameters (ignored)

Return type:

MockResponse

Returns:

MockResponse object

reset()

Reset call count and history.

Return type:

None

get_last_call()

Get the last call made to the mock.

Return type:

Optional[Dict[str, Any]]

assert_called()

Assert that the mock was called at least once.

Return type:

None

assert_called_with(prompt_contains)

Assert that the mock was called with a prompt containing text.

Return type:

None
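
The call-recording pattern behind `generate()`, `get_last_call()`, `assert_called_with()` and `reset()` can be illustrated with a small stand-in. The sketch below is not kerb's MockLLM: `TinyMockLLM` and `Behavior` are hypothetical names, only two modes are modeled, `generate()` returns a plain string rather than a MockResponse, and cycling through the list in sequential mode is an assumption.

```python
from enum import Enum
from typing import Any, Dict, List, Optional, Union


class Behavior(Enum):
    """Two of the documented mode values, redefined so the sketch runs alone."""

    FIXED = "fixed"
    SEQUENTIAL = "sequential"


class TinyMockLLM:
    """Hypothetical stand-in; kerb's MockLLM may behave differently."""

    def __init__(self, responses: Union[str, List[str], None] = None,
                 behavior: Behavior = Behavior.FIXED,
                 default_response: str = "Mock response") -> None:
        if isinstance(responses, str):
            responses = [responses]
        self.responses: List[str] = list(responses or [])
        self.behavior = behavior
        self.default_response = default_response
        self.call_history: List[Dict[str, Any]] = []

    def generate(self, prompt: str, **kwargs: Any) -> str:
        index = len(self.call_history)
        # Record every call so tests can inspect it afterwards.
        self.call_history.append({"prompt": prompt, "kwargs": kwargs})
        if not self.responses:
            return self.default_response
        if self.behavior is Behavior.SEQUENTIAL:
            # Assumption: wrap around once the list is exhausted.
            return self.responses[index % len(self.responses)]
        return self.responses[0]

    def get_last_call(self) -> Optional[Dict[str, Any]]:
        return self.call_history[-1] if self.call_history else None

    def assert_called_with(self, prompt_contains: str) -> None:
        assert any(prompt_contains in call["prompt"] for call in self.call_history), (
            f"no recorded prompt contains {prompt_contains!r}"
        )

    def reset(self) -> None:
        self.call_history.clear()


llm = TinyMockLLM(["first", "second"], behavior=Behavior.SEQUENTIAL)
replies = [llm.generate("Summarize the report"),
           llm.generate("Translate it"),
           llm.generate("Again")]
llm.assert_called_with("Translate")
last_prompt = llm.get_last_call()["prompt"]
```

Keeping the history on the mock lets a test assert on what the code under test sent, not only on what it received back.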

class kerb.testing.MockStreamingLLM(response, chunk_size=10, delay_per_chunk=0.01)

Bases: object

Mock streaming LLM for testing streaming responses.

__init__(response, chunk_size=10, delay_per_chunk=0.01)

Initialize mock streaming LLM.

Parameters:
  • response (str) – Full response to stream

  • chunk_size (int) – Characters per chunk

  • delay_per_chunk (float) – Delay between chunks in seconds

generate_stream(prompt, **kwargs)

Generate streaming mock response.

Parameters:
  • prompt (Union[str, List[Dict[str, str]]]) – Input prompt

  • **kwargs – Additional parameters (ignored)

Yields:

Response chunks
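
The chunking described by `chunk_size` (characters per chunk) and `delay_per_chunk` (seconds between chunks) can be sketched as a plain generator. `stream_in_chunks` is a hypothetical helper written for illustration, not the library's implementation; whether kerb sleeps before or after each chunk is not stated here.

```python
import time
from typing import Iterator


def stream_in_chunks(response: str, chunk_size: int = 10,
                     delay_per_chunk: float = 0.0) -> Iterator[str]:
    """Yield the response in slices of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(response), chunk_size):
        if delay_per_chunk > 0:
            # Simulate network pacing between chunks.
            time.sleep(delay_per_chunk)
        yield response[start:start + chunk_size]


chunks = list(stream_in_chunks("The quick brown fox jumps", chunk_size=10))
```

Joining the chunks reproduces the full response, which is a convenient invariant to assert when testing code that consumes a stream.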

kerb.testing.create_mock_llm(responses, behavior=MockBehavior.FIXED, **kwargs)

Helper to create a mock LLM instance.

Parameters:
  • responses (Union[str, List[str], Dict[str, str]]) – Response(s) to configure

  • behavior (MockBehavior) – Behavior mode

  • **kwargs – Additional MockLLM parameters

Return type:

MockLLM

Returns:

Configured MockLLM instance

class kerb.testing.FixtureManager(fixtures_dir=None)

Bases: object

Manage and organize test fixtures.

__init__(fixtures_dir=None)

Initialize fixture manager.

Parameters:

fixtures_dir (Optional[Path]) – Directory to store fixtures

add_fixture(name, prompt, response, **kwargs)

Add a fixture.

Parameters:
  • name (str) – Fixture name/ID

  • prompt (str) – Prompt text

  • response (str) – Response text

  • **kwargs – Additional metadata

Return type:

None

get_fixture(name)

Get a fixture by name.

Return type:

Optional[FixtureData]

save(format=FixtureFormat.JSON)

Save fixtures to disk.

Return type:

None

load(filepath)

Load fixtures from disk.

Return type:

None

kerb.testing.load_fixtures(filepath)

Load fixtures from a file.

Parameters:

filepath (Path) – Path to fixture file

Return type:

Dict[str, FixtureData]

Returns:

Dictionary of fixture name to FixtureData

kerb.testing.save_fixtures(fixtures, filepath, format=FixtureFormat.JSON)

Save fixtures to a file.

Return type:

None
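
Saving and loading fixtures amounts to a round trip between a name-to-fixture mapping and a file. The sketch below shows that round trip for JSON only, using plain dictionaries. The on-disk layout is an assumption, and `save_fixture_dict` and `load_fixture_dict` are hypothetical helpers, not the kerb functions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_fixture_dict(fixtures: Dict[str, Dict[str, Any]], filepath: Path) -> None:
    """Write a name -> fixture mapping as pretty-printed JSON."""
    filepath.write_text(json.dumps(fixtures, indent=2), encoding="utf-8")


def load_fixture_dict(filepath: Path) -> Dict[str, Dict[str, Any]]:
    """Read the mapping back from disk."""
    return json.loads(filepath.read_text(encoding="utf-8"))


fixtures = {
    "greeting": {"prompt": "Say hello", "response": "Hello!", "tags": ["smoke"]},
}

# A temporary directory keeps the example from touching real fixture files.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixtures.json"
    save_fixture_dict(fixtures, path)
    restored = load_fixture_dict(path)
```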

class kerb.testing.TestDataset(name, examples=None)

Bases: object

Dataset for evaluation testing.

__init__(name, examples=None)

Initialize dataset.

add_example(input, output, metadata=None)

Add an example to the dataset.

Return type:

None

__len__()

Get dataset size.

Return type:

int

__getitem__(idx)

Get example by index.

Return type:

Dict[str, Any]

__iter__()

Iterate over examples.

save(filepath)

Save dataset to disk.

Return type:

None

classmethod load(filepath)

Load dataset from disk.

Return type:

TestDataset
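
The `__len__`, `__getitem__` and `__iter__` methods make a dataset usable like a list of examples. The stand-in below shows that container protocol in isolation. `TinyDataset` is a hypothetical class, and the `input`/`output`/`metadata` keys of each example are an assumption taken from the `add_example()` parameter names.

```python
from typing import Any, Dict, Iterator, List, Optional


class TinyDataset:
    """Hypothetical stand-in showing the container protocol, not kerb's class."""

    def __init__(self, name: str,
                 examples: Optional[List[Dict[str, Any]]] = None) -> None:
        self.name = name
        self.examples: List[Dict[str, Any]] = list(examples or [])

    def add_example(self, input: str, output: str,
                    metadata: Optional[Dict[str, Any]] = None) -> None:
        # Assumption: examples are stored as plain dictionaries.
        self.examples.append(
            {"input": input, "output": output, "metadata": metadata or {}}
        )

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self.examples[idx]

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        return iter(self.examples)


dataset = TinyDataset("capitals")
dataset.add_example("Capital of France?", "Paris")
dataset.add_example("Capital of Japan?", "Tokyo", metadata={"difficulty": "easy"})
outputs = [example["output"] for example in dataset]
```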

kerb.testing.create_dataset(name, examples, metadata=None)

Create a test dataset.

Return type:

TestDataset

Returns:

TestDataset instance

kerb.testing.load_dataset(filepath)

Load dataset from file.

Parameters:

filepath (Path) – Path to dataset file

Return type:

TestDataset

Returns:

TestDataset instance

kerb.testing.assert_response_contains(response, expected, case_sensitive=False)

Assert that response contains expected text.

Parameters:
  • response (str) – Response to check

  • expected (Union[str, List[str]]) – Expected text or list of expected texts

  • case_sensitive (bool) – Whether to do case-sensitive matching

Return type:

None
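
A containment assertion with these parameters might look like the sketch below. `check_contains` is a hypothetical helper, not the kerb function, and it assumes that when `expected` is a list, every entry must be present.

```python
from typing import List, Union


def check_contains(response: str, expected: Union[str, List[str]],
                   case_sensitive: bool = False) -> None:
    """Raise AssertionError unless every expected text occurs in the response."""
    needles = [expected] if isinstance(expected, str) else list(expected)
    haystack = response if case_sensitive else response.lower()
    missing = [
        needle for needle in needles
        if (needle if case_sensitive else needle.lower()) not in haystack
    ]
    if missing:
        raise AssertionError(f"response is missing expected text: {missing}")


# Passes: matching ignores case by default.
check_contains("Paris is the capital of France.", ["paris", "FRANCE"])
```

Note that the documented default is `case_sensitive=False`, so a test that cares about exact casing has to opt in explicitly.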

kerb.testing.assert_response_json(response, expected_schema=None)

Assert that response is valid JSON.

Parameters:
  • response (str) – Response to check

  • expected_schema (Optional[Dict]) – Optional JSON schema to validate against

Return type:

Dict[str, Any]

Returns:

Parsed JSON data
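
The parse-then-validate flow can be sketched with the standard library alone. `check_json` below is a hypothetical helper, not kerb's function. It assumes the schema is a JSON Schema style dictionary and checks only its `required` keys, since full schema validation needs a dedicated library.

```python
import json
from typing import Any, Dict, Optional


def check_json(response: str,
               expected_schema: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Parse the response as JSON and return the data, or raise AssertionError."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise AssertionError(f"response is not valid JSON: {exc}") from exc
    if expected_schema is not None:
        if not isinstance(data, dict):
            raise AssertionError("expected a JSON object at the top level")
        # Assumption: only the schema's "required" list is enforced here.
        missing = [key for key in expected_schema.get("required", [])
                   if key not in data]
        if missing:
            raise AssertionError(f"missing required keys: {missing}")
    return data


parsed = check_json('{"name": "Ada", "age": 36}', {"required": ["name", "age"]})
```

Returning the parsed data lets a test continue with field-level assertions without parsing the response a second time.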

kerb.testing.assert_response_quality(response, min_words=None, no_repetition=False, no_empty_lines=False)

Assert response quality metrics.

Parameters:
  • response (str) – Response to check

  • min_words (Optional[int]) – Minimum word count

  • no_repetition (bool) – Check for excessive repetition

  • no_empty_lines (bool) – Check for empty lines

Return type:

None
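
The three quality checks can be sketched as follows. `check_quality` is a hypothetical helper, and this reference does not define what counts as excessive repetition; the sketch assumes it means the same word three times in a row, which may differ from kerb's rule.

```python
from typing import Optional


def check_quality(response: str, min_words: Optional[int] = None,
                  no_repetition: bool = False,
                  no_empty_lines: bool = False) -> None:
    """Raise AssertionError when an enabled quality check fails."""
    words = response.split()
    if min_words is not None and len(words) < min_words:
        raise AssertionError(
            f"expected at least {min_words} words, got {len(words)}"
        )
    if no_repetition:
        lowered = [word.lower() for word in words]
        # Assumption: "excessive" means one word three times in a row.
        for i in range(len(lowered) - 2):
            if lowered[i] == lowered[i + 1] == lowered[i + 2]:
                raise AssertionError(
                    f"word repeated three times in a row: {lowered[i]!r}"
                )
    if no_empty_lines and any(not line.strip() for line in response.splitlines()):
        raise AssertionError("response contains an empty line")


# Passes all three checks.
check_quality("The model answered clearly and briefly.", min_words=5,
              no_repetition=True, no_empty_lines=True)
```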

Testing utilities for LLM outputs and evaluation.