Generation Module

LLM generation utilities for unified API access and response management.

This module provides comprehensive tools for LLM generation across multiple providers.

Usage:

# Common from kerb.generation import generate, generate_stream, Generator

# Providers from kerb.generation.providers import (

OpenAIGenerator, AnthropicGenerator, GoogleGenerator,

)

# Utilities from kerb.generation.utils import retry_with_exponential_backoff, batch_generate

kerb.generation.generate(messages, model=None, config=None, api_key=None, provider=None, use_cache=True, cost_tracker=None, track_cost=False, rate_limiter=None, max_retries=3, **kwargs)[source]

Universal generator function - generate responses from any LLM provider.

This is the main generation function that routes to the appropriate provider based on the model and provider parameters.

Parameters:

messages (Union[List[Message], List[Dict[str, str]], str]) – Input messages (can be string, list of dicts, or list of Message objects)
model (Union[str, ModelName, None]) – Model to use (ModelName enum or string for custom models). If not provided, must be specified in config.
config (Optional[GenerationConfig]) – Generation configuration
api_key (Optional[str]) – API key (if not provided, uses environment variable)
provider (Optional[LLMProvider]) – LLMProvider enum specifying which API to use
use_cache (bool) – Whether to use response caching
cost_tracker (Optional[CostTracker]) – Optional cost tracker instance
track_cost (bool) – Whether to track costs in global tracker
rate_limiter (Optional[RateLimiter]) – Optional rate limiter instance
max_retries (int) – Maximum retry attempts for failed requests
**kwargs – Additional config parameters

Returns:

The generated response

Return type:

GenerationResponse

Examples

>>> # Using ModelName enum
>>> response = generate("Hello", model=ModelName.GPT_4O_MINI, provider=LLMProvider.OPENAI)

>>> # Using custom model name
>>> response = generate("Hello", model="my-custom-gpt", provider=LLMProvider.OPENAI)

>>> # Different providers
>>> response = generate("Hello", model=ModelName.CLAUDE_35_HAIKU, provider=LLMProvider.ANTHROPIC)

kerb.generation.generate_stream(messages, model=None, config=None, api_key=None, provider=None, callback=None, **kwargs)[source]

Generate streaming response from any LLM provider.

Parameters:

messages (Union[List[Message], List[Dict[str, str]], str]) – Input messages (can be string, list of dicts, or list of Message objects)
model (Union[str, ModelName, None]) – Model to use (ModelName enum or string for custom models). If not provided, must be specified in config.
config (Optional[GenerationConfig]) – Generation configuration
api_key (Optional[str]) – API key (if not provided, uses environment variable)
provider (Optional[LLMProvider]) – LLMProvider enum specifying which API to use
callback (Optional[Callable[[StreamChunk], None]]) – Optional callback function for each chunk
**kwargs – Additional config parameters

Yields:

StreamChunk – Chunks of the generated response

kerb.generation.generate_batch(prompts, model=None, config=None, api_key=None, provider=None, max_concurrent=5, show_progress=False, **kwargs)[source]

Generate batch responses.

Parameters:

prompts (List[Union[str, List[Message]]]) – List of prompts to process
model (Union[str, ModelName, None]) – Model to use (ModelName enum or string for custom models). If not provided, must be specified in config.
config (Optional[GenerationConfig]) – Generation configuration
api_key (Optional[str]) – API key (if not provided, uses environment variable)
provider (Optional[LLMProvider]) – LLMProvider enum specifying which API to use
max_concurrent (int) – Maximum concurrent requests
show_progress (bool) – Whether to show progress
**kwargs – Additional config parameters

Returns:

List of generated responses

Return type:

List[GenerationResponse]

async kerb.generation.generate_async(messages, model=None, config=None, api_key=None, provider=None, use_cache=True, cost_tracker=None, track_cost=False, max_retries=3, **kwargs)[source]

Async generation.

Parameters:

messages (Union[List[Message], List[Dict[str, str]], str]) – Input messages (can be string, list of dicts, or list of Message objects)
model (Union[str, ModelName, None]) – Model to use (ModelName enum or string for custom models). If not provided, must be specified in config.
config (Optional[GenerationConfig]) – Generation configuration
api_key (Optional[str]) – API key (if not provided, uses environment variable)
provider (Optional[LLMProvider]) – LLMProvider enum specifying which API to use
use_cache (bool) – Whether to use response caching
cost_tracker (Optional[CostTracker]) – Optional cost tracker instance
track_cost (bool) – Whether to track costs in global tracker
max_retries (int) – Maximum retry attempts for failed requests
**kwargs – Additional config parameters

Returns:

The generated response

Return type:

GenerationResponse

class kerb.generation.Generator(model, api_key=None, provider=None, cost_tracker=None, **default_config)[source]

Bases: object

Universal LLM generator - easily switch between models and providers.

This class provides a convenient stateful interface for LLM generation with support for both enum-based and string-based model specification. It makes it easy to switch between different models and providers without changing your code structure.

Examples

>>> # Using ModelName enum
>>> gen = Generator(model=ModelName.GPT_4O_MINI, provider=LLMProvider.OPENAI)
>>> response = gen.generate("Hello!")

>>> # Using custom model name
>>> gen = Generator(model="my-custom-model", provider=LLMProvider.OPENAI)
>>> response = gen.generate("Hello!")

>>> # Easy model switching
>>> gen_gpt = Generator(model=ModelName.GPT_4O_MINI, provider=LLMProvider.OPENAI, temperature=0.7)
>>> gen_claude = Generator(model=ModelName.CLAUDE_35_HAIKU, provider=LLMProvider.ANTHROPIC, temperature=0.7)

__init__(model, api_key=None, provider=None, cost_tracker=None, **default_config)[source]

Initialize the universal Generator.

Parameters:

model (Union[str, ModelName]) – Model to use (ModelName enum or string for custom models)
api_key (Optional[str]) – API key (if not provided, uses environment variable)
provider (Optional[LLMProvider]) – LLMProvider enum specifying which API to use
cost_tracker (Optional[CostTracker]) – Optional cost tracker instance
**default_config – Default configuration parameters (temperature, max_tokens, etc.)

generate(messages, **kwargs)[source]

Generate a response.

Parameters:

messages (Union[List[Message], List[Dict[str, str]], str]) – Input messages
**kwargs – Override default config parameters

Returns:

The generated response

Return type:

GenerationResponse

stream(messages, **kwargs)[source]

Generate a streaming response.

Parameters:

messages (Union[List[Message], List[Dict[str, str]], str]) – Input messages
**kwargs – Override default config parameters

Yields:

StreamChunk – Chunks of the generated response

batch(prompts, **kwargs)[source]

Generate batch responses.

Parameters:

prompts (List[Union[str, List[Message]]]) – List of prompts to process
**kwargs – Override default config parameters

Returns:

List of generated responses

Return type:

List[GenerationResponse]

class kerb.generation.GenerationConfig(model, temperature=0.7, max_tokens=None, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0, stop_sequences=None, stream=False, n=1, logprobs=None, seed=None, response_format=None, tools=None, tool_choice=None, reasoning_level=None, reasoning_budget=None, enable_grounding=None, grounding_config=None)[source]

Bases: object

Configuration for LLM generation.

model: str

temperature: float = 0.7

max_tokens: int | None = None

top_p: float = 1.0

frequency_penalty: float = 0.0

presence_penalty: float = 0.0

stop_sequences: List[str] | None = None

stream: bool = False

n: int = 1

logprobs: int | None = None

seed: int | None = None

response_format: Dict[str, Any] | None = None

tools: List[Dict[str, Any]] | None = None

tool_choice: str | Dict[str, Any] | None = None

reasoning_level: str | ReasoningLevel | None = None

reasoning_budget: int | None = None

enable_grounding: bool | None = None

grounding_config: Dict[str, Any] | None = None

__init__(model, temperature=0.7, max_tokens=None, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0, stop_sequences=None, stream=False, n=1, logprobs=None, seed=None, response_format=None, tools=None, tool_choice=None, reasoning_level=None, reasoning_budget=None, enable_grounding=None, grounding_config=None)

class kerb.generation.GenerationResponse(content, model, provider, usage, finish_reason=None, latency=0.0, cost=0.0, cached=False, metadata=<factory>, raw_response=None)[source]

Bases: object

Response from LLM generation.

Note: LLMProvider is imported lazily to avoid circular imports.

content: str

model: str

provider: Any

usage: Usage

finish_reason: str | None = None

latency: float = 0.0

cost: float = 0.0

cached: bool = False

metadata: Dict[str, Any]

raw_response: Any | None = None

to_dict()[source]

Convert to dictionary format.

Return type:: Dict[str, Any]

__init__(content, model, provider, usage, finish_reason=None, latency=0.0, cost=0.0, cached=False, metadata=<factory>, raw_response=None)

class kerb.generation.StreamChunk(content, finish_reason=None, model=None, metadata=<factory>)[source]

Bases: object

Represents a chunk from streaming generation.

content: str

finish_reason: str | None = None

model: str | None = None

metadata: Dict[str, Any]

__init__(content, finish_reason=None, model=None, metadata=<factory>)

class kerb.generation.Usage(prompt_tokens=0, completion_tokens=0, total_tokens=0)[source]

Bases: object

Token usage information.

prompt_tokens: int = 0

completion_tokens: int = 0

total_tokens: int = 0

property cost: float: Calculate cost based on token usage (requires model pricing).

__init__(prompt_tokens=0, completion_tokens=0, total_tokens=0)

class kerb.generation.LLMProvider(*values)[source]

Bases: Enum

Supported LLM providers.

OPENAI = 'openai'

ANTHROPIC = 'anthropic'

GOOGLE = 'google'

LOCAL = 'local'

class kerb.generation.ModelName(*values)[source]

Bases: Enum

Supported model names with their actual string identifiers.

GPT_5_4 = 'gpt-5.4'

GPT_5_4_PRO = 'gpt-5.4-pro'

GPT_5_3 = 'gpt-5.3'

GPT_5_2 = 'gpt-5.2'

GPT_5 = 'gpt-5'

GPT_5_MINI = 'gpt-5-mini'

GPT_5_NANO = 'gpt-5-nano'

GPT_4_1 = 'gpt-4.1'

GPT_4_1_MINI = 'gpt-4.1-mini'

GPT_4_1_NANO = 'gpt-4.1-nano'

GPT_4O = 'gpt-4o'

GPT_4O_MINI = 'gpt-4o-mini'

O3_PRO = 'o3-pro'

O3 = 'o3'

O3_MINI = 'o3-mini'

O1 = 'o1'

O1_PRO = 'o1-pro'

O1_MINI = 'o1-mini'

O4_MINI = 'o4-mini'

CLAUDE_OPUS_4_6 = 'claude-opus-4-6'

CLAUDE_SONNET_4_6 = 'claude-sonnet-4-6'

CLAUDE_HAIKU_4_5 = 'claude-haiku-4-5'

CLAUDE_45_OPUS = 'claude-4.5-opus-20260212'

CLAUDE_45_SONNET = 'claude-4.5-sonnet-20260212'

CLAUDE_45_HAIKU = 'claude-4.5-haiku-20260212'

CLAUDE_OPUS_4 = 'claude-opus-4'

CLAUDE_SONNET_4 = 'claude-sonnet-4'

CLAUDE_35_SONNET = 'claude-3-5-sonnet-20241022'

CLAUDE_35_HAIKU = 'claude-3-5-haiku-20241022'

GEMINI_3_1_PRO = 'gemini-3.1-pro-preview'

GEMINI_3_1_FLASH_LITE = 'gemini-3.1-flash-lite-preview'

GEMINI_3_PRO = 'gemini-3-pro'

GEMINI_3_FLASH = 'gemini-3-flash'

GEMINI_2_5_PRO = 'gemini-2.5-pro'

GEMINI_2_5_FLASH = 'gemini-2.5-flash'

GEMINI_2_5_FLASH_LITE = 'gemini-2.5-flash-lite'

class kerb.generation.MessageRole(*values)[source]

Bases: Enum

Message roles in conversations.

SYSTEM = 'system'

USER = 'user'

ASSISTANT = 'assistant'

FUNCTION = 'function'

TOOL = 'tool'

class kerb.generation.Message(role, content, timestamp=None, metadata=<factory>, name=None, function_call=None, tool_calls=None)[source]

Bases: object

Universal message representation for conversations.

Consolidates the Message classes from generation/ and memory/ packages to provide a single, consistent message representation.

role: The role of the message sender (system, user, assistant, etc.)

content: The message content

timestamp: Optional ISO format timestamp (auto-generated if not provided)

metadata: Additional metadata about the message

name: Optional name for the message sender (used in function calling)

function_call: Optional function call information (legacy)

tool_calls: Optional list of tool calls

Examples

>>> # Simple user message
>>> msg = Message(role="user", content="Hello!")

>>> # System message with enum role
>>> msg = Message(
...     role=MessageRole.SYSTEM,
...     content="You are a helpful assistant"
... )

>>> # Message with metadata
>>> msg = Message(
...     role="assistant",
...     content="Here's the answer",
...     metadata={"model": "gpt-4o", "tokens": 150}
... )

role: MessageRole | str

content: str

timestamp: str | None = None

metadata: Dict[str, Any]

name: str | None = None

function_call: Dict[str, Any] | None = None

tool_calls: List[Dict[str, Any]] | None = None

__post_init__()[source]: Auto-generate timestamp if not provided.

to_dict()[source]

Convert message to dictionary format.

Return type:: Dict[str, Any]
Returns:: Dictionary representation suitable for API calls

classmethod from_dict(data)[source]

Create message from dictionary.

Parameters:: data (Dict[str, Any]) – Dictionary with message data
Return type:: Message
Returns:: New Message instance

__repr__()[source]

String representation of the message.

Return type:: str

__init__(role, content, timestamp=None, metadata=<factory>, name=None, function_call=None, tool_calls=None)

kerb.generation.retry_with_exponential_backoff(func, max_retries=3, initial_delay=1.0, exponential_base=2.0, jitter=True, retryable_exceptions=(<class 'Exception'>, ))[source]

Retry a function with exponential backoff.

Parameters:

func (Callable) – Function to retry
max_retries (int) – Maximum number of retries
initial_delay (float) – Initial delay in seconds
exponential_base (float) – Base for exponential backoff
jitter (bool) – Whether to add random jitter
retryable_exceptions (tuple) – Exceptions that trigger retry

Return type:

Any

Returns:

Result from function

Raises:

Last exception if all retries fail –

async kerb.generation.async_retry_with_exponential_backoff(func, max_retries=3, initial_delay=1.0, exponential_base=2.0, jitter=True, retryable_exceptions=(<class 'Exception'>, ))[source]

Async version of retry with exponential backoff.

Parameters:

func (Callable) – Async function to retry
max_retries (int) – Maximum number of retries
initial_delay (float) – Initial delay in seconds
exponential_base (float) – Base for exponential backoff
jitter (bool) – Whether to add random jitter
retryable_exceptions (tuple) – Exceptions that trigger retry

Return type:

Any

Returns:

Result from function

Raises:

Last exception if all retries fail –

kerb.generation.parse_json_response(response)[source]

Parse JSON from LLM response.

Handles markdown code blocks and other formatting.

Parameters:: response (Union[GenerationResponse, str]) – GenerationResponse or content string
Returns:: Parsed JSON
Return type:: Dict[str, Any]
Raises:: ValueError – If JSON cannot be parsed

Example

>>> response = generate("Return JSON", model="gpt-4o-mini",
...                     provider=LLMProvider.OPENAI,
...                     response_format={"type": "json_object"})
>>> data = parse_json_response(response)

kerb.generation.validate_response(response, min_length=None, max_length=None, must_contain=None, must_not_contain=None, pattern=None)[source]

Validate LLM response against criteria.

Parameters:

response (GenerationResponse) – Generation response
min_length (Optional[int]) – Minimum content length
max_length (Optional[int]) – Maximum content length
must_contain (Optional[List[str]]) – Strings that must be present
must_not_contain (Optional[List[str]]) – Strings that must not be present
pattern (Optional[str]) – Regex pattern that must match

Returns:

True if valid, False otherwise

Return type:

bool

Example

>>> response = generate("List 3 programming languages",
...                     model="gpt-4o-mini", provider=LLMProvider.OPENAI)
>>> is_valid = validate_response(response, min_length=20, must_contain=["Python"])

kerb.generation.format_messages(system=None, user=None, assistant=None, history=None)[source]

Format messages for generation.

Parameters:

system (Optional[str]) – System message
user (Optional[str]) – User message
assistant (Optional[str]) – Assistant message (for few-shot examples)
history (Optional[List[Dict[str, str]]]) – Conversation history as list of {“role”: “…”, “content”: “…”}

Returns:

Formatted messages

Return type:

List[Message]

Example

>>> messages = format_messages(system="You are helpful", user="What is Python?")
>>> response = generate(messages, model="gpt-4o-mini", provider=LLMProvider.OPENAI)

kerb.generation.calculate_cost(model, usage)[source]

Calculate cost for a request.

Parameters:

model (Union[str, ModelName]) – Model name (as string or ModelName enum)
usage (Usage) – Token usage

Returns:

Cost in USD

Return type:

float

class kerb.generation.RateLimiter(requests_per_minute=60, tokens_per_minute=None)[source]

Bases: object

Simple rate limiter for API requests.

__init__(requests_per_minute=60, tokens_per_minute=None)[source]

Initialize rate limiter.

Parameters:

requests_per_minute (int) – Maximum requests per minute
tokens_per_minute (Optional[int]) – Maximum tokens per minute (optional)

wait_if_needed(estimated_tokens=0)[source]

Wait if rate limit would be exceeded.

Parameters:: estimated_tokens (int) – Estimated token count for this request
Return type:: None

class kerb.generation.ResponseCache(max_size=1000, ttl=3600)[source]

Bases: object

Simple in-memory cache for LLM responses.

__init__(max_size=1000, ttl=3600)[source]

Initialize response cache.

Parameters:

max_size (int) – Maximum number of cached responses
ttl (int) – Time to live in seconds

get(messages, config)[source]

Get cached response if available and not expired.

Return type:: Optional[GenerationResponse]

set(messages, config, response)[source]

Cache a response.

Return type:: None

class kerb.generation.CostTracker[source]

Bases: object

Track costs across LLM API calls.

__init__()[source]: Initialize cost tracker.

add_request(model, usage, cost)[source]

Record a request.

Return type:: None

get_summary()[source]

Get cost tracking summary.

Return type:: Dict[str, Any]

reset()[source]

Reset all tracking.

Return type:: None

kerb.generation.get_cost_summary(cost_tracker=None)[source]

Get cost tracking summary.

Parameters:: cost_tracker (Optional[CostTracker]) – CostTracker instance. If None, uses global tracker.
Returns:: Cost summary with totals and per-model breakdown
Return type:: Dict[str, Any]

Example

>>> generate("Hello", model="gpt-4o-mini", provider=LLMProvider.OPENAI, track_cost=True)
>>> summary = get_cost_summary()
>>> print(f"Total cost: ${summary['total_cost']}")

kerb.generation.reset_cost_tracking(cost_tracker=None)[source]

Reset cost tracking.

Parameters:: cost_tracker (Optional[CostTracker]) – CostTracker instance. If None, resets global tracker.
Return type:: None

Example

>>> reset_cost_tracking()

class kerb.generation.BaseProvider(api_key=None, **kwargs)[source]

Bases: ABC

Base class for LLM providers.

Custom providers should inherit from this class and implement the required methods.

__init__(api_key=None, **kwargs)[source]

Initialize provider.

Parameters:

api_key (Optional[str]) – API key (if None, will try to get from environment)
**kwargs – Provider-specific configuration

abstractmethod generate(messages, config)[source]

Generate a response.

Parameters:

messages (List[Message]) – List of conversation messages
config (GenerationConfig) – Generation configuration

Return type:

GenerationResponse

Returns:

GenerationResponse

abstractmethod generate_stream(messages, config)[source]

Generate a streaming response.

Parameters:

messages (List[Message]) – List of conversation messages
config (GenerationConfig) – Generation configuration

Yields:

StreamChunk

abstractmethod async generate_async(messages, config)[source]

Generate a response asynchronously.

Parameters:

messages (List[Message]) – List of conversation messages
config (GenerationConfig) – Generation configuration

Return type:

GenerationResponse

Returns:

GenerationResponse

validate_config(config)[source]

Validate configuration for this provider.

Parameters:: config (GenerationConfig) – Generation configuration
Returns:: True if valid
Return type:: bool
Raises:: ValueError – If configuration is invalid

kerb.generation.register_provider(name, provider)[source]

Register a custom provider.

Parameters:

name (str) – Provider name (used in model strings like “custom::model-name”)
provider (BaseProvider) – Provider instance

Return type:

None

Examples

>>> from kerb.generation.base import register_provider
>>> provider = MyCustomProvider(api_key="...")
>>> register_provider("mycustom", provider)
>>> # Now can use: generate(messages, model="mycustom::my-model")

kerb.generation.get_provider(name)[source]

Get a registered provider by name.

Parameters:: name (str) – Provider name
Return type:: Optional[BaseProvider]
Returns:: Provider instance or None if not found

kerb.generation.list_providers()[source]

List all registered provider names.

Return type:: List[str]

class kerb.generation.OpenAIGenerator(api_key=None, **kwargs)[source]

Bases: object

OpenAI generator with simplified interface.

This is a convenience class for OpenAI-specific generation.

__init__(api_key=None, **kwargs)[source]

Initialize OpenAI generator.

Parameters:

api_key (Optional[str]) – OpenAI API key (if None, uses OPENAI_API_KEY env var)
**kwargs – Additional configuration

generate(messages, model='gpt-4o-mini', **kwargs)[source]

Generate using OpenAI API.

Parameters:

messages (List[Message]) – Conversation messages
model (str) – Model name
**kwargs – Additional generation parameters

Return type:

GenerationResponse

Returns:

GenerationResponse

stream(messages, model='gpt-4o-mini', callback=None, **kwargs)[source]

Stream from OpenAI API.

Parameters:

messages (List[Message]) – Conversation messages
model (str) – Model name
callback (Optional[Callable[[StreamChunk], None]]) – Optional callback for each chunk
**kwargs – Additional generation parameters

Return type:

Iterator[StreamChunk]

Returns:

Iterator of StreamChunks

class kerb.generation.AnthropicGenerator(api_key=None, **kwargs)[source]

Bases: object

Anthropic generator with simplified interface.

This is a convenience class for Anthropic-specific generation.

__init__(api_key=None, **kwargs)[source]

Initialize Anthropic generator.

Parameters:

api_key (Optional[str]) – Anthropic API key (if None, uses ANTHROPIC_API_KEY env var)
**kwargs – Additional configuration

generate(messages, model='claude-3-5-haiku-20241022', **kwargs)[source]

Generate using Anthropic API.

Parameters:

messages (List[Message]) – Conversation messages
model (str) – Model name
**kwargs – Additional generation parameters

Return type:

GenerationResponse

Returns:

GenerationResponse

stream(messages, model='claude-3-5-haiku-20241022', callback=None, **kwargs)[source]

Stream from Anthropic API.

Parameters:

messages (List[Message]) – Conversation messages
model (str) – Model name
callback (Optional[Callable[[StreamChunk], None]]) – Optional callback for each chunk
**kwargs – Additional generation parameters

Return type:

Iterator[StreamChunk]

Returns:

Iterator of StreamChunks

class kerb.generation.GoogleGenerator(api_key=None, **kwargs)[source]

Bases: object

Google Gemini generator with simplified interface.

This is a convenience class for Google Gemini-specific generation.

__init__(api_key=None, **kwargs)[source]

Initialize Google Gemini generator.

Parameters:

api_key (Optional[str]) – Google API key (if None, uses GOOGLE_API_KEY env var)
**kwargs – Additional configuration

generate(messages, model='gemini-2.5-flash', **kwargs)[source]

Generate using Google Gemini API.

Parameters:

messages (List[Message]) – Conversation messages
model (str) – Model name
**kwargs – Additional generation parameters

Return type:

GenerationResponse

Returns:

GenerationResponse

stream(messages, model='gemini-2.5-flash', callback=None, **kwargs)[source]

Stream from Google Gemini API.

Parameters:

messages (List[Message]) – Conversation messages
model (str) – Model name
callback (Optional[Callable[[StreamChunk], None]]) – Optional callback for each chunk
**kwargs – Additional generation parameters

Return type:

Iterator[StreamChunk]

Returns:

Iterator of StreamChunks

Unified LLM generation with multi-provider support.