Fine-Tuning Module

Fine-tuning utilities for preparing and managing LLM training data.

This module provides tools for preparing, validating, and converting fine-tuning data for LLMs across multiple providers (OpenAI, Anthropic, Google, HuggingFace, and a generic format).

Common Usage:

from kerb.fine_tuning import prepare_dataset, TrainingDataset
from kerb.fine_tuning import validate_dataset, analyze_dataset

Submodules:

  • types - Core data classes and enums
  • dataset - Dataset preparation and manipulation
  • formats - Format conversion for different providers
  • jsonl - JSONL file utilities
  • validation - Dataset validation functions
  • quality - Data quality analysis
  • prompts - System prompt utilities
  • training - Training configuration and optimization

Dataset Preparation:

  • prepare_dataset() - Main function to prepare datasets for fine-tuning
  • split_dataset() - Split data into train/validation/test sets
  • balance_dataset() - Balance dataset by label/category
  • augment_dataset() - Augment training data with variations
  • deduplicate_dataset() - Remove duplicate examples
  • sample_dataset() - Sample subset of dataset
  • shuffle_dataset() - Randomize dataset order
  • filter_dataset() - Filter dataset by criteria

Format Conversion:

  • to_openai_format() - Convert to OpenAI fine-tuning format
  • to_anthropic_format() - Convert to Anthropic fine-tuning format
  • to_google_format() - Convert to Google AI fine-tuning format
  • to_huggingface_format() - Convert to HuggingFace format
  • to_generic_format() - Convert to generic JSONL format
  • from_csv() - Convert CSV to fine-tuning format
  • from_json() - Convert JSON to fine-tuning format
  • from_parquet() - Convert Parquet to fine-tuning format

JSONL Utilities:

  • write_jsonl() - Write data to JSONL file
  • read_jsonl() - Read data from JSONL file
  • append_jsonl() - Append data to JSONL file
  • merge_jsonl() - Merge multiple JSONL files
  • validate_jsonl() - Validate JSONL file format
  • count_jsonl_lines() - Count lines in JSONL file
  • stream_jsonl() - Stream large JSONL files

Validation:

  • validate_dataset() - Validate dataset for fine-tuning
  • validate_format() - Validate format for specific provider
  • check_token_limits() - Check if examples exceed token limits
  • validate_messages() - Validate message structure
  • estimate_training_tokens() - Estimate total training tokens
  • estimate_cost() - Estimate fine-tuning cost
  • validate_completion_format() - Validate completion-based format
  • validate_chat_format() - Validate chat-based format

Data Quality:

  • analyze_dataset() - Analyze dataset statistics
  • check_data_quality() - Check for quality issues
  • detect_pii() - Detect personally identifiable information
  • compute_perplexity() - Compute perplexity with HuggingFace models
  • check_length_distribution() - Analyze token length distribution
  • detect_duplicates() - Find duplicate or near-duplicate examples
  • check_label_distribution() - Analyze label distribution

System Prompts:

  • generate_system_prompt() - Generate system prompts from examples
  • extract_system_prompts() - Extract system prompts from dataset
  • standardize_system_prompts() - Standardize system prompts
  • optimize_system_prompt() - Optimize system prompt for task

Training Utilities:

  • create_training_config() - Create training configuration
  • estimate_training_time() - Estimate training duration
  • calculate_optimal_batch_size() - Calculate optimal batch size
  • recommend_learning_rate() - Recommend learning rate
  • create_hyperparameter_grid() - Create hyperparameter search grid

Data Classes:

  • TrainingExample - Single training example
  • TrainingDataset - Complete training dataset
  • ValidationResult - Validation results
  • DatasetStats - Dataset statistics
  • TrainingConfig - Training configuration

Enums:

  • FineTuningProvider - Supported providers
  • DatasetFormat - Supported formats
  • SplitStrategy - Dataset split strategies
  • ValidationLevel - Validation strictness levels

class kerb.fine_tuning.FineTuningProvider(*values)[source]

Bases: Enum

Supported fine-tuning providers.

OPENAI = 'openai'
ANTHROPIC = 'anthropic'
GOOGLE = 'google'
HUGGINGFACE = 'huggingface'
GENERIC = 'generic'
class kerb.fine_tuning.DatasetFormat(*values)[source]

Bases: Enum

Supported dataset formats.

CHAT = 'chat'
COMPLETION = 'completion'
CLASSIFICATION = 'classification'
INSTRUCTION = 'instruction'
class kerb.fine_tuning.SplitStrategy(*values)[source]

Bases: Enum

Dataset splitting strategies.

RANDOM = 'random'
STRATIFIED = 'stratified'
TEMPORAL = 'temporal'
HASH = 'hash'
class kerb.fine_tuning.ValidationLevel(*values)[source]

Bases: Enum

Validation strictness levels.

STRICT = 'strict'
MODERATE = 'moderate'
LENIENT = 'lenient'
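
Because every member of these enums carries a lowercase string value, a configuration string can be mapped to a member by value lookup. The sketch below redefines FineTuningProvider locally so that it runs without kerb installed; parse_provider is a hypothetical helper and is not part of the documented API.

```python
from enum import Enum


class FineTuningProvider(Enum):
    """Local mirror of the documented enum, for illustration only."""

    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    GOOGLE = "google"
    HUGGINGFACE = "huggingface"
    GENERIC = "generic"


def parse_provider(name: str) -> FineTuningProvider:
    """Hypothetical helper: look a member up by value, ignoring case and padding."""
    return FineTuningProvider(name.strip().lower())


provider = parse_provider(" OpenAI ")
print(provider)
```

An unknown string raises ValueError, which is standard Enum behavior for a failed value lookup.
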
class kerb.fine_tuning.TrainingExample(messages=None, prompt=None, completion=None, label=None, metadata=<factory>)[source]

Bases: object

Represents a single training example.

messages: List[Dict[str, str]] | None = None
prompt: str | None = None
completion: str | None = None
label: str | None = None
metadata: Dict[str, Any]
to_dict()[source]

Convert to dictionary representation.

Return type:

Dict[str, Any]

get_text_content()[source]

Extract all text content from the example.

Return type:

str

compute_hash()[source]

Compute hash of example content for deduplication.

Return type:

str

__init__(messages=None, prompt=None, completion=None, label=None, metadata=<factory>)
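
To make the role of compute_hash() in deduplication concrete, here is a minimal stand-in with the same fields. It is a sketch under stated assumptions, not kerb's implementation: the class name ExampleSketch, the choice of SHA-256, and the decision to hash only text content (ignoring metadata) are all assumptions.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ExampleSketch:
    """Stand-in mirroring the documented TrainingExample fields."""

    messages: Optional[List[Dict[str, str]]] = None
    prompt: Optional[str] = None
    completion: Optional[str] = None
    label: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def get_text_content(self) -> str:
        # Concatenate chat message contents, then prompt and completion if set.
        parts = [m.get("content", "") for m in (self.messages or [])]
        parts += [p for p in (self.prompt, self.completion) if p]
        return "\n".join(parts)

    def compute_hash(self) -> str:
        # Assumption: only the text is hashed, so metadata does not affect identity.
        return hashlib.sha256(self.get_text_content().encode("utf-8")).hexdigest()


a = ExampleSketch(prompt="2+2?", completion="4", metadata={"source": "a"})
b = ExampleSketch(prompt="2+2?", completion="4", metadata={"source": "b"})
same_hash = a.compute_hash() == b.compute_hash()
print(same_hash)
```
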
class kerb.fine_tuning.TrainingDataset(examples, format, provider=None, metadata=<factory>)[source]

Bases: object

Represents a complete training dataset.

examples: List[TrainingExample]
format: DatasetFormat
provider: FineTuningProvider | None = None
metadata: Dict[str, Any]
to_list()[source]

Convert to list of dictionaries.

Return type:

List[Dict[str, Any]]

__init__(examples, format, provider=None, metadata=<factory>)
class kerb.fine_tuning.ValidationResult(is_valid, errors=<factory>, warnings=<factory>, total_examples=0, valid_examples=0, invalid_examples=0)[source]

Bases: object

Results from dataset validation.

is_valid: bool
errors: List[str]
warnings: List[str]
total_examples: int = 0
valid_examples: int = 0
invalid_examples: int = 0
add_error(error)[source]

Add an error message.

add_warning(warning)[source]

Add a warning message.

__init__(is_valid, errors=<factory>, warnings=<factory>, total_examples=0, valid_examples=0, invalid_examples=0)
class kerb.fine_tuning.DatasetStats(total_examples=0, total_tokens=0, avg_tokens_per_example=0.0, min_tokens=0, max_tokens=0, label_distribution=<factory>, avg_prompt_tokens=0.0, avg_completion_tokens=0.0, duplicate_count=0, metadata=<factory>)[source]

Bases: object

Statistics about a dataset.

total_examples: int = 0
total_tokens: int = 0
avg_tokens_per_example: float = 0.0
min_tokens: int = 0
max_tokens: int = 0
label_distribution: Dict[str, int]
avg_prompt_tokens: float = 0.0
avg_completion_tokens: float = 0.0
duplicate_count: int = 0
metadata: Dict[str, Any]
__init__(total_examples=0, total_tokens=0, avg_tokens_per_example=0.0, min_tokens=0, max_tokens=0, label_distribution=<factory>, avg_prompt_tokens=0.0, avg_completion_tokens=0.0, duplicate_count=0, metadata=<factory>)
class kerb.fine_tuning.TrainingConfig(model, n_epochs=3, batch_size=None, learning_rate_multiplier=None, prompt_loss_weight=0.01, validation_file=None, suffix=None, metadata=<factory>)[source]

Bases: object

Training configuration for fine-tuning.

model: str
n_epochs: int = 3
batch_size: int | None = None
learning_rate_multiplier: float | None = None
prompt_loss_weight: float = 0.01
validation_file: str | None = None
suffix: str | None = None
metadata: Dict[str, Any]
__init__(model, n_epochs=3, batch_size=None, learning_rate_multiplier=None, prompt_loss_weight=0.01, validation_file=None, suffix=None, metadata=<factory>)
kerb.fine_tuning.prepare_dataset(data, format=DatasetFormat.CHAT, provider=None, validate=True, deduplicate=True, shuffle=True)[source]

Prepare dataset for fine-tuning.

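Parameters:
  • data – Raw examples to prepare (the example below passes a list of dicts)

  • format (DatasetFormat) – Target dataset format

  • provider (Optional[FineTuningProvider]) – Target provider, if any

  • validate (bool) – Whether to validate the examples

  • deduplicate (bool) – Whether to remove duplicate examples

  • shuffle (bool) – Whether to shuffle the examples
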
Returns:

Prepared dataset

Return type:

TrainingDataset

Examples

>>> data = [
...     {"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]},
...     {"messages": [{"role": "user", "content": "Bye"}, {"role": "assistant", "content": "Goodbye!"}]}
... ]
>>> dataset = prepare_dataset(data, format=DatasetFormat.CHAT)
kerb.fine_tuning.split_dataset(dataset, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, strategy=SplitStrategy.RANDOM, seed=None)[source]

Split dataset into train/validation/test sets.

Parameters:
  • dataset (TrainingDataset) – Dataset to split

  • train_ratio (float) – Proportion for training

  • val_ratio (float) – Proportion for validation

  • test_ratio (float) – Proportion for testing

  • strategy (SplitStrategy) – Splitting strategy

  • seed (Optional[int]) – Random seed for reproducibility

Return type:

Tuple[TrainingDataset, TrainingDataset, TrainingDataset]

Returns:

Tuple of (train_dataset, val_dataset, test_dataset)
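
A random split by ratio can be sketched with the standard library alone. The function below works on a plain sequence rather than a TrainingDataset; random_split is a hypothetical name, and the rounding rule (truncate the train and validation sizes, give the remainder to the test set) is an assumption about behavior the documentation does not specify.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T],
    train_ratio: float = 0.8,
    val_ratio: float = 0.1,
    test_ratio: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of the items and cut it into three consecutive slices."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # a seeded RNG keeps the split reproducible
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test


train, val, test = random_split(range(100), seed=42)
print(len(train), len(val), len(test))
```
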

kerb.fine_tuning.balance_dataset(dataset, method='undersample', target_count=None)[source]

Balance dataset by label distribution.

Parameters:
  • dataset (TrainingDataset) – Dataset to balance

  • method (Union[BalanceMethod, str]) – Balancing method (BalanceMethod enum or string: 'undersample', 'oversample', 'smote', 'none')

  • target_count (Optional[int]) – Target count per label (if None, uses minority class for undersample or majority for oversample)

Returns:

Balanced dataset

Return type:

TrainingDataset

Examples

>>> from kerb.core.enums import BalanceMethod
>>> balanced = balance_dataset(dataset, method=BalanceMethod.UNDERSAMPLE)
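
The undersampling rule described above (cap every label at the minority class size unless target_count is given) can be illustrated on plain dicts. This is a sketch, not kerb's code: the undersample function, its seed parameter, and the dict-based example shape are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def undersample(
    examples: List[dict],
    target_count: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[dict]:
    """Keep at most `limit` randomly chosen examples per label."""
    by_label: Dict[str, List[dict]] = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    if not by_label:
        return []
    # Default limit is the size of the smallest class.
    limit = target_count if target_count is not None else min(
        len(group) for group in by_label.values()
    )
    rng = random.Random(seed)
    balanced: List[dict] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min(limit, len(group))))
    return balanced


data = [{"label": "pos", "text": f"p{i}"} for i in range(6)]
data += [{"label": "neg", "text": f"n{i}"} for i in range(2)]
balanced = undersample(data, seed=0)
print(len(balanced))
```
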
kerb.fine_tuning.deduplicate_dataset(dataset, similarity_threshold=1.0)[source]

Remove duplicate examples from dataset.

Parameters:
  • dataset (TrainingDataset) – Dataset to deduplicate

  • similarity_threshold (float) – Threshold for considering examples duplicates (1.0 = exact match)

Returns:

Deduplicated dataset

Return type:

TrainingDataset
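
The effect of similarity_threshold can be illustrated with difflib from the standard library. The similarity metric kerb uses is not documented here, so SequenceMatcher is a swapped-in stand-in, and deduplicate_texts is a hypothetical helper operating on strings.

```python
from difflib import SequenceMatcher
from typing import List


def deduplicate_texts(texts: List[str], similarity_threshold: float = 1.0) -> List[str]:
    """Keep the first occurrence of each text; drop later exact or near matches."""
    kept: List[str] = []
    for text in texts:
        if similarity_threshold >= 1.0:
            is_duplicate = text in kept  # exact match only
        else:
            is_duplicate = any(
                SequenceMatcher(None, text, other).ratio() >= similarity_threshold
                for other in kept
            )
        if not is_duplicate:
            kept.append(text)
    return kept


texts = ["Hello world", "Hello world", "Hello world!", "Goodbye"]
exact = deduplicate_texts(texts)
fuzzy = deduplicate_texts(texts, similarity_threshold=0.9)
print(exact)
print(fuzzy)
```

The pairwise comparison is quadratic in the number of kept texts, so a sketch like this only suits small datasets.
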

kerb.fine_tuning.sample_dataset(dataset, n, seed=None)[source]

Sample subset of dataset.

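Parameters:
  • dataset (TrainingDataset) – Dataset to sample from

  • n (int) – Number of examples to sample

  • seed (Optional[int]) – Random seed for reproducibility
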
Returns:

Sampled dataset

Return type:

TrainingDataset
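
Seeded sampling without replacement can be sketched as follows. sample_examples is a hypothetical helper on plain lists; capping n at the dataset size is an assumption, since the documentation does not say what happens when n exceeds it.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_examples(examples: Sequence[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Draw up to n distinct items; the same seed yields the same subset."""
    if n < 0:
        raise ValueError("n must be non-negative")
    pool = list(examples)
    return random.Random(seed).sample(pool, min(n, len(pool)))


subset = sample_examples(list(range(10)), 3, seed=7)
print(subset)
```
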

kerb.fine_tuning.filter_dataset(dataset, filter_fn)[source]

Filter dataset by custom criteria.

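Parameters:
  • dataset (TrainingDataset) – Dataset to filter

  • filter_fn – Callable applied to each example to decide whether it is kept
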
Returns:

Filtered dataset

Return type:

TrainingDataset
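
A filter function is typically a predicate over a single example. The sketch below assumes that filter_fn receives one example and returns a truthy value to keep it; that contract, the filter_examples helper, and the dict-based examples are illustrative assumptions.

```python
from typing import Callable, Dict, List


def filter_examples(
    examples: List[Dict[str, str]],
    filter_fn: Callable[[Dict[str, str]], bool],
) -> List[Dict[str, str]]:
    """Return the examples for which filter_fn returns a truthy value."""
    return [ex for ex in examples if filter_fn(ex)]


examples = [
    {"prompt": "Hi", "completion": ""},
    {"prompt": "What is 2+2?", "completion": "4"},
]
# Keep only examples whose completion is non-empty after stripping whitespace.
non_empty = filter_examples(examples, lambda ex: bool(ex["completion"].strip()))
print(non_empty)
```
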

kerb.fine_tuning.to_openai_format(dataset)[source]

Convert dataset to OpenAI fine-tuning format.

OpenAI format: {"messages": [{"role": "system/user/assistant", "content": "…"}]}

Parameters:

dataset (TrainingDataset) – Dataset to convert

Return type:

List[Dict[str, Any]]

Returns:

List of examples in OpenAI format
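
For a prompt/completion pair, the chat shape shown above can be produced by mapping the prompt to a user message and the completion to an assistant message. This is a sketch of that mapping only; to_chat_record is a hypothetical helper, and how kerb handles system prompts or existing messages is not specified here.

```python
from typing import Any, Dict, List, Optional


def to_chat_record(
    prompt: str,
    completion: str,
    system: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one {"messages": [...]} record from a prompt/completion pair."""
    messages: List[Dict[str, str]] = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    messages.append({"role": "assistant", "content": completion})
    return {"messages": messages}


record = to_chat_record("What is 2+2?", "4", system="You are a calculator.")
print(record)
```
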

kerb.fine_tuning.to_anthropic_format(dataset)[source]

Convert dataset to Anthropic fine-tuning format.

Parameters:

dataset (TrainingDataset) – Dataset to convert

Return type:

List[Dict[str, Any]]

Returns:

List of examples in Anthropic format

kerb.fine_tuning.from_csv(filepath, prompt_column, completion_column=None, label_column=None, format=DatasetFormat.COMPLETION)[source]

Convert CSV file to training dataset.

Parameters:
  • filepath (str) – Path to CSV file

  • prompt_column (str) – Name of prompt column

  • completion_column (Optional[str]) – Name of completion column

  • label_column (Optional[str]) – Name of label column

  • format (DatasetFormat) – Target format

Return type:

TrainingDataset

Returns:

TrainingDataset

kerb.fine_tuning.from_json(filepath, format=DatasetFormat.CHAT)[source]

Convert JSON file to training dataset.

Parameters:
  • filepath (str) – Path to JSON file

  • format (DatasetFormat) – Target format

Return type:

TrainingDataset

Returns:

TrainingDataset

kerb.fine_tuning.write_jsonl(data, filepath, compress=False, compression_type='gz', buffer_size=8192)[source]

Write data to JSONL file with optional compression.

Parameters:
  • data (Union[List[Dict[str, Any]], TrainingDataset]) – Data to write (list of dicts or TrainingDataset)

  • filepath (str) – Output file path

  • compress (bool) – Whether to compress the output

  • compression_type (str) – Type of compression ('gz', 'bz2', 'xz')

  • buffer_size (int) – Buffer size for writing

kerb.fine_tuning.read_jsonl(filepath, max_lines=None, skip_invalid=False)[source]

Read data from JSONL file with automatic compression detection.

Parameters:
  • filepath (str) – Input file path (supports .gz, .bz2, .xz compression)

  • max_lines (Optional[int]) – Maximum number of lines to read (None for all)

  • skip_invalid (bool) – Whether to skip invalid JSON lines

Return type:

List[Dict[str, Any]]

Returns:

List of dictionaries
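
The write/read round trip with gzip compression can be sketched with the standard library. The helpers below are stand-ins, not kerb's functions: they handle only the .gz case, detect compression from the file extension (an assumption about how automatic detection works), and mirror the max_lines and skip_invalid parameters.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Dict, List, Optional


def write_jsonl_gz(records: List[Dict[str, Any]], filepath: str) -> None:
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(filepath, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_auto(
    filepath: str,
    max_lines: Optional[int] = None,
    skip_invalid: bool = False,
) -> List[Dict[str, Any]]:
    """Read JSONL, choosing gzip or plain text from the file extension."""
    opener = gzip.open if filepath.endswith(".gz") else open
    rows: List[Dict[str, Any]] = []
    with opener(filepath, "rt", encoding="utf-8") as f:
        for line in f:
            if max_lines is not None and len(rows) >= max_lines:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                if not skip_invalid:
                    raise
    return rows


records = [{"prompt": "a", "completion": "b"}, {"prompt": "c", "completion": "d"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl.gz")
    write_jsonl_gz(records, path)
    rows = read_jsonl_auto(path)
    first_only = read_jsonl_auto(path, max_lines=1)
print(rows)
```
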

kerb.fine_tuning.validate_dataset(dataset, level=ValidationLevel.MODERATE, max_tokens=None)[source]

Validate dataset for fine-tuning.

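Parameters:
  • dataset (TrainingDataset) – Dataset to validate

  • level (ValidationLevel) – Validation strictness level

  • max_tokens (Optional[int]) – Maximum tokens allowed per example (None for no limit)
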
Return type:

ValidationResult

Returns:

ValidationResult
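
The kind of structural check a chat-format validator performs can be sketched as below. The specific rules (allowed roles, non-empty content, final message from the assistant) are illustrative assumptions and are not taken from kerb's validation levels; validate_chat_example is a hypothetical helper that returns error strings rather than a ValidationResult.

```python
from typing import Any, Dict, List

VALID_ROLES = {"system", "user", "assistant"}  # assumed role set


def validate_chat_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    errors: List[str] = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i}: not a dict")
            continue
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"message {i}: invalid role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i}: empty content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        errors.append("last message should come from the assistant")
    return errors


good = {"messages": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
]}
bad = {"messages": [{"role": "bot", "content": ""}]}
good_errors = validate_chat_example(good)
bad_errors = validate_chat_example(bad)
print(bad_errors)
```
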

kerb.fine_tuning.estimate_cost(dataset, model='gpt-4o-mini', n_epochs=3)[source]

Estimate fine-tuning cost.

Parameters:
  • dataset (TrainingDataset) – Dataset to train on

  • model (str) – Base model name

  • n_epochs (int) – Number of training epochs

Return type:

Dict[str, float]

Returns:

Dictionary with cost estimates
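
A cost estimate of this kind usually reduces to billed tokens times a per-token rate, where billed tokens are the dataset's tokens multiplied by the number of epochs. The sketch below shows only that arithmetic: the rate is a placeholder rather than a real price, and the function name and returned keys are hypothetical.

```python
from typing import Dict


def estimate_training_cost(
    total_tokens: int,
    n_epochs: int = 3,
    usd_per_million_tokens: float = 1.0,  # placeholder rate, not a real price
) -> Dict[str, float]:
    """Every epoch processes the full dataset, so billed tokens scale with epochs."""
    billed_tokens = total_tokens * n_epochs
    return {
        "billed_tokens": float(billed_tokens),
        "estimated_cost_usd": billed_tokens / 1_000_000 * usd_per_million_tokens,
    }


estimate = estimate_training_cost(500_000, n_epochs=3, usd_per_million_tokens=2.0)
print(estimate)
```
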

kerb.fine_tuning.analyze_dataset(dataset)[source]

Analyze dataset statistics.

Parameters:

dataset (TrainingDataset) – Dataset to analyze

Return type:

DatasetStats

Returns:

DatasetStats with comprehensive statistics
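
Several DatasetStats fields can be approximated with a few lines of standard-library code. The sketch below counts whitespace-separated words as a rough stand-in for tokens (a real tokenizer will give different counts) and returns a plain dict; basic_stats is a hypothetical helper.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def basic_stats(examples: List[Dict[str, str]]) -> Dict[str, Any]:
    """Compute count, token totals, and label distribution for simple examples."""
    # Whitespace word count as a crude token proxy.
    counts = [
        len((ex.get("prompt", "") + " " + ex.get("completion", "")).split())
        for ex in examples
    ]
    labels = Counter(ex["label"] for ex in examples if ex.get("label"))
    return {
        "total_examples": len(counts),
        "total_tokens": sum(counts),
        "avg_tokens_per_example": mean(counts) if counts else 0.0,
        "min_tokens": min(counts, default=0),
        "max_tokens": max(counts, default=0),
        "label_distribution": dict(labels),
    }


examples = [
    {"prompt": "great movie", "completion": "positive", "label": "pos"},
    {"prompt": "terrible and boring plot", "completion": "negative", "label": "neg"},
    {"prompt": "loved it", "completion": "positive", "label": "pos"},
]
stats = basic_stats(examples)
print(stats)
```
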

kerb.fine_tuning.check_data_quality(dataset)[source]

Check dataset for quality issues.

Parameters:

dataset (TrainingDataset) – Dataset to check

Return type:

Dict[str, Any]

Returns:

Dictionary with quality metrics and issues

kerb.fine_tuning.create_training_config(model, n_epochs=3, batch_size=None, learning_rate_multiplier=None, **kwargs)[source]

Create training configuration.

Parameters:
  • model (str) – Base model name

  • n_epochs (int) – Number of training epochs

  • batch_size (Optional[int]) – Batch size (if None, provider determines automatically)

  • learning_rate_multiplier (Optional[float]) – Learning rate multiplier

  • **kwargs – Additional configuration options

Return type:

TrainingConfig

Returns:

TrainingConfig
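
One plausible way for a factory like this to treat **kwargs is to pass recognized field names through and collect the rest in metadata. That routing is an assumption, not documented behavior; ConfigSketch mirrors the documented TrainingConfig fields, and make_config is a hypothetical stand-in.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class ConfigSketch:
    """Stand-in mirroring the documented TrainingConfig fields and defaults."""

    model: str
    n_epochs: int = 3
    batch_size: Optional[int] = None
    learning_rate_multiplier: Optional[float] = None
    prompt_loss_weight: float = 0.01
    validation_file: Optional[str] = None
    suffix: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_config(model: str, **kwargs: Any) -> ConfigSketch:
    """Route known field names to the dataclass and everything else to metadata."""
    known = {f.name for f in fields(ConfigSketch)} - {"model", "metadata"}
    direct = {k: v for k, v in kwargs.items() if k in known}
    extra = {k: v for k, v in kwargs.items() if k not in known}
    return ConfigSketch(model=model, metadata=extra, **direct)


config = make_config("gpt-4o-mini", n_epochs=2, suffix="support-bot", run_id="exp-1")
print(config)
```
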

kerb.fine_tuning.estimate_training_time(dataset, n_epochs=3, batch_size=8)[source]

Estimate training duration.

Parameters:
  • dataset (TrainingDataset) – Training dataset

  • n_epochs (int) – Number of epochs

  • batch_size (int) – Batch size

Return type:

Dict[str, Any]

Returns:

Dictionary with time estimates
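
The step count behind a duration estimate follows directly from dataset size, batch size, and epochs. The sketch below shows that arithmetic; seconds_per_step is a placeholder throughput that depends entirely on model and hardware, and the function name and returned keys are hypothetical.

```python
import math
from typing import Dict


def estimate_steps_and_seconds(
    n_examples: int,
    n_epochs: int = 3,
    batch_size: int = 8,
    seconds_per_step: float = 1.0,  # placeholder, measure on real hardware
) -> Dict[str, float]:
    """Steps per epoch round up because the last batch may be partial."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * n_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "estimated_seconds": total_steps * seconds_per_step,
    }


timing = estimate_steps_and_seconds(1000, n_epochs=3, batch_size=8, seconds_per_step=0.5)
print(timing)
```
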
