Fine-Tuning Module
Fine-tuning utilities for preparing and managing LLM training data.
This module provides comprehensive tools for fine-tuning LLMs across multiple providers:
- Common Usage:
from kerb.fine_tuning import prepare_dataset, TrainingDataset
from kerb.fine_tuning import validate_dataset, analyze_dataset
- Submodules:
types - Core data classes and enums
dataset - Dataset preparation and manipulation
formats - Format conversion for different providers
jsonl - JSONL file utilities
validation - Dataset validation functions
quality - Data quality analysis
prompts - System prompt utilities
training - Training configuration and optimization
- Dataset Preparation:
prepare_dataset() - Main function to prepare datasets for fine-tuning
split_dataset() - Split data into train/validation/test sets
balance_dataset() - Balance dataset by label/category
augment_dataset() - Augment training data with variations
deduplicate_dataset() - Remove duplicate examples
sample_dataset() - Sample subset of dataset
shuffle_dataset() - Randomize dataset order
filter_dataset() - Filter dataset by criteria
- Format Conversion:
to_openai_format() - Convert to OpenAI fine-tuning format
to_anthropic_format() - Convert to Anthropic fine-tuning format
to_google_format() - Convert to Google AI fine-tuning format
to_huggingface_format() - Convert to HuggingFace format
to_generic_format() - Convert to generic JSONL format
from_csv() - Convert CSV to fine-tuning format
from_json() - Convert JSON to fine-tuning format
from_parquet() - Convert Parquet to fine-tuning format
- JSONL Utilities:
write_jsonl() - Write data to JSONL file
read_jsonl() - Read data from JSONL file
append_jsonl() - Append data to JSONL file
merge_jsonl() - Merge multiple JSONL files
validate_jsonl() - Validate JSONL file format
count_jsonl_lines() - Count lines in JSONL file
stream_jsonl() - Stream large JSONL files
- Validation:
validate_dataset() - Validate dataset for fine-tuning
validate_format() - Validate format for specific provider
check_token_limits() - Check if examples exceed token limits
validate_messages() - Validate message structure
estimate_training_tokens() - Estimate total training tokens
estimate_cost() - Estimate fine-tuning cost
validate_completion_format() - Validate completion-based format
validate_chat_format() - Validate chat-based format
- Data Quality:
analyze_dataset() - Analyze dataset statistics
check_data_quality() - Check for quality issues
detect_pii() - Detect personally identifiable information
compute_perplexity() - Compute perplexity with HuggingFace models
check_length_distribution() - Analyze token length distribution
detect_duplicates() - Find duplicate or near-duplicate examples
check_label_distribution() - Analyze label distribution
- System Prompts:
generate_system_prompt() - Generate system prompts from examples
extract_system_prompts() - Extract system prompts from dataset
standardize_system_prompts() - Standardize system prompts
optimize_system_prompt() - Optimize system prompt for task
- Training Utilities:
create_training_config() - Create training configuration
estimate_training_time() - Estimate training duration
calculate_optimal_batch_size() - Calculate optimal batch size
recommend_learning_rate() - Recommend learning rate
create_hyperparameter_grid() - Create hyperparameter search grid
- Data Classes:
TrainingExample - Single training example
TrainingDataset - Complete training dataset
ValidationResult - Validation results
DatasetStats - Dataset statistics
TrainingConfig - Training configuration
- Enums:
FineTuningProvider - Supported providers
DatasetFormat - Supported formats
SplitStrategy - Dataset split strategies
ValidationLevel - Validation strictness levels
- class kerb.fine_tuning.FineTuningProvider(*values)
Bases: Enum
Supported fine-tuning providers.
- OPENAI = 'openai'
- ANTHROPIC = 'anthropic'
- GOOGLE = 'google'
- HUGGINGFACE = 'huggingface'
- GENERIC = 'generic'
- class kerb.fine_tuning.DatasetFormat(*values)
Bases: Enum
Supported dataset formats.
- CHAT = 'chat'
- COMPLETION = 'completion'
- CLASSIFICATION = 'classification'
- INSTRUCTION = 'instruction'
- class kerb.fine_tuning.SplitStrategy(*values)
Bases: Enum
Dataset splitting strategies.
- RANDOM = 'random'
- STRATIFIED = 'stratified'
- TEMPORAL = 'temporal'
- HASH = 'hash'
- class kerb.fine_tuning.ValidationLevel(*values)
Bases: Enum
Validation strictness levels.
- STRICT = 'strict'
- MODERATE = 'moderate'
- LENIENT = 'lenient'
- class kerb.fine_tuning.TrainingExample(messages=None, prompt=None, completion=None, label=None, metadata=<factory>)
Bases: object
Represents a single training example.
- __init__(messages=None, prompt=None, completion=None, label=None, metadata=<factory>)
- class kerb.fine_tuning.TrainingDataset(examples, format, provider=None, metadata=<factory>)
Bases: object
Represents a complete training dataset.
- examples: List[TrainingExample]
- format: DatasetFormat
- provider: FineTuningProvider | None = None
- __init__(examples, format, provider=None, metadata=<factory>)
- class kerb.fine_tuning.ValidationResult(is_valid, errors=<factory>, warnings=<factory>, total_examples=0, valid_examples=0, invalid_examples=0)
Bases: object
Results from dataset validation.
- __init__(is_valid, errors=<factory>, warnings=<factory>, total_examples=0, valid_examples=0, invalid_examples=0)
- class kerb.fine_tuning.DatasetStats(total_examples=0, total_tokens=0, avg_tokens_per_example=0.0, min_tokens=0, max_tokens=0, label_distribution=<factory>, avg_prompt_tokens=0.0, avg_completion_tokens=0.0, duplicate_count=0, metadata=<factory>)
Bases: object
Statistics about a dataset.
- __init__(total_examples=0, total_tokens=0, avg_tokens_per_example=0.0, min_tokens=0, max_tokens=0, label_distribution=<factory>, avg_prompt_tokens=0.0, avg_completion_tokens=0.0, duplicate_count=0, metadata=<factory>)
- class kerb.fine_tuning.TrainingConfig(model, n_epochs=3, batch_size=None, learning_rate_multiplier=None, prompt_loss_weight=0.01, validation_file=None, suffix=None, metadata=<factory>)
Bases: object
Training configuration for fine-tuning.
- __init__(model, n_epochs=3, batch_size=None, learning_rate_multiplier=None, prompt_loss_weight=0.01, validation_file=None, suffix=None, metadata=<factory>)
- kerb.fine_tuning.prepare_dataset(data, format=DatasetFormat.CHAT, provider=None, validate=True, deduplicate=True, shuffle=True)
Prepare dataset for fine-tuning.
- Parameters:
data (Union[List[dict], TrainingDataset]) – Raw data as list of dicts or TrainingDataset
format (DatasetFormat) – Dataset format
provider (Optional[FineTuningProvider]) – Target provider
validate (bool) – Whether to validate dataset
deduplicate (bool) – Whether to remove duplicates
shuffle (bool) – Whether to shuffle examples
- Returns:
Prepared dataset
- Return type:
Examples
>>> data = [
...     {"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]},
...     {"messages": [{"role": "user", "content": "Bye"}, {"role": "assistant", "content": "Goodbye!"}]}
... ]
>>> dataset = prepare_dataset(data, format=DatasetFormat.CHAT)
- kerb.fine_tuning.split_dataset(dataset, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, strategy=SplitStrategy.RANDOM, seed=None)
Split dataset into train/validation/test sets.
- Parameters:
dataset (TrainingDataset) – Dataset to split
train_ratio (float) – Proportion for training
val_ratio (float) – Proportion for validation
test_ratio (float) – Proportion for testing
strategy (SplitStrategy) – Splitting strategy
- Return type:
- Returns:
Tuple of (train_dataset, val_dataset, test_dataset)
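For intuition, a seeded random split with these ratios can be sketched in plain Python. This illustrates the general technique only and is not kerb's implementation: `random_split` is a hypothetical helper, and the way it truncates split boundaries is an assumption.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T],
    train_ratio: float = 0.8,
    val_ratio: float = 0.1,
    test_ratio: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of items and cut it into train/validation/test lists."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * train_ratio)
    val_end = train_end + int(len(shuffled) * val_ratio)
    # Whatever remains after the first two cuts becomes the test set.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = random_split(list(range(100)), seed=42)
print(len(train), len(val), len(test))  # 80 10 10
```

Passing the same seed yields the same partition, which is what makes a split comparable across runs.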
- kerb.fine_tuning.balance_dataset(dataset, method='undersample', target_count=None)
Balance dataset by label distribution.
- Parameters:
dataset (TrainingDataset) – Dataset to balance
method (Union[BalanceMethod, str]) – Balancing method (BalanceMethod enum or string: 'undersample', 'oversample', 'smote', 'none')
target_count (Optional[int]) – Target count per label (if None, uses minority class for undersample or majority for oversample)
- Returns:
Balanced dataset
- Return type:
Examples
>>> from kerb.core.enums import BalanceMethod
>>> balanced = balance_dataset(dataset, method=BalanceMethod.UNDERSAMPLE)
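As a rough picture of what undersampling does, the sketch below trims every label group down to the size of the smallest one. It works on plain dicts with a `label` key, which is an assumption made for the illustration; `undersample` is a hypothetical helper and not the library's code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def undersample(
    examples: List[dict], target_count: Optional[int] = None, seed: int = 0
) -> List[dict]:
    """Keep at most target_count randomly chosen examples per label."""
    by_label: Dict[str, List[dict]] = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)
    # With no explicit target, fall back to the minority class size.
    if target_count is None:
        target_count = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced: List[dict] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min(target_count, len(group))))
    return balanced


examples = [{"prompt": f"p{i}", "label": "pos"} for i in range(6)] + [
    {"prompt": f"n{i}", "label": "neg"} for i in range(2)
]
balanced = undersample(examples)
print(len(balanced))  # 4
```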
- kerb.fine_tuning.deduplicate_dataset(dataset, similarity_threshold=1.0)
Remove duplicate examples from dataset.
- Parameters:
dataset (TrainingDataset) – Dataset to deduplicate
similarity_threshold (float) – Threshold for considering examples duplicates (1.0 = exact match)
- Returns:
Deduplicated dataset
- Return type:
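The exact-match case (similarity_threshold=1.0) can be pictured as hashing a canonical form of each example and keeping the first occurrence. The sketch below assumes examples are JSON-serializable dicts; `dedupe_exact` is a hypothetical helper, and near-duplicate detection below 1.0 is not covered.

```python
import json
from typing import List


def dedupe_exact(examples: List[dict]) -> List[dict]:
    """Drop later copies of identical examples, preserving original order."""
    seen = set()
    unique: List[dict] = []
    for example in examples:
        # sort_keys makes the key independent of dict insertion order.
        key = json.dumps(example, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


examples = [
    {"prompt": "Hello", "completion": "Hi!"},
    {"completion": "Hi!", "prompt": "Hello"},  # same content, different key order
    {"prompt": "Bye", "completion": "Goodbye!"},
]
unique = dedupe_exact(examples)
print(len(unique))  # 2
```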
- kerb.fine_tuning.sample_dataset(dataset, n, seed=None)
Sample subset of dataset.
- Parameters:
dataset (TrainingDataset) – Dataset to sample from
n (int) – Number of examples to sample
- Returns:
Sampled dataset
- Return type:
- kerb.fine_tuning.filter_dataset(dataset, filter_fn)
Filter dataset by custom criteria.
- Parameters:
dataset (TrainingDataset) – Dataset to filter
filter_fn (Callable[[TrainingExample], bool]) – Function that returns True for examples to keep
- Returns:
Filtered dataset
- Return type:
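A filter_fn is simply a predicate over one example. The sketch below shows one possible predicate, using a local stand-in dataclass that mirrors the documented TrainingExample fields so that it runs without the library. The `Example` class, `has_usable_completion`, and the 200-character limit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Example:
    """Local stand-in mirroring the documented TrainingExample fields."""

    messages: Optional[List[Dict[str, str]]] = None
    prompt: Optional[str] = None
    completion: Optional[str] = None
    label: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def has_usable_completion(example: Example, max_chars: int = 200) -> bool:
    """Return True when the completion is non-blank and not overly long."""
    completion = (example.completion or "").strip()
    return 0 < len(completion) <= max_chars


examples = [
    Example(prompt="Translate 'cat'", completion="chat"),
    Example(prompt="Translate 'dog'", completion="   "),  # blank, dropped
    Example(prompt="Summarize", completion="x" * 500),  # too long, dropped
]
kept = [ex for ex in examples if has_usable_completion(ex)]
print(len(kept))  # 1
```

A predicate with the same shape could presumably be passed as filter_fn, since the documented type is Callable[[TrainingExample], bool].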
- kerb.fine_tuning.to_openai_format(dataset)
Convert dataset to OpenAI fine-tuning format.
OpenAI format: {"messages": [{"role": "system/user/assistant", "content": "..."}]}
- Parameters:
dataset (TrainingDataset) – Dataset to convert
- Return type:
- Returns:
List of examples in OpenAI format
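To make the record shape above concrete, the sketch below builds one chat record, checks it against that shape, and serializes it as a single JSONL line using only the standard library. `is_chat_record` is a hypothetical checker written for this illustration; it is not one of the module's validators.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def is_chat_record(record: dict) -> bool:
    """Check for a non-empty 'messages' list of {role, content} dicts."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi!"},
    ]
}

# One JSON object per line is what makes a file JSONL.
line = json.dumps(record, ensure_ascii=False)
print(is_chat_record(json.loads(line)))  # True
```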
- kerb.fine_tuning.to_anthropic_format(dataset)
Convert dataset to Anthropic fine-tuning format.
- Parameters:
dataset (TrainingDataset) – Dataset to convert
- Return type:
- Returns:
List of examples in Anthropic format
- kerb.fine_tuning.from_csv(filepath, prompt_column, completion_column=None, label_column=None, format=DatasetFormat.COMPLETION)
Convert CSV file to training dataset.
- Parameters:
- Return type:
- Returns:
TrainingDataset
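The column arguments suggest a mapping from CSV columns to example fields. The sketch below shows that mapping with the standard csv module, producing plain dicts rather than a TrainingDataset. `rows_to_examples` is a hypothetical helper, and the output keys (`prompt`, `completion`, `label`) are assumed from the TrainingExample fields.

```python
import csv
import io
from typing import List, Optional, TextIO


def rows_to_examples(
    handle: TextIO,
    prompt_column: str,
    completion_column: Optional[str] = None,
    label_column: Optional[str] = None,
) -> List[dict]:
    """Map selected CSV columns onto prompt/completion/label keys."""
    examples = []
    for row in csv.DictReader(handle):
        example = {"prompt": row[prompt_column]}
        if completion_column is not None:
            example["completion"] = row[completion_column]
        if label_column is not None:
            example["label"] = row[label_column]
        examples.append(example)
    return examples


csv_text = "question,answer,topic\nWhat is 2+2?,4,math\nCapital of France?,Paris,geo\n"
examples = rows_to_examples(io.StringIO(csv_text), "question", "answer", "topic")
print(examples[0])  # {'prompt': 'What is 2+2?', 'completion': '4', 'label': 'math'}
```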
- kerb.fine_tuning.from_json(filepath, format=DatasetFormat.CHAT)
Convert JSON file to training dataset.
- Parameters:
filepath (str) – Path to JSON file
format (DatasetFormat) – Target format
- Return type:
- Returns:
TrainingDataset
- kerb.fine_tuning.write_jsonl(data, filepath, compress=False, compression_type='gz', buffer_size=8192)
Write data to JSONL file with optional compression.
- Parameters:
- kerb.fine_tuning.read_jsonl(filepath, max_lines=None, skip_invalid=False)
Read data from JSONL file with automatic compression detection.
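One way such a reader could behave is sketched below with the standard library. It assumes that compression detection means checking for the gzip magic bytes, that max_lines caps the number of records returned, and that skip_invalid silently drops lines that fail to parse. None of these details are confirmed by this page, and `iter_jsonl` is a hypothetical name.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator, Optional


def iter_jsonl(
    filepath: str, max_lines: Optional[int] = None, skip_invalid: bool = False
) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a plain or gzip JSONL file."""
    with open(filepath, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else open
    yielded = 0
    with opener(filepath, "rt", encoding="utf-8") as handle:
        for line in handle:
            if max_lines is not None and yielded >= max_lines:
                break
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                if skip_invalid:
                    continue
                raise
            yielded += 1
            yield record


# Demo: a small gzip-compressed file that contains one malformed line.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write('{"prompt": "a", "completion": "b"}\n')
    out.write("not json\n")
    out.write('{"prompt": "c", "completion": "d"}\n')

records = list(iter_jsonl(path, skip_invalid=True))
print(len(records))  # 2
```

Yielding records lazily keeps memory use flat on large files, which is the same motivation behind a streaming reader.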
- kerb.fine_tuning.validate_dataset(dataset, level=ValidationLevel.MODERATE, max_tokens=None)
Validate dataset for fine-tuning.
- Parameters:
dataset (TrainingDataset) – Dataset to validate
level (ValidationLevel) – Validation strictness
- Return type:
- Returns:
ValidationResult
- kerb.fine_tuning.estimate_cost(dataset, model='gpt-4o-mini', n_epochs=3)
Estimate fine-tuning cost.
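Cost estimates of this kind usually reduce to tokens times epochs times a per-token price. The sketch below shows that arithmetic only. The rate is a made-up placeholder and not a real price for any model, and how the library counts tokens or looks up pricing is not documented on this page.

```python
def estimate_cost_usd(
    total_tokens: int,
    n_epochs: int = 3,
    usd_per_million_tokens: float = 3.0,  # hypothetical placeholder rate
) -> float:
    """Multiply trained tokens (dataset tokens x epochs) by a per-token rate."""
    billed_tokens = total_tokens * n_epochs
    return billed_tokens / 1_000_000 * usd_per_million_tokens


# 200,000 dataset tokens over 3 epochs is 600,000 trained tokens.
cost = estimate_cost_usd(total_tokens=200_000, n_epochs=3)
print(f"{cost:.2f}")  # 1.80
```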
- kerb.fine_tuning.analyze_dataset(dataset)
Analyze dataset statistics.
- Parameters:
dataset (TrainingDataset) – Dataset to analyze
- Return type:
- Returns:
DatasetStats with comprehensive statistics
- kerb.fine_tuning.check_data_quality(dataset)
Check dataset for quality issues.
- Parameters:
dataset (TrainingDataset) – Dataset to check
- Return type:
- Returns:
Dictionary with quality metrics and issues
- kerb.fine_tuning.create_training_config(model, n_epochs=3, batch_size=None, learning_rate_multiplier=None, **kwargs)
Create training configuration.
- Parameters:
- Return type:
- Returns:
TrainingConfig
- kerb.fine_tuning.estimate_training_time(dataset, n_epochs=3, batch_size=8)
Estimate training duration.
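A duration estimate typically starts from the number of optimizer steps implied by the dataset size, batch size, and epoch count. The sketch below computes that step count and scales it by a per-step time. The 0.5 seconds per step is a hypothetical figure, and the formula the library actually uses is not shown on this page.

```python
import math


def training_steps(n_examples: int, n_epochs: int = 3, batch_size: int = 8) -> int:
    """Count optimizer steps, rounding the last partial batch of each epoch up."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * n_epochs


total_steps = training_steps(1000)  # ceil(1000 / 8) = 125 steps per epoch
seconds_per_step = 0.5  # hypothetical throughput, measure on real hardware
estimated_seconds = total_steps * seconds_per_step
print(total_steps, estimated_seconds)  # 375 187.5
```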