Parsing Module
Parsing utilities for LLM output processing and validation.
This module provides tools for extracting, repairing, and validating structured data in LLM outputs, grouped as follows:
- Enums:
ParseMode - Parsing strictness mode (STRICT, LENIENT, BEST_EFFORT)
ValidationLevel - Validation strictness level
- Data Classes:
ParseResult - Result from parsing operation with data and metadata
ValidationResult - Result from validation with errors and warnings
- JSON Parsing:
extract_json() - Extract JSON from text with markdown/text artifacts
parse_json() - Parse JSON with automatic fixing
fix_json() - Fix common JSON formatting issues
extract_json_array() - Extract and validate JSON array
extract_json_object() - Extract and validate JSON object
ensure_json_output() - Extract JSON with fallback default
ensure_list_output() - Extract list with fallback default
ensure_dict_output() - Extract dict with fallback default
- Schema Validation:
validate_json_schema() - Validate against JSON Schema
- Pydantic Integration:
parse_to_pydantic() - Parse text to Pydantic model instance
pydantic_to_schema() - Convert Pydantic model to JSON Schema
validate_pydantic() - Validate data against Pydantic model
pydantic_to_function() - Convert Pydantic model to function definition
- Function Calling / Tool Use:
format_function_call() - Format function definition for LLMs
format_tool_call() - Format tool definition (OpenAI format)
parse_function_call() - Parse function call from LLM output
format_function_result() - Format function result for LLM
- Code and Text Extraction:
extract_code_blocks() - Extract code blocks from markdown
extract_xml_tag() - Extract content from XML-style tags
extract_markdown_sections() - Extract sections by heading
extract_list_items() - Extract list items from markdown
parse_markdown_table() - Parse markdown table to dicts
- Output Validation:
validate_output() - Comprehensive output validation
retry_parse_with_fixes() - Retry parsing with progressive fixes
- Utilities:
clean_llm_output() - Clean common LLM output artifacts
- Usage Examples:
# Common imports
from kerb.parsing import parse_json, extract_code_blocks
# Submodule imports for specialized use
from kerb.parsing.json import fix_json
from kerb.parsing.pydantic import parse_to_pydantic
from kerb.parsing.validation import validate_output
For splitting by delimiter, use: text.split(delimiter)
For text truncation, use the preprocessing module:
from kerb.preprocessing import truncate_text
- class kerb.parsing.ParseMode(*values)
Bases: Enum
Parsing mode for extracting structured data.
- STRICT = 'strict'
- LENIENT = 'lenient'
- BEST_EFFORT = 'best_effort'
- class kerb.parsing.ValidationLevel(*values)
Bases: Enum
Validation strictness level.
- NONE = 'none'
- BASIC = 'basic'
- SCHEMA = 'schema'
- STRICT = 'strict'
- class kerb.parsing.ParseResult(success, data=None, error=None, fixed=False, original=None, warnings=<factory>)
Bases: object
Result from parsing operation.
- __init__(success, data=None, error=None, fixed=False, original=None, warnings=<factory>)
- class kerb.parsing.ValidationResult(valid, errors=<factory>, warnings=<factory>, data=None)
Bases: object
Result from validation operation.
- __init__(valid, errors=<factory>, warnings=<factory>, data=None)
- kerb.parsing.extract_json(text, mode=ParseMode.LENIENT)
Extract JSON from text that may contain additional content.
This function extracts JSON objects or arrays from LLM outputs that may include markdown formatting, explanatory text, or other artifacts.
- Parameters:
text (str) – Text that may contain JSON
mode (ParseMode) – Parsing mode
- Returns:
Parsed JSON data and metadata
- Return type:
ParseResult
Examples
>>> extract_json('Here is the data: {"name": "John", "age": 30}')
ParseResult(success=True, data={'name': 'John', 'age': 30}, ...)
>>> extract_json('```json\n{"key": "value"}\n```')
ParseResult(success=True, data={'key': 'value'}, ...)
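One way to pull JSON out of surrounding prose can be sketched with the standard library alone. This is an illustration of the general technique, not kerb's implementation: the helper name `find_first_json` is hypothetical, and it returns the bare value rather than a ParseResult.

```python
import json
from typing import Any, Optional


def find_first_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array embedded in text, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this bracket; try the next one
        return value
    return None


print(find_first_json('Here is the data: {"name": "John", "age": 30}'))
# {'name': 'John', 'age': 30}
```

Scanning for an opening bracket and letting the decoder find the end avoids hand-written brace matching, which tends to break on braces inside string values.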
- kerb.parsing.parse_json(text, mode=ParseMode.LENIENT)
Parse JSON with automatic fixing for common LLM output issues.
- Parameters:
text (str) – JSON text to parse
mode (ParseMode) – Parsing mode
- Returns:
Parsed JSON data and metadata
- Return type:
ParseResult
- kerb.parsing.fix_json(text)
Attempt to fix common JSON formatting issues in LLM outputs.
Common fixes:
  - Remove trailing commas
  - Fix single quotes to double quotes
  - Remove comments
  - Fix missing/extra brackets
  - Handle truncated JSON
- Parameters:
text (str) – Potentially malformed JSON text
- Returns:
Fixed and parsed JSON if successful
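As an illustration of the first fix in the list above, trailing commas can be removed with a single regular expression before retrying the parse. This is a minimal sketch, not the code behind fix_json; the names `fix_trailing_commas` and `loads_with_fix` are hypothetical.

```python
import json
import re
from typing import Any


def fix_trailing_commas(text: str) -> str:
    """Drop commas that sit directly before a closing } or ]."""
    return re.sub(r",\s*([}\]])", r"\1", text)


def loads_with_fix(text: str) -> Any:
    """Try strict parsing first, then retry after removing trailing commas."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(fix_trailing_commas(text))


print(loads_with_fix('{"items": [1, 2, 3,], "ok": true,}'))
# {'items': [1, 2, 3], 'ok': True}
```

The regular expression does not know about string boundaries, so a value such as ",]" inside a quoted string would also be rewritten. Parsing strictly first keeps already-valid input untouched.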
- kerb.parsing.extract_json_array(text, mode=ParseMode.LENIENT)
Extract a JSON array from text.
- Parameters:
text (str) – Text that may contain a JSON array
mode (ParseMode) – Parsing mode
- Returns:
Parsed JSON array
- kerb.parsing.extract_json_object(text, mode=ParseMode.LENIENT)
Extract a JSON object from text.
- Parameters:
text (str) – Text that may contain a JSON object
mode (ParseMode) – Parsing mode
- Returns:
Parsed JSON object
- kerb.parsing.ensure_json_output(text, default=None)
Extract JSON from text, returning default if parsing fails.
- kerb.parsing.ensure_list_output(text, default=None)
Extract JSON array from text, returning default if parsing fails.
- kerb.parsing.ensure_dict_output(text, default=None)
Extract JSON object from text, returning default if parsing fails.
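The fallback-default pattern shared by the three ensure_* functions can be sketched as follows. The helper names are hypothetical, and treating a value of the wrong JSON type as a failure is an assumption rather than documented behavior.

```python
import json
from typing import Any, Optional


def json_or_default(text: str, default: Optional[Any] = None) -> Any:
    """Return parsed JSON, or `default` when the text does not parse."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return default


def list_or_default(text: str, default: Optional[list] = None) -> Any:
    """Return a parsed JSON array, or `default` for anything else."""
    value = json_or_default(text)
    # Assumption: valid JSON of the wrong type also falls back to the default.
    return value if isinstance(value, list) else default


print(list_or_default("[1, 2]", default=[]))
print(list_or_default('{"a": 1}', default=[]))
```

Callers that always receive a usable value can skip success checks, at the cost of no longer being able to tell a parse failure from a legitimately empty result.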
- kerb.parsing.validate_json_schema(data, schema)
Validate data against a JSON Schema.
- Parameters:
data (Any) – Data to validate
schema (Dict[str, Any]) – JSON Schema to validate against
- Returns:
Validation result with any errors
- Return type:
ValidationResult
Examples
>>> schema = {"type": "object", "properties": {"name": {"type": "string"}}}
>>> validate_json_schema({"name": "John"}, schema)
ValidationResult(valid=True, errors=[], data={'name': 'John'})
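To make concrete what a schema check involves, here is a deliberately small validator covering only the `type`, `required`, and `properties` keywords. It is a hand-rolled sketch under those limits, not a JSON Schema implementation and not kerb's code; `check_schema` is a hypothetical name and it returns a plain list of error strings.

```python
from typing import Any, Dict, List

_JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "boolean": bool,
    "number": (int, float),
    "integer": int,
}


def check_schema(data: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of error messages; an empty list means the data passed."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected in _JSON_TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(data, bool) and expected in ("number", "integer")
        if not isinstance(data, _JSON_TYPES[expected]) or bool_as_number:
            errors.append(f"{path}: expected {expected}, got {type(data).__name__}")
            return errors
    if isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in data:
                errors.extend(check_schema(data[key], subschema, f"{path}.{key}"))
    return errors


schema = {"type": "object", "properties": {"name": {"type": "string"}}}
print(check_schema({"name": "John"}, schema))  # []
print(check_schema({"name": 42}, schema))
```

Collecting every error with its path, instead of stopping at the first one, gives the kind of information an errors list is meant to carry.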
- kerb.parsing.parse_to_pydantic(text, model_class, mode=ParseMode.LENIENT)
Parse text to a Pydantic model instance.
- Parameters:
text (str) – Text containing the JSON to parse
model_class (Type) – Pydantic model class to instantiate
mode (ParseMode) – Parsing mode
- Returns:
Parse result whose data is the Pydantic model instance
- Return type:
ParseResult
Examples
>>> from pydantic import BaseModel
>>> class User(BaseModel):
...     name: str
...     age: int
>>> parse_to_pydantic('{"name": "John", "age": 30}', User)
ParseResult(success=True, data=User(name='John', age=30), ...)
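- kerb.parsing.validate_pydantic(data, model_class)
Validate data against a Pydantic model.
- Parameters:
data (Any) – Data to validate
model_class (Type) – Pydantic model class to validate against
- Returns:
Validation result with any errors
- Return type:
ValidationResult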
- kerb.parsing.pydantic_to_function(model_class, name=None, description=None)
Convert a Pydantic model to a function calling definition.
- kerb.parsing.format_function_call(name, description, parameters, required=None)
Format a function definition for LLM function calling.
- Parameters:
name (str) – Function name
description (str) – Function description
parameters (Dict[str, Any]) – Parameter definitions keyed by parameter name
required (Optional[List[str]]) – Names of required parameters
- Returns:
Formatted function definition
Examples
>>> format_function_call(
...     name="get_weather",
...     description="Get weather for a location",
...     parameters={"location": {"type": "string"}},
...     required=["location"]
... )
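The example above does not show a return value. A function definition in the widely used JSON-Schema-based layout could be assembled as sketched here; whether format_function_call returns exactly this shape is an assumption, and `build_function_definition` is a hypothetical name.

```python
import json
from typing import Any, Dict, List, Optional


def build_function_definition(
    name: str,
    description: str,
    parameters: Dict[str, Any],
    required: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Wrap per-parameter schemas in an object schema (assumed layout)."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": parameters,
            "required": list(required or []),
        },
    }


definition = build_function_definition(
    name="get_weather",
    description="Get weather for a location",
    parameters={"location": {"type": "string"}},
    required=["location"],
)
print(json.dumps(definition, indent=2))
```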
- kerb.parsing.format_tool_call(name, description, parameters, required=None)
Format a tool definition for LLM tool use (OpenAI format).
- kerb.parsing.parse_function_call(text, mode=ParseMode.LENIENT)
Parse a function call from LLM output.
Extracts function name and arguments from various formats:
  - JSON format: {"name": "func", "arguments": {...}}
  - Plain format: func(arg1=val1, arg2=val2)
  - Markdown format with code blocks
- Parameters:
text (str) – LLM output containing a function call
mode (ParseMode) – Parsing mode
- Returns:
Parsed function call with name and arguments
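The JSON and plain formats listed above can both be handled with the standard library, as in this sketch. It is not kerb's parser: `parse_call` is a hypothetical name, it returns a (name, arguments) tuple or None, it only accepts keyword arguments with literal values, and it ignores positional arguments.

```python
import ast
import json
from typing import Any, Dict, Optional, Tuple


def parse_call(text: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Parse a call written as JSON or as func(arg=value, ...)."""
    text = text.strip()
    # JSON format: {"name": "func", "arguments": {...}}
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and "name" in payload:
        return payload["name"], payload.get("arguments", {})
    # Plain format: func(arg1=val1, arg2=val2)
    try:
        node = ast.parse(text, mode="eval").body
    except SyntaxError:
        return None
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        try:
            # literal_eval accepts only literals, so no code is executed.
            args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords if kw.arg}
        except ValueError:
            return None
        return node.func.id, args
    return None


print(parse_call('{"name": "get_weather", "arguments": {"location": "Paris"}}'))
print(parse_call("get_weather(location='Paris', days=3)"))
```

Using `ast` rather than `eval` means the model's output is inspected as syntax and never run.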
- kerb.parsing.format_function_result(result, name=None)
Format a function result for returning to the LLM.
- kerb.parsing.extract_code_blocks(text, language=None)
Extract code blocks from markdown text.
- Parameters:
text (str) – Markdown text
language (Optional[str]) – Language of the blocks to extract, if given
- Returns:
List of code blocks with 'language' and 'code' keys
- Return type:
List[Dict[str, str]]
Examples
>>> extract_code_blocks('```python\nprint("hello")\n```')
[{'language': 'python', 'code': 'print("hello")'}]
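Fenced blocks can be located with one regular expression, as in this sketch. It illustrates the technique and is not kerb's implementation; `find_code_blocks` is a hypothetical name, and treating `language` as an exact-match filter is an assumption.

```python
import re
from typing import Dict, List, Optional

TICKS = "`" * 3  # built at runtime so this sketch can sit inside a fence
_FENCE = re.compile(TICKS + r"([\w+-]*)[ \t]*\n(.*?)\n?" + TICKS, re.DOTALL)


def find_code_blocks(text: str, language: Optional[str] = None) -> List[Dict[str, str]]:
    """Return each fenced block as {'language': ..., 'code': ...}."""
    blocks = []
    for match in _FENCE.finditer(text):
        lang, code = match.group(1), match.group(2)
        # Assumption: `language`, when given, must match the fence tag exactly.
        if language is None or lang == language:
            blocks.append({"language": lang, "code": code})
    return blocks


sample = TICKS + 'python\nprint("hello")\n' + TICKS
print(find_code_blocks(sample))
# [{'language': 'python', 'code': 'print("hello")'}]
```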
- kerb.parsing.extract_xml_tag(text, tag)
Extract content from XML-style tags.
- Parameters:
text (str) – Text containing XML-style tags
tag (str) – Tag name to extract
- Returns:
List of tag contents
- Return type:
List[str]
Examples
>>> extract_xml_tag('<answer>42</answer>', 'answer')
['42']
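For LLM output, tag extraction is usually done with a regular expression rather than an XML parser, because the surrounding text is rarely well-formed XML. A minimal sketch under that assumption, with the hypothetical name `find_tag_contents`:

```python
import re
from typing import List


def find_tag_contents(text: str, tag: str) -> List[str]:
    """Return the inner text of every <tag>...</tag> pair, in order."""
    name = re.escape(tag)
    # Allow optional attributes on the opening tag; match lazily so that
    # two sibling tags are not merged into one result.
    pattern = re.compile(rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>", re.DOTALL)
    return pattern.findall(text)


print(find_tag_contents("<answer>42</answer>", "answer"))  # ['42']
```

Nested tags of the same name are not handled: the lazy match stops at the first closing tag.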
- kerb.parsing.extract_markdown_sections(text, heading_level=2)
Extract sections from markdown by heading level.
- kerb.parsing.extract_list_items(text, ordered=False)
Extract list items from markdown text.
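List extraction comes down to recognizing the item marker at the start of a line. The sketch below assumes that `ordered=False` selects bullet markers (-, *, +) and `ordered=True` selects numbered markers; that reading of the flag, and the name `find_list_items`, are assumptions.

```python
import re
from typing import List


def find_list_items(text: str, ordered: bool = False) -> List[str]:
    """Return the text of each markdown list item, without its marker."""
    marker = r"\d+[.)]" if ordered else r"[-*+]"
    # The marker must be followed by whitespace, so "**bold**" is not an item.
    pattern = re.compile(rf"^\s*{marker}\s+(.+?)\s*$", re.MULTILINE)
    return pattern.findall(text)


notes = "- apples\n- pears\n1. wash\n2. slice"
print(find_list_items(notes))                # ['apples', 'pears']
print(find_list_items(notes, ordered=True))  # ['wash', 'slice']
```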
- kerb.parsing.parse_markdown_table(text)
Parse a markdown table into a list of dictionaries.
- Parameters:
text (str) – Markdown table text
- Returns:
List of rows as dictionaries
- Return type:
List[Dict[str, str]]
Examples
>>> table = '''
... | Name | Age |
... |------|-----|
... | John | 30 |
... | Jane | 25 |
... '''
>>> parse_markdown_table(table)
[{'Name': 'John', 'Age': '30'}, {'Name': 'Jane', 'Age': '25'}]
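The table-to-dictionaries conversion shown in the example can be sketched in a few lines. This version assumes every row starts with a pipe and that the second table line is the separator row; it does not handle escaped pipes inside cells. `table_to_dicts` is a hypothetical name, not the library function.

```python
from typing import Dict, List


def _split_row(line: str) -> List[str]:
    """Split one table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def table_to_dicts(text: str) -> List[Dict[str, str]]:
    """Map each data row to a dict keyed by the header cells."""
    lines = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = _split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, _split_row(line))) for line in lines[2:]]


table = """
| Name | Age |
|------|-----|
| John | 30 |
| Jane | 25 |
"""
print(table_to_dicts(table))
# [{'Name': 'John', 'Age': '30'}, {'Name': 'Jane', 'Age': '25'}]
```

All values stay strings, matching the example output above; converting '30' to a number would be a separate step.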
- kerb.parsing.validate_output(text, output_type, schema=None, model_class=None, custom_validator=None)
Validate LLM output against expected format.
- Parameters:
text (str) – LLM output text
output_type (str) – Expected type ('json', 'json_array', 'json_object', 'pydantic', 'code', etc.)
schema (Optional[Dict[str, Any]]) – JSON Schema for validation
model_class (Optional[Type]) – Pydantic model class for validation
custom_validator (Optional[Callable[[Any], bool]]) – Custom validation function
- Returns:
Validation result with errors/warnings
- Return type:
ValidationResult
- kerb.parsing.retry_parse_with_fixes(text, parser_func, max_attempts=3)
Retry parsing with increasingly aggressive fixes.
- Parameters:
text (str) – Text to parse
parser_func (Callable[[str], ParseResult]) – Parser function to use
max_attempts (int) – Maximum retry attempts
- Returns:
Final parse result
- Return type:
ParseResult
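The idea of retrying with increasingly aggressive fixes can be sketched as a loop over an ordered list of text transforms. Everything here is illustrative: `SketchResult` is a stand-in for ParseResult, the particular fixes and their order are assumptions, and `max_attempts` is read as the number of fixes to try after the first plain attempt.

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class SketchResult:  # hypothetical stand-in for ParseResult
    success: bool
    data: Any = None
    error: Optional[str] = None


def strict_parser(text: str) -> SketchResult:
    try:
        return SketchResult(True, json.loads(text))
    except json.JSONDecodeError as exc:
        return SketchResult(False, error=str(exc))


# Fixes ordered from least to most invasive (an assumed ordering).
FIXES: List[Callable[[str], str]] = [
    str.strip,
    lambda t: re.sub(r",\s*([}\]])", r"\1", t),  # drop trailing commas
    lambda t: t.replace("'", '"'),  # single quotes to double quotes
]


def retry_with_fixes(
    text: str, parser: Callable[[str], SketchResult], max_attempts: int = 3
) -> SketchResult:
    result = parser(text)
    for fix in FIXES[:max_attempts]:
        if result.success:
            break
        text = fix(text)  # fixes accumulate from one attempt to the next
        result = parser(text)
    return result


print(retry_with_fixes("{'a': [1, 2,],}", strict_parser))
```

Applying the mildest fix first means well-formed or nearly well-formed input is altered as little as possible before it parses.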
- kerb.parsing.clean_llm_output(text)
Clean common artifacts from LLM outputs.
Removes:
  - Markdown code blocks
  - Leading/trailing whitespace
  - Common prefixes like “Here is…” or “Sure, here’s…”
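The three clean-ups listed above can be approximated with two regular expressions. This sketch is not the library's implementation: `clean_output` is a hypothetical name, the prefix pattern only covers introductions that end in a colon, and a fence is unwrapped only when it spans the whole remaining text.

```python
import re

TICKS = "`" * 3  # built at runtime so this sketch can sit inside a fence
_FENCE = re.compile(
    r"^" + TICKS + r"[\w+-]*[ \t]*\n(.*?)\n?" + TICKS + r"$", re.DOTALL
)
# Matches "Here is ...:" and "Sure, here's ...:" at the very start.
_PREFIX = re.compile(r"^(?:sure,\s*)?here(?:['’]s| is)\b[^\n:]*:\s*", re.IGNORECASE)


def clean_output(text: str) -> str:
    """Strip whitespace, a leading introduction, and a wrapping code fence."""
    text = text.strip()
    text = _PREFIX.sub("", text, count=1).strip()
    match = _FENCE.match(text)
    if match:
        text = match.group(1).strip()
    return text


raw = "Sure, here's the JSON:\n" + TICKS + 'json\n{"a": 1}\n' + TICKS + "\n"
print(clean_output(raw))  # {"a": 1}
```

Requiring the colon keeps the prefix rule conservative, so a real answer that merely begins with "Here is" is left alone.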