Parsing Module

Parsing utilities for LLM output processing and validation.

This module provides tools for extracting, repairing, and validating structured data in LLM outputs: JSON, Pydantic models, function calls, code blocks, and markdown structures.

Enums:

ParseMode - Parsing strictness mode (STRICT, LENIENT, BEST_EFFORT)
ValidationLevel - Validation strictness level

Data Classes:

ParseResult - Result from parsing operation with data and metadata
ValidationResult - Result from validation with errors and warnings

JSON Parsing:

extract_json() - Extract JSON from text with markdown/text artifacts
parse_json() - Parse JSON with automatic fixing
fix_json() - Fix common JSON formatting issues
extract_json_array() - Extract and validate JSON array
extract_json_object() - Extract and validate JSON object
ensure_json_output() - Extract JSON with fallback default
ensure_list_output() - Extract list with fallback default
ensure_dict_output() - Extract dict with fallback default

Schema Validation:

validate_json_schema() - Validate against JSON Schema

Pydantic Integration:

parse_to_pydantic() - Parse text to Pydantic model instance
pydantic_to_schema() - Convert Pydantic model to JSON Schema
validate_pydantic() - Validate data against Pydantic model
pydantic_to_function() - Convert Pydantic model to function definition

Function Calling / Tool Use:

format_function_call() - Format function definition for LLMs
format_tool_call() - Format tool definition (OpenAI format)
parse_function_call() - Parse function call from LLM output
format_function_result() - Format function result for LLM

Code and Text Extraction:

extract_code_blocks() - Extract code blocks from markdown
extract_xml_tag() - Extract content from XML-style tags
extract_markdown_sections() - Extract sections by heading
extract_list_items() - Extract list items from markdown
parse_markdown_table() - Parse markdown table to dicts

Output Validation:

validate_output() - Comprehensive output validation
retry_parse_with_fixes() - Retry parsing with progressive fixes

Utilities:

clean_llm_output() - Clean common LLM output artifacts

Usage Examples:

# Common imports
from kerb.parsing import parse_json, extract_code_blocks

# Submodule imports for specialized use
from kerb.parsing.json import fix_json
from kerb.parsing.pydantic import parse_to_pydantic
from kerb.parsing.validation import validate_output

For splitting by delimiter, use: text.split(delimiter)

For text truncation, use the preprocessing module:

from kerb.preprocessing import truncate_text

class kerb.parsing.ParseMode(*values)

Bases: Enum

Parsing mode for extracting structured data.

STRICT = 'strict'

LENIENT = 'lenient'

BEST_EFFORT = 'best_effort'

class kerb.parsing.ValidationLevel(*values)

Bases: Enum

Validation strictness level.

NONE = 'none'

BASIC = 'basic'

SCHEMA = 'schema'

STRICT = 'strict'

class kerb.parsing.ParseResult(success, data=None, error=None, fixed=False, original=None, warnings=<factory>)

Bases: object

Result from parsing operation.

success: bool

data: Any = None

error: str | None = None

fixed: bool = False

original: str | None = None

warnings: List[str]

__init__(success, data=None, error=None, fixed=False, original=None, warnings=<factory>)
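The `warnings=<factory>` default in the signature suggests a dataclass field built with a default factory. The sketch below assumes that layout and mirrors the documented fields in a stand-in class; `ParseResultSketch` and `describe` are hypothetical names for illustration, not part of kerb.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ParseResultSketch:
    """Hypothetical stand-in mirroring the documented ParseResult fields."""
    success: bool
    data: Any = None
    error: Optional[str] = None
    fixed: bool = False
    original: Optional[str] = None
    # A default factory gives every instance its own empty list.
    warnings: List[str] = field(default_factory=list)


def describe(result: ParseResultSketch) -> str:
    """Check `success` before reading `data`."""
    if not result.success:
        return f"failed: {result.error}"
    suffix = " (after fixes)" if result.fixed else ""
    return f"ok{suffix}: {result.data!r}"


ok = ParseResultSketch(success=True, data={"a": 1}, fixed=True)
bad = ParseResultSketch(success=False, error="no JSON found")
```

A caller would typically branch on `success` first, then consult `fixed` and `warnings` to decide whether repaired data is acceptable.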

class kerb.parsing.ValidationResult(valid, errors=<factory>, warnings=<factory>, data=None)

Bases: object

Result from validation operation.

valid: bool

errors: List[str]

warnings: List[str]

data: Any = None

__init__(valid, errors=<factory>, warnings=<factory>, data=None)

kerb.parsing.extract_json(text, mode=ParseMode.LENIENT)

Extract JSON from text that may contain additional content.

This function intelligently extracts JSON objects or arrays from LLM outputs that may include markdown formatting, explanatory text, or other artifacts.

Parameters:

text (str) – Text containing JSON (may have markdown, explanations, etc.)
mode (ParseMode) – Parsing mode - strict, lenient, or best_effort

Returns:

Parsed JSON data and metadata

Return type:

ParseResult

Examples

>>> extract_json('Here is the data: {"name": "John", "age": 30}')
ParseResult(success=True, data={'name': 'John', 'age': 30}, ...)

>>> extract_json('```json\n{"key": "value"}\n```')
ParseResult(success=True, data={'key': 'value'}, ...)
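One way such extraction could work is sketched below with the standard library only. This is not kerb's implementation: `find_json` is a hypothetical helper that returns the parsed value directly rather than a ParseResult, and it ignores the parsing mode.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example
FENCED = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def find_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array found in text, else None."""
    match = FENCED.search(text)
    if match:
        text = match.group(1)  # prefer the body of a fenced block
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                # raw_decode parses one value starting at index and
                # tolerates trailing text after it.
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; try the next bracket
    return None
```

Scanning for the first bracket that decodes cleanly lets the surrounding prose be ignored without a hand-written bracket matcher.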

kerb.parsing.parse_json(text, mode=ParseMode.LENIENT)

Parse JSON with automatic fixing for common LLM output issues.

Parameters:

text (str) – JSON text to parse
mode (ParseMode) – Parsing mode - strict, lenient, or best_effort

Returns:

Parsed JSON data and metadata

Return type:

ParseResult

kerb.parsing.fix_json(text)

Attempt to fix common JSON formatting issues in LLM outputs.

Common fixes:

- Remove trailing commas
- Fix single quotes to double quotes
- Remove comments
- Fix missing/extra brackets
- Handle truncated JSON

Parameters:

text (str) – Potentially malformed JSON text

Returns:

Fixed and parsed JSON if successful

Return type:

ParseResult
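Two of the listed repairs, trailing commas and truncated JSON, can be sketched as follows. This is an illustrative stand-in rather than the library's code: `repair_json` is a hypothetical name, it raises on failure instead of returning a ParseResult, and it does not attempt quote or comment fixes.

```python
import json
import re
from typing import Any


def repair_json(text: str) -> Any:
    """Close unclosed brackets and drop trailing commas, then parse."""
    text = text.strip()
    closers = []  # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for char in text:
        if in_string:
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char in "{[":
            closers.append("}" if char == "{" else "]")
        elif char in "}]" and closers:
            closers.pop()
    # Append the missing closers, innermost first.
    text += "".join(reversed(closers))
    # Drop commas that directly precede a closing bracket. Naive: this would
    # also alter a ",}" sequence inside a string value.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```

Closing brackets before removing commas means a payload cut off right after a comma, such as `[1, 2,`, is still recoverable.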

kerb.parsing.extract_json_array(text, mode=ParseMode.LENIENT)

Extract a JSON array from text.

Parameters:

text (str) – Text containing JSON array
mode (ParseMode) – Parsing mode

Returns:

Parsed JSON array

Return type:

ParseResult

kerb.parsing.extract_json_object(text, mode=ParseMode.LENIENT)

Extract a JSON object from text.

Parameters:

text (str) – Text containing JSON object
mode (ParseMode) – Parsing mode

Returns:

Parsed JSON object

Return type:

ParseResult

kerb.parsing.ensure_json_output(text, default=None)

Extract JSON from text, returning default if parsing fails.

Parameters:

text (str) – Text containing JSON
default (Any) – Default value if parsing fails

Returns:

Parsed JSON or default value

Return type:

Any

kerb.parsing.ensure_list_output(text, default=None)

Extract JSON array from text, returning default if parsing fails.

Parameters:

text (str) – Text containing JSON array
default (Optional[List]) – Default value if parsing fails

Returns:

Parsed list or default value

Return type:

List

kerb.parsing.ensure_dict_output(text, default=None)

Extract JSON object from text, returning default if parsing fails.

Parameters:

text (str) – Text containing JSON object
default (Optional[Dict]) – Default value if parsing fails

Returns:

Parsed dict or default value

Return type:

Dict
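The three ensure_* helpers share one pattern: try to parse, and fall back to a caller-supplied default instead of raising. A minimal sketch of that pattern is shown here with hypothetical names (`json_or_default`, `list_or_default`, `dict_or_default`); it uses plain `json.loads` and so, unlike the documented functions, does not extract JSON from surrounding text.

```python
import json
from typing import Any, Dict, List, Optional


def json_or_default(text: str, default: Any = None) -> Any:
    """Return parsed JSON, or `default` when text is not valid JSON."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return default


def list_or_default(text: str, default: Optional[List] = None) -> Optional[List]:
    """Like json_or_default, but also falls back when the value is not a list."""
    value = json_or_default(text)
    return value if isinstance(value, list) else default


def dict_or_default(text: str, default: Optional[Dict] = None) -> Optional[Dict]:
    """Like json_or_default, but also falls back when the value is not a dict."""
    value = json_or_default(text)
    return value if isinstance(value, dict) else default
```

This style suits pipelines where a missing or malformed answer should degrade to an empty container rather than stop processing.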

kerb.parsing.validate_json_schema(data, schema)

Validate data against a JSON Schema.

Parameters:

data (Any) – Data to validate (typically a dict or list)
schema (Dict[str, Any]) – JSON Schema definition

Returns:

Validation result with any errors

Return type:

ValidationResult

Examples

>>> schema = {"type": "object", "properties": {"name": {"type": "string"}}}
>>> validate_json_schema({"name": "John"}, schema)
ValidationResult(valid=True, errors=[], warnings=[], data={'name': 'John'})
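To illustrate what schema validation collects, the sketch below checks a small subset of JSON Schema (`type`, `required`, `properties`) and returns a list of error strings. It is a hand-rolled illustration under those assumptions, not the validator kerb uses; `check_schema` and `TYPE_MAP` are hypothetical names.

```python
from typing import Any, Dict, List

# JSON Schema type names mapped to Python types (a subset, for illustration).
TYPE_MAP = {
    "object": dict, "array": list, "string": str,
    "integer": int, "number": (int, float), "boolean": bool,
}


def check_schema(data: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of error messages; an empty list means the data passed."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected in TYPE_MAP:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(data, bool) and expected in ("integer", "number")
        if not isinstance(data, TYPE_MAP[expected]) or bool_as_number:
            errors.append(f"{path}: expected {expected}, got {type(data).__name__}")
            return errors
    if isinstance(data, dict):
        for key in schema.get("required", []):
            if key not in data:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in data:
                errors.extend(check_schema(data[key], subschema, f"{path}.{key}"))
    return errors
```

Collecting every error with its path, rather than stopping at the first, matches the shape of a result that carries an `errors` list.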

kerb.parsing.parse_to_pydantic(text, model_class, mode=ParseMode.LENIENT)

Parse text to a Pydantic model instance.

Parameters:

text (str) – Text containing JSON data
model_class (Type) – Pydantic model class to parse into
mode (ParseMode) – Parsing mode

Returns:

Parsed Pydantic model instance

Return type:

ParseResult

Examples

>>> from pydantic import BaseModel
>>> class User(BaseModel):
...     name: str
...     age: int
>>> parse_to_pydantic('{"name": "John", "age": 30}', User)
ParseResult(success=True, data=User(name='John', age=30), ...)

kerb.parsing.pydantic_to_schema(model_class)

Convert a Pydantic model to JSON Schema.

Parameters:

model_class (Type) – Pydantic model class

Returns:

JSON Schema representation

Return type:

Dict[str, Any]

kerb.parsing.validate_pydantic(data, model_class)

Validate data against a Pydantic model.

Parameters:

data (Dict[str, Any]) – Data to validate (typically a dict)
model_class (Type) – Pydantic model class

Returns:

Validation result with any errors

Return type:

ValidationResult

kerb.parsing.pydantic_to_function(model_class, name=None, description=None)

Convert a Pydantic model to a function calling definition.

Parameters:

model_class (Type) – Pydantic model class
name (Optional[str]) – Function name (defaults to model class name)
description (Optional[str]) – Function description (defaults to model docstring)

Returns:

Function calling definition

Return type:

Dict[str, Any]

kerb.parsing.format_function_call(name, description, parameters, required=None)

Format a function definition for LLM function calling.

Parameters:

name (str) – Function name
description (str) – Function description
parameters (Dict[str, Any]) – Parameter schema (JSON Schema format)
required (Optional[List[str]]) – List of required parameter names

Returns:

Formatted function definition

Return type:

Dict[str, Any]

Examples

>>> format_function_call(
...     name="get_weather",
...     description="Get weather for a location",
...     parameters={"location": {"type": "string"}},
...     required=["location"]
... )

kerb.parsing.format_tool_call(name, description, parameters, required=None)

Format a tool definition for LLM tool use (OpenAI format).

Parameters:

name (str) – Tool name
description (str) – Tool description
parameters (Dict[str, Any]) – Parameter schema (JSON Schema format)
required (Optional[List[str]]) – List of required parameter names

Returns:

Formatted tool definition

Return type:

Dict[str, Any]
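This reference does not show the dictionaries these two formatters return. For orientation, the sketch below builds the shapes commonly used for OpenAI-style function and tool definitions, where a tool wraps a function definition under a `"type": "function"` entry. The helper names are hypothetical, and kerb's actual output may differ in detail.

```python
from typing import Any, Dict, List, Optional


def build_function_definition(
    name: str,
    description: str,
    parameters: Dict[str, Any],
    required: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble a function definition with a JSON Schema parameter object."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": parameters,
            "required": list(required or []),
        },
    }


def build_tool_definition(
    name: str,
    description: str,
    parameters: Dict[str, Any],
    required: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Wrap a function definition in the tool envelope."""
    return {
        "type": "function",
        "function": build_function_definition(name, description, parameters, required),
    }


weather_tool = build_tool_definition(
    name="get_weather",
    description="Get weather for a location",
    parameters={"location": {"type": "string"}},
    required=["location"],
)
```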

kerb.parsing.parse_function_call(text, mode=ParseMode.LENIENT)

Parse a function call from LLM output.

Extracts function name and arguments from various formats:

- JSON format: {"name": "func", "arguments": {…}}
- Plain format: func(arg1=val1, arg2=val2)
- Markdown format with code blocks

Parameters:

text (str) – Text containing function call
mode (ParseMode) – Parsing mode

Returns:

Parsed function call with name and arguments

Return type:

ParseResult
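The JSON and plain formats listed above could be handled along these lines. This is a sketch under stated assumptions, not the library's parser: `parse_call` is a hypothetical name, it returns a plain dict or None instead of a ParseResult, it ignores positional arguments, and it skips the markdown case.

```python
import ast
import json
from typing import Any, Dict, Optional


def parse_call(text: str) -> Optional[Dict[str, Any]]:
    """Parse a JSON-style or keyword-style call into name and arguments."""
    text = text.strip()
    # JSON format: an object with "name" and optional "arguments" keys.
    try:
        payload = json.loads(text)
        if isinstance(payload, dict) and "name" in payload:
            return {"name": payload["name"], "arguments": payload.get("arguments", {})}
    except json.JSONDecodeError:
        pass
    # Plain format: func(arg1=val1, arg2=val2), read with Python's own parser.
    try:
        node = ast.parse(text, mode="eval").body
    except SyntaxError:
        return None
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        return None
    try:
        # literal_eval accepts only literals, so no code is executed.
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords if kw.arg}
    except ValueError:
        return None  # an argument was not a plain literal
    return {"name": node.func.id, "arguments": arguments}
```

Using `ast` rather than a regex keeps nested commas and quoted strings inside argument values from confusing the split.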

kerb.parsing.format_function_result(result, name=None)

Format a function result for returning to the LLM.

Parameters:

result (Any) – Function execution result
name (Optional[str]) – Function name

Returns:

Formatted function result

Return type:

Dict[str, Any]

kerb.parsing.extract_code_blocks(text, language=None)

Extract code blocks from markdown text.

Parameters:

text (str) – Markdown text containing code blocks
language (Optional[str]) – Filter by language (e.g., ‘python’, ‘json’)

Returns:

List of code blocks with ‘language’ and ‘code’ keys

Return type:

List[Dict[str, str]]

Examples

>>> extract_code_blocks('```python\nprint("hello")\n```')
[{'language': 'python', 'code': 'print("hello")'}]
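A regular expression is one plausible way to find fenced blocks. The sketch below is illustrative rather than kerb's implementation; `find_code_blocks` is a hypothetical name, and it assumes triple-backtick fences with the language tag on the opening line.

```python
import re
from typing import Dict, List, Optional

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example
BLOCK = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def find_code_blocks(text: str, language: Optional[str] = None) -> List[Dict[str, str]]:
    """Return fenced blocks as dicts with 'language' and 'code' keys."""
    blocks = [
        {"language": lang, "code": code.strip()}
        for lang, code in BLOCK.findall(text)
    ]
    if language is not None:
        blocks = [block for block in blocks if block["language"] == language]
    return blocks


sample = (
    FENCE + "python\nx = 1\n" + FENCE
    + "\nSome prose.\n"
    + FENCE + "json\n{}\n" + FENCE
)
```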

kerb.parsing.extract_xml_tag(text, tag)

Extract content from XML-style tags.

Parameters:

text (str) – Text containing XML tags
tag (str) – Tag name to extract (without < >)

Returns:

List of tag contents

Return type:

List[str]

Examples

>>> extract_xml_tag('<answer>42</answer>', 'answer')
['42']
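For LLM output, which is rarely well-formed XML, a tolerant regex is often enough. The following sketch assumes non-nested tags and is not taken from kerb; `find_tag_contents` is a hypothetical name.

```python
import re
from typing import List


def find_tag_contents(text: str, tag: str) -> List[str]:
    """Return the stripped contents of every <tag>...</tag> pair in text."""
    name = re.escape(tag)
    # Allow optional attributes after the tag name; DOTALL lets content span lines.
    pattern = re.compile(rf"<{name}(?:\s[^>]*)?>(.*?)</{name}>", re.DOTALL)
    return [content.strip() for content in pattern.findall(text)]
```

Escaping the tag name keeps characters such as `.` or `-` in a tag from being read as regex syntax.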

kerb.parsing.extract_markdown_sections(text, heading_level=2)

Extract sections from markdown by heading level.

Parameters:

text (str) – Markdown text
heading_level (int) – Heading level to split on (1-6)

Returns:

Mapping of heading names to section content

Return type:

Dict[str, str]
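Splitting on a heading level can be sketched as below, assuming ATX-style headings (`#` markers at the start of a line). `split_sections` is a hypothetical name, and deeper headings are kept inside their parent section, which may or may not match kerb's behavior.

```python
import re
from typing import Dict


def split_sections(text: str, heading_level: int = 2) -> Dict[str, str]:
    """Map each heading at the given level to the text that follows it."""
    marker = "#" * heading_level
    # (?!#) rejects deeper headings such as ### when the level is 2.
    heading = re.compile(rf"^{marker}(?!#)[ \t]+(.+?)[ \t]*$", re.MULTILINE)
    matches = list(heading.finditer(text))
    sections: Dict[str, str] = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        sections[current.group(1)] = text[current.end():end].strip()
    return sections


doc = "# Title\n\n## Setup\nInstall it.\n\n## Usage\nRun it.\n### Detail\nMore.\n"
```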

kerb.parsing.extract_list_items(text, ordered=False)

Extract list items from markdown text.

Parameters:

text (str) – Markdown text
ordered (bool) – Extract ordered lists (1. 2. 3.) vs unordered (- * +)

Returns:

List items

Return type:

List[str]
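The ordered/unordered distinction maps naturally onto two line patterns. This is a minimal sketch with hypothetical names (`find_list_items`, `ORDERED`, `UNORDERED`); it treats every matching line as an item and does not track nesting.

```python
import re
from typing import List

# Unordered items start with -, * or + followed by whitespace.
UNORDERED = re.compile(r"^[ \t]*[-*+][ \t]+(.+?)[ \t]*$", re.MULTILINE)
# Ordered items start with digits followed by . or ) and whitespace.
ORDERED = re.compile(r"^[ \t]*\d+[.)][ \t]+(.+?)[ \t]*$", re.MULTILINE)


def find_list_items(text: str, ordered: bool = False) -> List[str]:
    """Return the text of each list item of the requested kind."""
    pattern = ORDERED if ordered else UNORDERED
    return pattern.findall(text)


notes = "Steps:\n1. Install\n2. Run\n\nNotes:\n- fast\n* small\n**not an item**\n"
```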

kerb.parsing.parse_markdown_table(text)

Parse a markdown table into a list of dictionaries.

Parameters:

text (str) – Markdown table text

Returns:

List of rows as dictionaries

Return type:

List[Dict[str, str]]

Examples

>>> table = '''
... | Name | Age |
... |------|-----|
... | John | 30  |
... | Jane | 25  |
... '''
>>> parse_markdown_table(table)
[{'Name': 'John', 'Age': '30'}, {'Name': 'Jane', 'Age': '25'}]
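The row-to-dictionary mapping in the example can be reproduced with a few string operations. The sketch below is an illustration, not the library's parser; `table_to_dicts` is a hypothetical name, and it assumes pipe-delimited rows with the header first and no escaped pipes inside cells.

```python
from typing import Dict, List


def table_to_dicts(text: str) -> List[Dict[str, str]]:
    """Turn a pipe-delimited markdown table into one dict per body row."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, whose cells contain only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


people = """
| Name | Age |
|------|-----|
| John | 30  |
| Jane | 25  |
"""
```

All values stay strings, as in the documented output; converting `'30'` to an integer is left to the caller.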

kerb.parsing.validate_output(text, output_type, schema=None, model_class=None, custom_validator=None)

Validate LLM output against expected format.

Parameters:

text (str) – LLM output text
output_type (str) – Expected type (‘json’, ‘json_array’, ‘json_object’, ‘pydantic’, ‘code’, etc.)
schema (Optional[Dict[str, Any]]) – JSON Schema for validation
model_class (Optional[Type]) – Pydantic model class for validation
custom_validator (Optional[Callable[[Any], bool]]) – Custom validation function

Returns:

Validation result with errors/warnings

Return type:

ValidationResult

kerb.parsing.retry_parse_with_fixes(text, parser_func, max_attempts=3)

Retry parsing with increasingly aggressive fixes.

Parameters:

text (str) – Text to parse
parser_func (Callable[[str], ParseResult]) – Parser function to use
max_attempts (int) – Maximum retry attempts

Returns:

Final parse result

Return type:

ParseResult
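The specific fixes applied on each attempt are not listed in this reference. The sketch below shows the general idea of progressive repair with a hypothetical ladder of fixes (`FIXES`) and a hypothetical driver (`retry_with_fixes`). Unlike the documented function, the parser here signals failure by raising ValueError rather than returning a ParseResult.

```python
import json
import re
from typing import Any, Callable, List, Tuple

# Hypothetical fix steps, ordered from least to most invasive.
FIXES: List[Callable[[str], str]] = [
    lambda s: s.strip(),
    lambda s: re.sub(r",\s*([}\]])", r"\1", s),  # drop trailing commas
    # Keep only the span from the first "{" to the last "}".
    lambda s: s[s.find("{"): s.rfind("}") + 1] if "{" in s and "}" in s else s,
]


def retry_with_fixes(
    text: str, parser: Callable[[str], Any], max_attempts: int = 3
) -> Tuple[Any, int]:
    """Apply one more fix before each attempt; return (value, attempts used)."""
    candidate = text
    last_error = None
    for attempt, fix in enumerate(FIXES[:max_attempts], start=1):
        candidate = fix(candidate)  # fixes accumulate across attempts
        try:
            return parser(candidate), attempt
        except ValueError as error:  # json.JSONDecodeError is a ValueError
            last_error = error
    raise ValueError(f"all attempts failed: {last_error}")
```

Ordering fixes from mild to aggressive means well-formed input is returned after the first attempt, and destructive edits are reached only when the gentler ones fail.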

kerb.parsing.clean_llm_output(text)

Clean common artifacts from LLM outputs.

Removes:

- Markdown code blocks
- Leading/trailing whitespace
- Common prefixes like “Here is…” or “Sure, here’s…”

Parameters:

text (str) – Raw LLM output

Returns:

Cleaned text

Return type:

str
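The three removals could be approximated as follows. This is a sketch under assumptions: `clean_output` and both patterns are hypothetical, the prefix pattern covers only "Here is/are/here's ...:" lead-ins ending in a colon, and only a fence that wraps the whole remaining text is unwrapped.

```python
import re

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example
# A single fenced block spanning the entire text.
FENCED = re.compile(
    r"^" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)
# Lead-ins such as "Here is the result:" or "Sure, here's the JSON:".
PREFIX = re.compile(
    r"^(?:sure[,!.]?\s*)?here(?:['’]s| is| are)\b[^\n:]*:\s*", re.IGNORECASE
)


def clean_output(text: str) -> str:
    """Strip whitespace, a conversational lead-in, and a wrapping code fence."""
    text = text.strip()
    text = PREFIX.sub("", text, count=1)
    match = FENCED.match(text)
    if match:
        text = match.group(1)
    return text.strip()


raw = "Sure, here's the JSON:\n" + FENCE + 'json\n{"a": 1}\n' + FENCE + "\n"
```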
