Document Module

Document loading and processing utilities for LLM applications.

This module provides tools for loading, cleaning, preprocessing, and extracting metadata from documents in formats such as PDF, DOCX, HTML, Markdown, CSV, JSON, and XML.

Common Usage:

# Load documents (top-level)
from kerb.document import load_document, Document

# Specialized loaders (submodule)
from kerb.document.loaders import load_text, load_markdown

# Utilities (submodule)
from kerb.document.utils import detect_format, load_directory

# Text processing (submodules)
from kerb.document.extractors import extract_text_from_html
from kerb.document.cleaners import clean_text
from kerb.document.preprocessors import preprocess_pdf_text
from kerb.document.metadata import extract_metadata

Document Loading:

load_document() - Load any supported document (auto-detects format)

Submodules:

loaders - Format-specific document loaders (PDF, DOCX, HTML, etc.)
utils - Utilities for format detection, batch loading, and merging
extractors - Text extraction from various formats
cleaners - Text cleaning and normalization
preprocessors - Format-specific preprocessing
metadata - Metadata and entity extraction

Data Classes:

Document - Document with content and metadata (from kerb.core.types)
DocumentFormat - Enum of supported formats (from kerb.core.types)

class kerb.document.Document(content, metadata=<factory>, id=None, source=None, format=DocumentFormat.UNKNOWN, score=0.0, page_content=None)[source]

Bases: object

Universal document representation across the toolkit.

Consolidates the Document classes from document/ and retrieval/ packages to provide a single, consistent document representation.

content: The text content of the document

metadata: Additional metadata about the document

id: Optional unique identifier for the document

source: Optional source path or URL where document was loaded from

format: Document format (defaults to UNKNOWN)

score: Relevance score (used in retrieval contexts, defaults to 0.0)

page_content: Optional list of content per page (for multi-page documents)

Examples

>>> # Simple document
>>> doc = Document(content="Hello, world!")

>>> # Document with metadata
>>> doc = Document(
...     content="Important document",
...     metadata={"author": "John", "created": "2025-01-01"},
...     source="doc.txt"
... )

>>> # Retrieval result with score
>>> doc = Document(
...     id="doc_123",
...     content="Relevant content",
...     score=0.95
... )

content: str

metadata: Dict[str, Any]

id: str | None = None

source: str | None = None

format: DocumentFormat = 'unknown'

score: float = 0.0

page_content: List[str] | None = None

__len__()[source]

Return the length of the document content.

Return type: int

to_dict()[source]

Convert document to dictionary.

Return type: Dict[str, Any]
Returns: Dictionary representation of the document

classmethod from_dict(data)[source]

Create document from dictionary.

Parameters: data (Dict[str, Any]) – Dictionary with document data
Return type: Document
Returns: New Document instance

__repr__()[source]

String representation of the document.

Return type: str

__init__(content, metadata=<factory>, id=None, source=None, format=DocumentFormat.UNKNOWN, score=0.0, page_content=None)
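
The round trip between to_dict() and from_dict() can be illustrated with a small stand-in dataclass. This is only a sketch, not kerb's implementation: SketchDocument is a hypothetical name, it models a subset of the fields listed above, and the dictionary keys produced by the real class may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in mirroring a few of the Document fields."""

    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None
    source: Optional[str] = None
    score: float = 0.0

    def __len__(self) -> int:
        # The length of a document is the length of its text content.
        return len(self.content)

    def to_dict(self) -> Dict[str, Any]:
        # asdict() copies every dataclass field into a plain dictionary.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchDocument":
        # Assumes the dictionary keys match the field names exactly.
        return cls(**data)


doc = SketchDocument(content="Relevant content", id="doc_123", score=0.95)
restored = SketchDocument.from_dict(doc.to_dict())
```

Using a factory for metadata, as the signature above indicates, gives each instance its own dictionary rather than a shared mutable default.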

class kerb.document.DocumentFormat(*values)[source]

Bases: Enum

Supported document formats.

PDF = 'pdf'

DOCX = 'docx'

DOC = 'doc'

HTML = 'html'

MARKDOWN = 'markdown'

TXT = 'txt'

CSV = 'csv'

JSON = 'json'

XML = 'xml'

RTF = 'rtf'

ODT = 'odt'

EPUB = 'epub'

UNKNOWN = 'unknown'

kerb.document.load_document(file_path, **kwargs)[source]

Load a document from file, automatically detecting format.

This is the main entry point for loading documents. It detects the format and delegates to the appropriate loader.

Parameters:

file_path (str) – Path to the document file
**kwargs – Additional arguments passed to format-specific loaders

Returns:

Loaded document with content and metadata

Return type:

Document

Raises:

FileNotFoundError – If file doesn’t exist
ValueError – If format is not supported

Examples

>>> doc = load_document("report.pdf")
>>> print(doc.content[:100])

>>> doc = load_document("data.csv", parse_as_dict=True)
>>> print(doc.metadata['rows'])

kerb.document.load_text(file_path, encoding='utf-8')[source]

Load a plain text file.

Parameters:

file_path (str) – Path to text file
encoding (str) – Text encoding. Defaults to 'utf-8'.

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_text("notes.txt")
>>> print(doc.content)

kerb.document.load_markdown(file_path, extract_frontmatter=True)[source]

Load a Markdown file.

Parameters:

file_path (str) – Path to markdown file
extract_frontmatter (bool) – Extract YAML frontmatter if present

Returns:

Loaded document with frontmatter in metadata

Return type:

Document

Examples

>>> doc = load_markdown("README.md")
>>> if 'frontmatter' in doc.metadata:
...     print(doc.metadata['frontmatter'])
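
To make the frontmatter step concrete, the sketch below separates a leading block delimited by --- lines from the Markdown body using only the standard library. It is an illustration under assumptions: sketch_split_frontmatter is a hypothetical helper, it returns the raw frontmatter text without parsing the YAML, and kerb's own loader may behave differently.

```python
from typing import Optional, Tuple


def sketch_split_frontmatter(text: str) -> Tuple[Optional[str], str]:
    """Return (raw_frontmatter, body); raw_frontmatter is None if absent."""
    lines = text.splitlines()
    # Frontmatter must start on the very first line.
    if not lines or lines[0].strip() != "---":
        return None, text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return frontmatter, body
    # No closing delimiter was found: treat the whole text as body.
    return None, text


frontmatter, body = sketch_split_frontmatter("---\ntitle: Demo\n---\n# Heading\n")
```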

kerb.document.load_json(file_path, as_string=False)[source]

Load a JSON file.

Parameters:

file_path (str) – Path to JSON file
as_string (bool) – If True, return formatted JSON as string content. If False, store parsed object in metadata.

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_json("data.json", as_string=True)
>>> print(doc.content)

>>> doc = load_json("config.json")
>>> print(doc.metadata['json_data'])

kerb.document.load_csv(file_path, parse_as_dict=True, encoding='utf-8')[source]

Load a CSV file.

Parameters:

file_path (str) – Path to CSV file
parse_as_dict (bool) – Parse CSV and store structured data in metadata
encoding (str) – Text encoding

Returns:

Loaded document with CSV data in metadata

Return type:

Document

Examples

>>> doc = load_csv("data.csv")
>>> rows = doc.metadata['rows']
>>> headers = doc.metadata['headers']
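
For readers unfamiliar with dictionary-style CSV parsing, the following standard-library sketch shows one plausible shape for the rows and headers values. It is an assumption that kerb stores rows as one dictionary per record; the sample data is made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for the contents of "data.csv".
sample = "name,age\nAda,36\nAlan,41\n"

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)          # one dict per record, keyed by column header
headers = reader.fieldnames  # column names taken from the first line
```

Note that csv.DictReader keeps every value as a string, so numeric columns need explicit conversion.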

kerb.document.load_xml(file_path, encoding='utf-8')[source]

Load an XML file.

Parameters:

file_path (str) – Path to XML file
encoding (str) – Text encoding

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_xml("data.xml")
>>> print(doc.content)

kerb.document.load_html(file_path, extract_text=True, encoding='utf-8')[source]

Load an HTML file.

Parameters:

file_path (str) – Path to HTML file
extract_text (bool) – If True, extract plain text from HTML
encoding (str) – Text encoding

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_html("page.html", extract_text=True)
>>> print(doc.content)  # Plain text without HTML tags

kerb.document.load_pdf(file_path, extract_images=False)[source]

Load a PDF file.

Requires: pypdf or PyPDF2 package

Parameters:

file_path (str) – Path to PDF file
extract_images (bool) – Whether to extract image information

Returns:

Loaded document with page-by-page content

Return type:

Document

Examples

>>> doc = load_pdf("report.pdf")
>>> print(f"Pages: {doc.metadata['num_pages']}")
>>> print(doc.content)  # All pages concatenated

kerb.document.load_docx(file_path)[source]

Load a DOCX file.

Requires: python-docx package

Parameters: file_path (str) – Path to DOCX file
Returns: Loaded document
Return type: Document

Examples

>>> doc = load_docx("report.docx")
>>> print(doc.content)

kerb.document.detect_format(file_path)[source]

Detect document format from file extension.

Parameters: file_path (str) – Path to the file
Returns: Detected format enum
Return type: DocumentFormat

Examples

>>> detect_format("document.pdf")
DocumentFormat.PDF

>>> detect_format("notes.md")
DocumentFormat.MARKDOWN
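
Extension-based detection of this kind can be sketched as a lookup table. The names SketchFormat, EXTENSION_MAP, and sketch_detect_format are hypothetical, and the set of extensions kerb actually recognizes is an assumption here.

```python
from enum import Enum
from pathlib import Path


class SketchFormat(Enum):
    """Hypothetical subset of the DocumentFormat values listed earlier."""

    PDF = "pdf"
    MARKDOWN = "markdown"
    TXT = "txt"
    UNKNOWN = "unknown"


# Assumed extension table; the real mapping may contain more entries.
EXTENSION_MAP = {
    ".pdf": SketchFormat.PDF,
    ".md": SketchFormat.MARKDOWN,
    ".markdown": SketchFormat.MARKDOWN,
    ".txt": SketchFormat.TXT,
}


def sketch_detect_format(file_path: str) -> SketchFormat:
    # Compare extensions case-insensitively so "REPORT.PDF" still matches.
    suffix = Path(file_path).suffix.lower()
    return EXTENSION_MAP.get(suffix, SketchFormat.UNKNOWN)
```

Because only the extension is inspected, a mislabeled file (for example, HTML saved as .txt) would be classified by its name rather than its contents.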

kerb.document.is_supported_format(file_path)[source]

Check if file format is supported.

Parameters: file_path (str) – Path to the file
Returns: True if format is supported
Return type: bool

kerb.document.load_directory(directory_path, pattern='*', recursive=False, max_files=None)[source]

Load all supported documents from a directory.

Parameters:

directory_path (str) – Path to directory
pattern (str) – Glob pattern for file names to match (e.g., "*.pdf", "*.txt")
recursive (bool) – Search subdirectories
max_files (Optional[int]) – Maximum number of files to load

Returns:

List of loaded documents

Return type:

List[Document]

Examples

>>> docs = load_directory("./documents", pattern="*.pdf")
>>> print(f"Loaded {len(docs)} documents")

>>> docs = load_directory("./data", recursive=True, max_files=100)
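
How pattern, recursive, and max_files interact can be sketched with pathlib. This only lists matching files and does not load them; sketch_list_files is a hypothetical helper, and the sorted ordering is an assumption, not documented kerb behavior.

```python
from pathlib import Path
from typing import List, Optional


def sketch_list_files(
    directory_path: str,
    pattern: str = "*",
    recursive: bool = False,
    max_files: Optional[int] = None,
) -> List[Path]:
    root = Path(directory_path)
    # rglob() descends into subdirectories; glob() stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    # Keep regular files only and sort for a deterministic order.
    files = sorted(p for p in matches if p.is_file())
    return files if max_files is None else files[:max_files]
```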

kerb.document.load_from_url(url, timeout=30, max_size_mb=100, max_retries=3, **kwargs)[source]

Load document from a URL.

Requires: requests package

Parameters:

url (str) – URL to fetch document from
timeout (int) – Request timeout in seconds. Defaults to 30.
max_size_mb (float) – Maximum file size in MB. Defaults to 100.
max_retries (int) – Maximum number of retry attempts. Defaults to 3.
**kwargs – Additional arguments for requests.get()

Returns:

Loaded document

Return type:

Document

Raises:

ValueError – If content exceeds max_size_mb
requests.exceptions.Timeout – If request times out
requests.exceptions.HTTPError – If HTTP error occurs

Examples

>>> doc = load_from_url("https://example.com/document.pdf")
>>> print(doc.content)

>>> # Custom timeout and size limit
>>> doc = load_from_url("https://example.com/large.pdf",
...                     timeout=60, max_size_mb=200)

async kerb.document.load_from_url_async(url, timeout=30, max_size_mb=100, max_retries=3, **kwargs)[source]

Load document from a URL asynchronously.

Requires: aiohttp package

Parameters:

url (str) – URL to fetch document from
timeout (int) – Request timeout in seconds. Defaults to 30.
max_size_mb (float) – Maximum file size in MB. Defaults to 100.
max_retries (int) – Maximum number of retry attempts. Defaults to 3.
**kwargs – Additional arguments for aiohttp.ClientSession.get()

Returns:

Loaded document

Return type:

Document

Raises:

ValueError – If content exceeds max_size_mb
asyncio.TimeoutError – If request times out
aiohttp.ClientError – If HTTP error occurs

Examples

>>> import asyncio
>>> doc = asyncio.run(load_from_url_async("https://example.com/document.pdf"))
>>> print(doc.content)

kerb.document.merge_documents(documents, separator='\n\n---\n\n')[source]

Merge multiple documents into one.

Parameters:

documents (List[Document]) – Documents to merge
separator (str) – Separator between documents

Returns:

Merged document

Return type:

Document

Examples

>>> doc1 = Document(content="First doc", metadata={"id": 1})
>>> doc2 = Document(content="Second doc", metadata={"id": 2})
>>> merged = merge_documents([doc1, doc2])
>>> print(merged.content)

kerb.document.extract_text_from_html(html, remove_scripts=True)[source]

Extract plain text from HTML content.

Parameters:

html (str) – HTML content
remove_scripts (bool) – Remove script and style tags

Returns:

Extracted plain text

Return type:

str

Examples

>>> html = '<html><body><p>Hello World</p></body></html>'
>>> extract_text_from_html(html)
'Hello World'

kerb.document.strip_markdown(text)[source]

Remove Markdown formatting from text.

Parameters: text (str) – Markdown text
Returns: Plain text without Markdown formatting
Return type: str

Examples

>>> strip_markdown("# Hello **World**")
'Hello World'

kerb.document.split_into_sentences(text)[source]

Split text into sentences.

Parameters: text (str) – Text to split
Returns: List of sentences
Return type: List[str]

Examples

>>> split_into_sentences("Hello world. This is a test!")
['Hello world.', 'This is a test!']
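
A common simple approach to sentence splitting is to break on whitespace that follows terminal punctuation. The sketch below reproduces the example output above, but the regular expression is an assumption, not necessarily the rule kerb uses; sketch_split_sentences is a hypothetical name.

```python
import re
from typing import List


def sketch_split_sentences(text: str) -> List[str]:
    # Split on whitespace that is preceded by '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = sketch_split_sentences("Hello world. This is a test!")
```

A rule this simple also splits after abbreviations such as "Dr.", so text with many abbreviations may need a more careful tokenizer.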

kerb.document.split_into_paragraphs(text)[source]

Split text into paragraphs.

Parameters: text (str) – Text to split
Returns: List of paragraphs
Return type: List[str]

Examples

>>> split_into_paragraphs("Para 1\n\nPara 2\n\nPara 3")
['Para 1', 'Para 2', 'Para 3']

kerb.document.clean_text(text, normalize_whitespace=True, remove_urls=False, remove_emails=False, remove_special_chars=False, lowercase=False)[source]

Clean and normalize text.

Parameters:

text (str) – Text to clean
normalize_whitespace (bool) – Normalize whitespace to single spaces
remove_urls (bool) – Remove URLs
remove_emails (bool) – Remove email addresses
remove_special_chars (bool) – Remove special characters
lowercase (bool) – Convert to lowercase

Returns:

Cleaned text

Return type:

str

Examples

>>> text = "Check   out https://example.com  for more info!"
>>> clean_text(text, normalize_whitespace=True, remove_urls=True)
'Check out for more info!'
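
The order of the cleaning steps matters: removing a URL leaves a gap, so whitespace normalization should run afterwards. The following sketch covers only two of the options and uses an assumed URL pattern; sketch_clean is a hypothetical name, not the library's implementation.

```python
import re


def sketch_clean(text: str, remove_urls: bool = False) -> str:
    if remove_urls:
        # Assumed pattern: anything starting with http:// or https://
        # up to the next whitespace character.
        text = re.sub(r"https?://\S+", "", text)
    # Collapse every run of whitespace into one space, then trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = sketch_clean("Check   out https://example.com  for more info!", remove_urls=True)
```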

kerb.document.remove_extra_newlines(text, max_consecutive=2)[source]

Remove excessive newlines from text.

Parameters:

text (str) – Text to process
max_consecutive (int) – Maximum consecutive newlines to keep

Returns:

Text with limited newlines

Return type:

str

Examples

>>> remove_extra_newlines("Hello\n\n\n\nWorld", max_consecutive=2)
'Hello\n\nWorld'
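
Limiting newline runs is a one-line regular expression. This sketch (hypothetical name sketch_limit_newlines) reproduces the example above; how kerb treats lines containing only spaces is not covered and remains an open assumption.

```python
import re


def sketch_limit_newlines(text: str, max_consecutive: int = 2) -> str:
    # Match any run longer than the limit and replace it with exactly
    # max_consecutive newline characters.
    pattern = r"\n{%d,}" % (max_consecutive + 1)
    return re.sub(pattern, "\n" * max_consecutive, text)
```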

kerb.document.preprocess_pdf_text(text)[source]

Preprocess text extracted from PDF.

PDFs often have formatting artifacts like broken lines, extra spaces, etc.

Parameters: text (str) – Text extracted from PDF
Returns: Cleaned text
Return type: str

Examples

>>> pdf_text = "This is a sen-\ntence with line break."
>>> preprocess_pdf_text(pdf_text)
'This is a sentence with line break.'
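
The de-hyphenation shown above can be approximated with two substitutions. This is a sketch under assumptions, with the hypothetical name sketch_join_hyphenated; the actual function may handle more artifact types.

```python
import re


def sketch_join_hyphenated(text: str) -> str:
    # "sen-\ntence" -> "sentence": drop a hyphen that ends a line mid-word.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline inside a paragraph becomes a space;
    # blank-line paragraph breaks are left untouched.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


fixed = sketch_join_hyphenated("This is a sen-\ntence with line break.")
```

One known trade-off: a genuine compound such as "well-" followed by "known" on the next line is also joined, which loses its hyphen.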

kerb.document.preprocess_html_text(html)[source]

Preprocess HTML to extract clean text.

Parameters: html (str) – HTML content
Returns: Cleaned text
Return type: str

Examples

>>> html = '<div>Hello <span>World</span></div>'
>>> preprocess_html_text(html)
'Hello World'

kerb.document.preprocess_markdown(text, keep_structure=True)[source]

Preprocess Markdown text.

Parameters:

text (str) – Markdown text
keep_structure (bool) – Keep headings and structure markers

Returns:

Processed text

Return type:

str

Examples

>>> md = "# Title\n\nSome **bold** text"
>>> preprocess_markdown(md, keep_structure=False)
'Title\n\nSome bold text'

kerb.document.extract_metadata(file_path)[source]

Extract metadata from a file.

Parameters: file_path (str) – Path to file
Returns: Extracted metadata
Return type: Dict[str, Any]

Examples

>>> metadata = extract_metadata("document.pdf")
>>> print(metadata['size'], metadata['created'])

kerb.document.extract_document_stats(text)[source]

Extract statistics from document text.

Parameters: text (str) – Document text
Returns: Document statistics
Return type: Dict[str, int]

Examples

>>> stats = extract_document_stats("Hello world. This is a test.")
>>> print(stats['word_count'], stats['sentence_count'])

kerb.document.extract_urls(text)[source]

Extract URLs from text.

Parameters: text (str) – Text to extract URLs from
Returns: List of URLs
Return type: List[str]

Examples

>>> extract_urls("Visit https://example.com and www.test.com")
['https://example.com', 'www.test.com']

kerb.document.extract_emails(text)[source]

Extract email addresses from text.

Parameters: text (str) – Text to extract emails from
Returns: List of email addresses
Return type: List[str]

Examples

>>> extract_emails("Contact us at info@example.com or sales@test.org")
['info@example.com', 'sales@test.org']
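
Pattern-based extraction of this sort is typically a single findall call. The expression below is a deliberately simple assumption that matches the example above; it is not a full address validator and is not claimed to be the pattern kerb uses.

```python
import re
from typing import List

# Assumed pattern: local part, "@", domain labels, and a TLD of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sketch_extract_emails(text: str) -> List[str]:
    # findall() returns matches in the order they appear in the text.
    return EMAIL_PATTERN.findall(text)


emails = sketch_extract_emails("Contact us at info@example.com or sales@test.org")
```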

kerb.document.extract_dates(text)[source]

Extract dates from text (simple patterns).

Parameters: text (str) – Text to extract dates from
Returns: List of potential date strings
Return type: List[str]

Examples

>>> extract_dates("Meeting on 2024-01-15 and 01/20/2024")
['2024-01-15', '01/20/2024']

kerb.document.extract_phone_numbers(text)[source]

Extract phone numbers from text (US format).

Parameters: text (str) – Text to extract phone numbers from
Returns: List of phone numbers
Return type: List[str]

Examples

>>> extract_phone_numbers("Call (555) 123-4567 or 555-987-6543")
['(555) 123-4567', '555-987-6543']
